The Reasonable Effectiveness of the Multiplicative Weights Update Algorithm

Christos Papadimitriou, who studies multiplicative weights in the context of biology.

Hard to believe

Sanjeev Arora and his coauthors consider it “a basic tool [that should be] taught to all algorithms students together with divide-and-conquer, dynamic programming, and random sampling.” Christos Papadimitriou calls it “so hard to believe that it has been discovered five times and forgotten.” It has formed the basis of algorithms in machine learning, optimization, game theory, economics, biology, and more.

What mystical algorithm has such broad applications? Now that computer scientists have studied it in generality, it’s known as the Multiplicative Weights Update Algorithm (MWUA). Procedurally, the algorithm is simple. I can even describe the core idea in six lines of pseudocode. You start with a collection of $n$ objects, and each object has a weight.

Set all the object weights to be 1.
For some large number of rounds:
Pick an object at random proportionally to the weights
Some event happens
Increase the weight of the chosen object if it does well in the event
Otherwise decrease the weight

The name “multiplicative weights” comes from how we implement the last step: if the weight of the chosen object at step $t$ is $w_t$ before the event, and $G$ represents how well the object did in the event, then we’ll update the weight according to the rule:

$\displaystyle w_{t+1} = w_t (1 + G)$

Think of this as increasing the weight by a small multiple of the object’s performance on a given round.

Here is a simple example of how it might be used. You have some money you want to invest, and you have a bunch of financial experts who are telling you what to invest in every day. So each day you pick an expert, and you follow their advice, and you either make a thousand dollars, or you lose a thousand dollars, or something in between. Then you repeat, and your goal is to figure out which expert is the most reliable.

This is how we use multiplicative weights: if we number the experts $1, \dots, N$, we give each expert a weight $w_i$ which starts at 1. Then, each day we pick an expert at random (where experts with larger weights are more likely to be picked) and at the end of the day we have some gain or loss $G$. Then we update the weight of the chosen expert by multiplying it by $(1 + G / 1000)$. Sometimes you have enough information to update the weights of experts you didn’t choose, too. The theoretical guarantees of the algorithm say we’ll find the best expert quickly (“quickly” will be concrete later).

In fact, let’s play a game where you, dear reader, get to decide the rewards for each expert and each day. I programmed the multiplicative weights algorithm to react according to your choices. Click the image below to go to the demo.

This core mechanism of updating weights can be interpreted in many ways, and that’s part of the reason it has sprouted up all over mathematics and computer science. Just a few examples of where this has led:

1. In game theory, weights are the “belief” of a player about the strategy of an opponent. The most famous algorithm to use this is called Fictitious Play, and others include EXP3 for minimizing regret in the so-called “adversarial bandit learning” problem.
2. In machine learning, weights are the difficulty of a specific training example, so that higher weights mean the learning algorithm has to “try harder” to accommodate that example. The first result I’m aware of for this is the Perceptron (and similar Winnow) algorithm for learning hyperplane separators. The most famous is the AdaBoost algorithm.
3. Analogously, in optimization, the weights are the difficulty of a specific constraint, and this technique can be used to approximately solve linear and semidefinite programs. The approximation is because MWUA only provides a solution with some error.
4. In mathematical biology, the weights represent the fitness of individual alleles, and filtering reproductive success based on this and updating weights for successful organisms produces a mechanism very much like evolution. With modifications, it also provides a mechanism through which to understand sex in the context of evolutionary biology.
5. The TCP protocol, which basically defined the internet, uses additive and multiplicative weight updates (which are very similar in the analysis) to manage congestion.
6. You can get easy $\log(n)$-approximation algorithms for many NP-hard problems, such as set cover.

Additional, more technical examples can be found in this survey of Arora et al.

In the rest of this post, we’ll implement a generic Multiplicative Weights Update Algorithm, we’ll prove it’s main theoretical guarantees, and we’ll implement a linear program solver as an example of its applicability. As usual, all of the code used in the making of this post is available in a Github repository.

The generic MWUA algorithm

Let’s start by writing down pseudocode and an implementation for the MWUA algorithm in full generality.

In general we have some set $X$ of objects and some set $Y$ of “event outcomes” which can be completely independent. If these sets are finite, we can write down a table $M$ whose rows are objects, whose columns are outcomes, and whose $i,j$ entry $M(i,j)$ is the reward produced by object $x_i$ when the outcome is $y_j$. We will also write this as $M(x, y)$ for object $x$ and outcome $y$. The only assumption we’ll make on the rewards is that the values $M(x, y)$ are bounded by some small constant $B$ (by small I mean $B$ should not require exponentially many bits to write down as compared to the size of $X$). In symbols, $M(x,y) \in [0,B]$. There are minor modifications you can make to the algorithm if you want negative rewards, but for simplicity we will leave that out. Note the table $M$ just exists for analysis, and the algorithm does not know its values. Moreover, while the values in $M$ are static, the choice of outcome $y$ for a given round may be nondeterministic.

The MWUA algorithm randomly chooses an object $x \in X$ in every round, observing the outcome $y \in Y$, and collecting the reward $M(x,y)$ (or losing it as a penalty). The guarantee of the MWUA theorem is that the expected sum of rewards/penalties of MWUA is not much worse than if one had picked the best object (in hindsight) every single round.

Let’s describe the algorithm in notation first and build up pseudocode as we go. The input to the algorithm is the set of objects, a subroutine that observes an outcome, a black-box reward function, a learning rate parameter, and a number of rounds.

def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
...


We define for object $x$ a nonnegative number $w_x$ we call a “weight.” The weights will change over time so we’ll also sub-script a weight with a round number $t$, i.e. $w_{x,t}$ is the weight of object $x$ in round $t$. Initially, all the weights are $1$. Then MWUA continues in rounds. We start each round by drawing an example randomly with probability proportional to the weights. Then we observe the outcome for that round and the reward for that round.

# draw: [float] -> int
# pick an index from the given list of floats proportionally
# to the size of the entry (i.e. normalize to a probability
# distribution and draw according to the probabilities).
def draw(weights):
choice = random.uniform(0, sum(weights))
choiceIndex = 0

for weight in weights:
choice -= weight
if choice <= 0:
return choiceIndex

choiceIndex += 1

# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate numRounds):
weights = [1] * len(objects)
for t in numRounds:
chosenObjectIndex = draw(weights)
chosenObject = objects[chosenObjectIndex]

outcome = observeOutcome(t, weights, chosenObject)
thisRoundReward = reward(chosenObject, outcome)

...


Sampling objects in this way is the same as associating a distribution $D_t$ to each round, where if $S_t = \sum_{x \in X} w_{x,t}$ then the probability of drawing $x$, which we denote $D_t(x)$, is $w_{x,t} / S_t$. We don’t need to keep track of this distribution in the actual run of the algorithm, but it will help us with the mathematical analysis.

Next comes the weight update step. Let’s call our learning rate variable parameter $\varepsilon$. In round $t$ say we have object $x_t$ and outcome $y_t$, then the reward is $M(x_t, y_t)$. We update the weight of the chosen object $x_t$ according to the formula:

$\displaystyle w_{x_t, t} = w_{x_t} (1 + \varepsilon M(x_t, y_t) / B)$

In the more general event that you have rewards for all objects (if not, the reward-producing function can output zero), you would perform this weight update on all objects $x \in X$. This turns into the following Python snippet, where we hide the division by $B$ into the choice of learning rate:

# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
weights = [1] * len(objects)
for t in numRounds:
chosenObjectIndex = draw(weights)
chosenObject = objects[chosenObjectIndex]

outcome = observeOutcome(t, weights, chosenObject)
thisRoundReward = reward(chosenObject, outcome)

for i in range(len(weights)):
weights[i] *= (1 + learningRate * reward(objects[i], outcome))


One of the amazing things about this algorithm is that the outcomes and rewards could be chosen adaptively by an adversary who knows everything about the MWUA algorithm (except which random numbers the algorithm generates to make its choices). This means that the rewards in round $t$ can depend on the weights in that same round! We will exploit this when we solve linear programs later in this post.

But even in such an oppressive, exploitative environment, MWUA persists and achieves its guarantee. And now we can state that guarantee.

Theorem (from Arora et al): The cumulative reward of the MWUA algorithm is, up to constant multiplicative factors, at least the cumulative reward of the best object minus $\log(n)$, where $n$ is the number of objects. (Exact formula at the end of the proof)

The core of the proof, which we’ll state as a lemma, uses one of the most elegant proof techniques in all of mathematics. It’s the idea of constructing a potential function, and tracking the change in that potential function over time. Such a proof usually has the mysterious script:

1. Define potential function, in our case $S_t$.
2. State what seems like trivial facts about the potential function to write $S_{t+1}$ in terms of $S_t$, and hence get general information about $S_T$ for some large $T$.
3. Theorem is proved.
4. Wait, what?

Clearly, coming up with a useful potential function is a difficult and prized skill.

In this proof our potential function is the sum of the weights of the objects in a given round, $S_t = \sum_{x \in X} w_{x, t}$. Now the lemma.

Lemma: Let $B$ be the bound on the size of the rewards, and $0 < \varepsilon < 1/2$ a learning parameter. Recall that $D_t(x)$ is the probability that MWUA draws object $x$ in round $t$. Write the expected reward for MWUA for round $t$ as the following (using only the definition of expected value):

$\displaystyle R_t = \sum_{x \in X} D_t(x) M(x, y_t)$

Then the claim of the lemma is:

$\displaystyle S_{t+1} \leq S_t e^{\varepsilon R_t / B}$

Proof. Expand $S_{t+1} = \sum_{x \in X} w_{x, t+1}$ using the definition of the MWUA update:

$\displaystyle \sum_{x \in X} w_{x, t+1} = \sum_{x \in X} w_{x, t}(1 + \varepsilon M(x, y_t) / B)$

Now distribute $w_{x, t}$ and split into two sums:

$\displaystyle \dots = \sum_{x \in X} w_{x, t} + \frac{\varepsilon}{B} \sum_{x \in X} w_{x,t} M(x, y_t)$

Using the fact that $D_t(x) = \frac{w_{x,t}}{S_t}$, we can replace $w_{x,t}$ with $D_t(x) S_t$, which allows us to get $R_t$

\displaystyle \begin{aligned} \dots &= S_t + \frac{\varepsilon S_t}{B} \sum_{x \in X} D_t(x) M(x, y_t) \\ &= S_t \left ( 1 + \frac{\varepsilon R_t}{B} \right ) \end{aligned}

And then using the fact that $(1 + x) \leq e^x$ (Taylor series), we can bound the last expression by $S_te^{\varepsilon R_t / B}$, as desired.

$\square$

Now using the lemma, we can get a hold on $S_T$ for a large $T$, namely that

$\displaystyle S_T \leq S_1 e^{\varepsilon \sum_{t=1}^T R_t / B}$

If $|X| = n$ then $S_1=n$, simplifying the above. Moreover, the sum of the weights in round $T$ is certainly greater than any single weight, so that for every fixed object $x \in X$,

$\displaystyle S_T \geq w_{x,T} \leq (1 + \varepsilon)^{\sum_t M(x, y_t) / B}$

Squeezing $S_t$ between these two inequalities and taking logarithms (to simplify the exponents) gives

$\displaystyle \left ( \sum_t M(x, y_t) / B \right ) \log(1+\varepsilon) \leq \log n + \frac{\varepsilon}{B} \sum_t R_t$

Multiply through by $B$, divide by $\varepsilon$, rearrange, and use the fact that when $0 < \varepsilon < 1/2$ we have $\log(1 + \varepsilon) \geq \varepsilon - \varepsilon^2$ (Taylor series) to get

$\displaystyle \sum_t R_t \geq \left [ \sum_t M(x, y_t) \right ] (1-\varepsilon) - \frac{B \log n}{\varepsilon}$

The bracketed term is the payoff of object $x$, and MWUA’s payoff is at least a fraction of that minus the logarithmic term. The bound applies to any object $x \in X$, and hence to the best one. This proves the theorem.

$\square$

Briefly discussing the bound itself, we see that the smaller the learning rate is, the closer you eventually get to the best object, but by contrast the more the subtracted quantity $B \log(n) / \varepsilon$ hurts you. If your target is an absolute error bound against the best performing object on average, you can do more algebra to determine how many rounds you need in terms of a fixed $\delta$. The answer is roughly: let $\varepsilon = O(\delta / B)$ and pick $T = O(B^2 \log(n) / \delta^2)$. See this survey for more.

MWUA for linear programs

Now we’ll approximately solve a linear program using MWUA. Recall that a linear program is an optimization problem whose goal is to minimize (or maximize) a linear function of many variables. The objective to minimize is usually given as a dot product $c \cdot x$, where $c$ is a fixed vector and $x = (x_1, x_2, \dots, x_n)$ is a vector of non-negative variables the algorithm gets to choose. The choices for $x$ are also constrained by a set of $m$ linear inequalities, $A_i \cdot x \geq b_i$, where $A_i$ is a fixed vector and $b_i$ is a scalar for $i = 1, \dots, m$. This is usually summarized by putting all the $A_i$ in a matrix, $b_i$ in a vector, as

$x_{\textup{OPT}} = \textup{argmin}_x \{ c \cdot x \mid Ax \geq b, x \geq 0 \}$

We can further simplify the constraints by assuming we know the optimal value $Z = c \cdot x_{\textup{OPT}}$ in advance, by doing a binary search (more on this later). So, if we ignore the hard constraint $Ax \geq b$, the “easy feasible region” of possible $x$‘s includes $\{ x \mid x \geq 0, c \cdot x = Z \}$.

In order to fit linear programming into the MWUA framework we have to define two things.

1. The objects: the set of linear inequalities $A_i \cdot x \geq b_i$.
2. The rewards: the error of a constraint for a special input vector $x_t$.

Number 2 is curious (why would we give a reward for error?) but it’s crucial and we’ll discuss it momentarily.

The special input $x_t$ depends on the weights in round $t$ (which is allowed, recall). Specifically, if the weights are $w = (w_1, \dots, w_m)$, we ask for a vector $x_t$ in our “easy feasible region” which satisfies

$\displaystyle (A^T w) \cdot x_t \geq w \cdot b$

For this post we call the implementation of procuring such a vector the “oracle,” since it can be seen as the black-box problem of, given a vector $\alpha$ and a scalar $\beta$ and a convex region $R$, finding a vector $x \in R$ satisfying $\alpha \cdot x \geq \beta$. This allows one to solve more complex optimization problems with the same technique, swapping in a new oracle as needed. Our choice of inputs, $\alpha = A^T w, \beta = w \cdot b$, are particular to the linear programming formulation.

Two remarks on this choice of inputs. First, the vector $A^T w$ is a weighted average of the constraints in $A$, and $w \cdot b$ is a weighted average of the thresholds. So this this inequality is a “weighted average” inequality (specifically, a convex combination, since the weights are nonnegative). In particular, if no such $x$ exists, then the original linear program has no solution. Indeed, given a solution $x^*$ to the original linear program, each constraint, say $A_1 x^*_1 \geq b_1$, is unaffected by left-multiplication by $w_1$.

Second, and more important to the conceptual understanding of this algorithm, the choice of rewards and the multiplicative updates ensure that easier constraints show up less prominently in the inequality by having smaller weights. That is, if we end up overly satisfying a constraint, we penalize that object for future rounds so we don’t waste our effort on it. The byproduct of MWUA—the weights—identify the hardest constraints to satisfy, and so in each round we can put a proportionate amount of effort into solving (one of) the hard constraints. This is why it makes sense to reward error; the error is a signal for where to improve, and by over-representing the hard constraints, we force MWUA’s attention on them.

At the end, our final output is an average of the $x_t$ produced in each round, i.e. $x^* = \frac{1}{T}\sum_t x_t$. This vector satisfies all the constraints to a roughly equal degree. We will skip the proof that this vector does what we want, but see these notes for a simple proof. We’ll spend the rest of this post implementing the scheme outlined above.

Implementing the oracle

Fix the convex region $R = \{ c \cdot x = Z, x \geq 0 \}$ for a known optimal value $Z$. Define $\textup{oracle}(\alpha, \beta)$ as the problem of finding an $x \in R$ such that $\alpha \cdot x \geq \beta$.

For the case of this linear region $R$, we can simply find the index $i$ which maximizes $\alpha_i Z / c_i$. If this value exceeds $\beta$, we can return the vector with that value in the $i$-th position and zeros elsewhere. Otherwise, the problem has no solution.

To prove the “no solution” part, say $n=2$ and you have $x = (x_1, x_2)$ a solution to $\alpha \cdot x \geq \beta$. Then for whichever index makes $\alpha_i Z / c_i$ bigger, say $i=1$, you can increase $\alpha \cdot x$ without changing $c \cdot x = Z$ by replacing $x_1$ with $x_1 + (c_2/c_1)x_2$ and $x_2$ with zero. I.e., we’re moving the solution $x$ along the line $c \cdot x = Z$ until it reaches a vertex of the region bounded by $c \cdot x = Z$ and $x \geq 0$. This must happen when all entries but one are zero. This is the same reason why optimal solutions of (generic) linear programs occur at vertices of their feasible regions.

The code for this becomes quite simple. Note we use the numpy library in the entire codebase to make linear algebra operations fast and simple to read.

def makeOracle(c, optimalValue):
n = len(c)

def oracle(weightedVector, weightedThreshold):
def quantity(i):
return weightedVector[i] * optimalValue / c[i] if c[i] > 0 else -1

biggest = max(range(n), key=quantity)
if quantity(biggest) < weightedThreshold:
raise InfeasibleException

return numpy.array([optimalValue / c[i] if i == biggest else 0 for i in range(n)])

return oracle


Implementing the core solver

The core solver implements the discussion from previously, given the optimal value of the linear program as input. To avoid too many single-letter variable names, we use linearObjective instead of $c$.

def solveGivenOptimalValue(A, b, linearObjective, optimalValue, learningRate=0.1):
m, n = A.shape  # m equations, n variables
oracle = makeOracle(linearObjective, optimalValue)

def reward(i, specialVector):
...

def observeOutcome(_, weights, __):
...

numRounds = 1000
weights, cumulativeReward, outcomes = MWUA(
range(m), observeOutcome, reward, learningRate, numRounds
)
averageVector = sum(outcomes) / numRounds

return averageVector


First we make the oracle, then the reward and outcome-producing functions, then we invoke the MWUA subroutine. Here are those two functions; they are closures because they need access to $A$ and $b$. Note that neither $c$ nor the optimal value show up here.

    def reward(i, specialVector):
constraint = A[i]
threshold = b[i]
return threshold - numpy.dot(constraint, specialVector)

def observeOutcome(_, weights, __):
weights = numpy.array(weights)
weightedVector = A.transpose().dot(weights)
weightedThreshold = weights.dot(b)
return oracle(weightedVector, weightedThreshold)


Implementing the binary search, and an example

Finally, the top-level routine. Note that the binary search for the optimal value is sophisticated (though it could be more sophisticated). It takes a max range for the search, and invokes the optimization subroutine, moving the upper bound down if the linear program is feasible and moving the lower bound up otherwise.

def solve(A, b, linearObjective, maxRange=1000):
optRange = [0, maxRange]

while optRange[1] - optRange[0] > 1e-8:
proposedOpt = sum(optRange) / 2
print("Attempting to solve with proposedOpt=%G" % proposedOpt)

# Because the binary search starts so high, it results in extreme
# reward values that must be tempered by a slow learning rate. Exercise
# to the reader: determine absolute bounds for the rewards, and set
# this learning rate in a more principled fashion.
learningRate = 1 / max(2 * proposedOpt * c for c in linearObjective)
learningRate = min(learningRate, 0.1)

try:
result = solveGivenOptimalValue(A, b, linearObjective, proposedOpt, learningRate)
optRange[1] = proposedOpt
except InfeasibleException:
optRange[0] = proposedOpt

return result


Finally, a simple example:

A = numpy.array([[1, 2, 3], [0, 4, 2]])
b = numpy.array([5, 6])
c = numpy.array([1, 2, 1])

x = solve(A, b, c)
print(x)
print(c.dot(x))
print(A.dot(x) - b)


The output:

Attempting to solve with proposedOpt=500
Attempting to solve with proposedOpt=250
Attempting to solve with proposedOpt=125
Attempting to solve with proposedOpt=62.5
Attempting to solve with proposedOpt=31.25
Attempting to solve with proposedOpt=15.625
Attempting to solve with proposedOpt=7.8125
Attempting to solve with proposedOpt=3.90625
Attempting to solve with proposedOpt=1.95312
Attempting to solve with proposedOpt=2.92969
Attempting to solve with proposedOpt=3.41797
Attempting to solve with proposedOpt=3.17383
Attempting to solve with proposedOpt=3.05176
Attempting to solve with proposedOpt=2.99072
Attempting to solve with proposedOpt=3.02124
Attempting to solve with proposedOpt=3.00598
Attempting to solve with proposedOpt=2.99835
Attempting to solve with proposedOpt=3.00217
Attempting to solve with proposedOpt=3.00026
Attempting to solve with proposedOpt=2.99931
Attempting to solve with proposedOpt=2.99978
Attempting to solve with proposedOpt=3.00002
Attempting to solve with proposedOpt=2.9999
Attempting to solve with proposedOpt=2.99996
Attempting to solve with proposedOpt=2.99999
Attempting to solve with proposedOpt=3.00001
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3  # note %G rounds the printed values
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
[ 0.     0.987  1.026]
3.00000000425
[  5.20000072e-02   8.49831849e-09]


So there we have it. A fiendishly clever use of multiplicative weights for solving linear programs.

Discussion

One of the nice aspects of MWUA is it’s completely transparent. If you want to know why a decision was made, you can simply look at the weights and look at the history of rewards of the objects. There’s also a clear interpretation of what is being optimized, as the potential function used in the proof is a measure of both quality and adaptability to change. The latter is why MWUA succeeds even in adversarial settings, and why it makes sense to think about MWUA in the context of evolutionary biology.

This even makes one imagine new problems that traditional algorithms cannot solve, but which MWUA handles with grace. For example, imagine trying to solve an “online” linear program in which over time a constraint can change. MWUA can adapt to maintain its approximate solution.

The linear programming technique is known in the literature as the Plotkin-Shmoys-Tardos framework for covering and packing problems. The same ideas extend to other convex optimization problems, including semidefinite programming.

If you’ve been reading this entire post screaming “This is just gradient descent!” Then you’re right and wrong. It bears a striking resemblance to gradient descent (see this document for details about how special cases of MWUA are gradient descent by another name), but the adaptivity for the rewards makes MWUA different.

Even though so many people have been advocating for MWUA over the past decade, it’s surprising that it doesn’t show up in the general math/CS discourse on the internet or even in many algorithms courses. The Arora survey I referenced is from 2005 and the linear programming technique I demoed is originally from 1991! I took algorithms classes wherever I could, starting undergraduate in 2007, and I didn’t even hear a whisper of this technique until midway through my PhD in theoretical CS (I did, however, study fictitious play in a game theory class). I don’t have an explanation for why this is the case, except maybe that it takes more than 20 years for techniques to make it to the classroom. At the very least, this is one good reason to go to graduate school. You learn the things (and where to look for the things) which haven’t made it to classrooms yet.

Until next time!

Bezier Curves and Picasso

Pablo Picasso in front of The Kitchen, photo by Herbert List.

Simplicity and the Artist

Some of my favorite of Pablo Picasso’s works are his line drawings. He did a number of them about animals: an owl, a camel, a butterfly, etc. This piece called “Dog” is on my wall:

(Jump to interactive demo where we recreate “Dog” using the math in this post)

These paintings are extremely simple but somehow strike the viewer as deeply profound. They give the impression of being quite simple to design and draw. A single stroke of the hand and a scribbled signature, but what a masterpiece! It simultaneously feels like a hasty afterthought and a carefully tuned overture to a symphony of elegance. In fact, we know that Picasso’s process was deep. For example, in 1945-1946, Picasso made a series of eleven drawings (lithographs, actually) showing the progression of his rendition of a bull. The first few are more or less lifelike, but as the series progresses we see the bull boiled down to its essence, the final painting requiring a mere ten lines. Along the way we see drawings of a bull that resemble some of Picasso’s other works (number 9 reminding me of the sculpture at Daley Center Plaza in Chicago). Read more about the series of lithographs here.

Picasso’s, “The Bull.” Photo taken by Jeremy Kun at the Art Institute of Chicago in 2013. Click to enlarge.

Now I don’t pretend to be a qualified artist (I couldn’t draw a bull to save my life), but I can recognize the mathematical aspects of his paintings, and I can write a damn fine program. There is one obvious way to consider Picasso-style line drawings as a mathematical object, and it is essentially the Bezier curve. Let’s study the theory behind Bezier curves, and then write a program to draw them. The mathematics involved requires no background knowledge beyond basic algebra with polynomials, and we’ll do our best to keep the discussion low-tech. Then we’ll explore a very simple algorithm for drawing Bezier curves, implement it in Javascript, and recreate one of Picasso’s line drawings as a sequence of Bezier curves.

The Bezier Curve and Parameterizations

When asked to conjure a “curve” most people (perhaps plagued by their elementary mathematics education) will either convulse in fear or draw part of the graph of a polynomial. While these are fine and dandy curves, they only represent a small fraction of the world of curves. We are particularly interested in curves which are not part of the graphs of any functions.

Three French curves.

For instance, a French curve is a physical template used in (manual) sketching to aid the hand in drawing smooth curves. Tracing the edges of any part of these curves will usually give you something that is not the graph of a function. It’s obvious that we need to generalize our idea of what a curve is a bit. The problem is that many fields of mathematics define a curve to mean different things.  The curves we’ll be looking at, called Bezier curves, are a special case of  single-parameter polynomial plane curves. This sounds like a mouthful, but what it means is that the entire curve can be evaluated with two polynomials: one for the $x$ values and one for the $y$ values. Both polynomials share the same variable, which we’ll call $t$, and $t$ is evaluated at real numbers.

An example should make this clear. Let’s pick two simple polynomials in $t$, say $x(t) = t^2$ and $y(t) = t^3$. If we want to find points on this curve, we can just choose values of $t$ and plug them into both equations. For instance, plugging in $t = 2$ gives the point $(4, 8)$ on our curve. Plotting all such values gives a curve that is definitely not the graph of a function:

But it’s clear that we can write any single-variable function $f(x)$ in this parametric form: just choose $x(t) = t$ and $y(t) = f(t)$. So these are really more general objects than regular old functions (although we’ll only be working with polynomials in this post).

Quickly recapping, a single-parameter polynomial plane curve is defined as a pair of polynomials $x(t), y(t)$ in the same variable $t$. Sometimes, if we want to express the whole gadget in one piece, we can take the coefficients of common powers of $t$ and write them as vectors in the $x$ and $y$ parts. Using the $x(t) = t^2, y(t) = t^3$ example above, we can rewrite it as

$\mathbf{f}(t) = (0,1) t^3 + (1,0) t^2$

Here the coefficients are points (which are the same as vectors) in the plane, and we represent the function $f$ in boldface to emphasize that the output is a point. The linear-algebraist might recognize that pairs of polynomials form a vector space, and further combine them as $(0, t^3) + (t^2, 0) = (t^2, t^3)$. But for us, thinking of points as coefficients of a single polynomial is actually better.

We will also restrict our attention to single-parameter polynomial plane curves for which the variable $t$ is allowed to range from zero to one. This might seem like an awkward restriction, but in fact every finite single-parameter polynomial plane curve can be written this way (we won’t bother too much with the details of how this is done). For the purpose of brevity, we will henceforth call a “single-parameter polynomial plane curve where $t$ ranges from zero to one” simply a “curve.”

Now there are some very nice things we can do with curves. For instance, given any two points in the plane $P = (p_1, p_2), Q = (q_1, q_2)$ we can describe the straight line between them as a curve: $\mathbf{L}(t) = (1-t)P + tQ$. Indeed, at $t=0$ the value $\mathbf{L}(t)$ is exactly $P$, at $t=1$ it’s exactly $Q$, and the equation is a linear polynomial in $t$. Moreover (without getting too much into the calculus details), the line $\mathbf{L}$ travels at “unit speed” from $P$ to $Q$. In other words, we can think of $\mathbf{L}$ as describing the motion of a particle from $P$ to $Q$ over time, and at time $1/4$ the particle is a quarter of the way there, at time $1/2$ it’s halfway, etc. (An example of a straight line which doesn’t have unit speed is, e.g. $(1-t^2) P + t^2 Q$.)

More generally, let’s add a third point $R$. We can describe a path which goes from $P$ to $R$, and is “guided” by $Q$ in the middle. This idea of a “guiding” point is a bit abstract, but computationally no more difficult. Instead of travelling from one point to another at constant speed, we want to travel from one line to another at constant speed. That is, call the two curves describing lines from $P \to Q$ and $Q \to R$ $\mathbf{L}_1, \mathbf{L_2}$, respectively. Then the curve “guided” by $Q$ can be written as a curve

$\displaystyle \mathbf{F}(t) = (1-t)\mathbf{L}_1(t) + t \mathbf{L}_2(t)$

Multiplying this all out gives the formula

$\displaystyle \mathbf{F}(t) = (1-t)^2 P + 2t(1-t)Q + t^2 R$

We can interpret this again in terms of a particle moving. At the beginning of our curve the value of $t$ is small, and so we’re sticking quite close to the line $\mathbf{L}_1$ As time goes on the point $\mathbf{F}(t)$ moves along the line between the points $\mathbf{L}_1(t)$ and $\mathbf{L}_2(t)$, which are themselves moving. This traces out a curve which looks like this

This screenshot was taken from a wonderful demo by data visualization consultant Jason Davies. It expresses the mathematical idea quite superbly, and one can drag the three points around to see how it changes the resulting curve. One should play with it for at least five minutes.

The entire idea of a Bezier curve is a generalization of this principle: given a list $P_0, \dots, P_n$ of points in the plane, we want to describe a curve which travels from the first point to the last, and is “guided” in between by the remaining points. A Bezier curve is a realization of such a curve (a single-parameter polynomial plane curve) which is the inductive continuation of what we described above: we travel at unit speed from a Bezier curve defined by the first $n-1$ points in the list to the curve defined by the last $n-1$ points. The base case is the straight-line segment (or the single point, if you wish). Formally,

Definition: Given a list of points in the plane $P_0, \dots, P_n$ we define the degree $n-1$ Bezier curve recursively as

\begin{aligned} \mathbf{B}_{P_0}(t) &= P_0 \\ \mathbf{B}_{P_0 P_1 \dots P_n}(t) &= (1-t)\mathbf{B}_{P_0 P_1 \dots P_{n-1}} + t \mathbf{B}_{P_1P_2 \dots P_n}(t) \end{aligned}

We call $P_0, \dots, P_n$ the control points of $\mathbf{B}$.

While the concept of travelling at unit speed between two lower-order Bezier curves is the real heart of the matter (and allows us true computational insight), one can multiply all of this out (using the formula for binomial coefficients) and get an explicit formula. It is:

$\displaystyle \mathbf{B}_{P_0 \dots P_n} = \sum_{k=0}^n \binom{n}{k}(1-t)^{n-k}t^k P_i$

And for example, a cubic Bezier curve with control points $P_0, P_1, P_2, P_3$ would have equation

$\displaystyle (1-t)^3 P_0 + 3(1-t)^2t P_1 + 3(1-t)t^2 P_2 + t^3 P_3$

Higher dimensional Bezier curves can be quite complicated to picture geometrically. For instance, the following is a fifth-degree Bezier curve (with six control points).

A degree five Bezier curve, credit Wikipedia.

The additional line segments drawn show the recursive nature of the curve. The simplest are the green points, which travel from control point to control point. Then the blue points travel on the line segments between green points, the pink travel along the line segments between blue, the orange between pink, and finally the red point travels along the line segment between the orange points.

Without the recursive structure of the problem (just seeing the curve) it would be a wonder how one could actually compute with these things. But as we’ll see, the algorithm for drawing a Bezier curve is very natural.

Bezier Curves as Data, and de Casteljau’s Algorithm

Let’s derive and implement the algorithm for painting a Bezier curve to a screen using only the ability to draw straight lines. For simplicity, we’ll restrict our attention to degree-three (cubic) Bezier curves. Indeed, every Bezier curve can be written as a combination of cubic curves via the recursive definition, and in practice cubic curves balance computational efficiency and expressiveness. All of the code we present in this post will be in Javascript, and is available on this blog’s Github page.

So then a cubic Bezier curve is represented in a program by a list of four points. For example,

var curve = [[1,2], [5,5], [4,0], [9,3]];

Most graphics libraries (including the HTML5 canvas standard) provide a drawing primitive that can output Bezier curves given a list of four points. But suppose we aren’t given such a function. Suppose that we only have the ability to draw straight lines. How would one go about drawing an approximation to a Bezier curve? If such an algorithm exists (it does, and we’re about to see it) then we could make the approximation so fine that it is visually indistinguishable from a true Bezier curve.

The key property of Bezier curves that allows us to come up with such an algorithm is the following:

Any cubic Bezier curve $\mathbf{B}$ can be split into two, end to end,
which together trace out the same curve as $\mathbf{B}$.

Let see exactly how this is done. Let $\mathbf{B}(t)$ be a cubic Bezier curve with control points $P_0, P_1, P_2, P_3$, and let’s say we want to split it exactly in half. We notice that the formula for the curve when we plug in $1/2$, which is

$\displaystyle \mathbf{B}(1/2) = \frac{1}{2^3}(P_0 + 3P_1 + 3P_2 + P_3)$

Moreover, our recursive definition gave us a way to evaluate the point in terms of smaller-degree curves. But when these are evaluated at 1/2 their formulae are similarly easy to write down. The picture looks like this:

The green points are the degree one curves, the pink points are the degree two curves, and the blue point is the cubic curve. We notice that, since each of the curves are evaluated at $t=1/2$, each of these points can be described as the midpoints of points we already know. So $m_0 = (P_0 + P_1) / 2, q_0 = (m_0 + m_1)/2$, etc.

In fact, the splitting of the two curves we want is precisely given by these points. That is, the “left” half of the curve is given by the curve $\mathbf{L}(t)$ with control points $P_0, m_0, q_0, \mathbf{B}(1/2)$, while the “right” half $\mathbf{R}(t)$ has control points $\mathbf{B}(1/2), q_1, m_2, P_3$.

How can we be completely sure these are the same Bezier curves? Well, they’re just polynomials. We can compare them for equality by doing a bunch of messy algebra. But note, since $\mathbf{L}(t)$ only travels halfway along $\mathbf{B}(t)$, to check they are the same is to equate $\mathbf{L}(t)$ with $\mathbf{B}(t/2)$, since as $t$ ranges from zero to one, $t/2$ ranges from zero to one half. Likewise, we can compare $\mathbf{B}((t+1)/2)$ with $\mathbf{R}(t)$.

The algebra is very messy, but doable. As a test of this blog’s newest tools, here’s a screen cast of me doing the algebra involved in proving the two curves are identical.

Now that that’s settled, we have a nice algorithm for splitting a cubic Bezier (or any Bezier) into two pieces. In Javascript,

function subdivide(curve) {
var firstMidpoints = midpoints(curve);
var secondMidpoints = midpoints(firstMidpoints);
var thirdMidpoints = midpoints(secondMidpoints);

return [[curve[0], firstMidpoints[0], secondMidpoints[0], thirdMidpoints[0]],
[thirdMidpoints[0], secondMidpoints[1], firstMidpoints[2], curve[3]]];
}

Here “curve” is a list of four points, as described at the beginning of this section, and the output is a list of two curves with the correct control points. The “midpoints” function used is quite simple, and we include it here for compelteness:

function midpoints(pointList) {
var midpoint = function(p, q) {
return [(p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0];
};

var midpointList = new Array(pointList.length - 1);
for (var i = 0; i < midpointList.length; i++) {
midpointList[i] = midpoint(pointList[i], pointList[i+1]);
}

return midpointList;
}


It just accepts as input a list of points and computes their sequential midpoints. So a list of $n$ points is turned into a list of $n-1$ points. As we saw, we need to call this function $d-1$ times to compute the segmentation of a degree $d$ Bezier curve.

As explained earlier, we can keep subdividing our curve over and over until each of the tiny pieces are basically lines. That is, our function to draw a Bezier curve from the beginning will be as follows:

function drawCurve(curve, context) {
if (isFlat(curve)) {
drawSegments(curve, context);
} else {
var pieces = subdivide(curve);
drawCurve(pieces[0], context);
drawCurve(pieces[1], context);
}
}


In words, as long as the curve isn’t “flat,” we want to subdivide and draw each piece recursively. If it is flat, then we can simply draw the three line segments of the curve and be reasonably sure that it will be a good approximation. The context variable sitting there represents the canvas to be painted to; it must be passed through to the “drawSegments” function, which simply paints a straight line to the canvas.

Of course this raises the obvious question: how can we tell if a Bezier curve is flat? There are many ways to do so. One could compute the angles of deviation (from a straight line) at each interior control point and add them up. Or one could compute the volume of the enclosed quadrilateral. However, computing angles and volumes is usually not very nice: angles take a long time to compute and volumes have stability issues, and the algorithms which are stable are not very simple. We want a measurement which requires only basic arithmetic and perhaps a few logical conditions to check.

It turns out there is such a measurement. It’s originally attributed to Roger Willcocks, but it’s quite simple to derive by hand.

Essentially, we want to measure the “flatness” of a cubic Bezier curve by computing the distance of the actual curve at time $t$ from where the curve would be at time $t$ if the curve were a straight line.

Formally, given $\mathbf{B}(t)$ with control points $P_0, P_1, P_2, P_3$ as usual, we can define the straight-line Bezier cubic as the colossal sum

$\displaystyle \mathbf{S}(t) = (1-t)^3P_0 + 3(1-t)^2t \left ( \frac{2}{3}P_0 + \frac{1}{3}P_3 \right ) + 3(1-t)t^2 \left ( \frac{1}{3}P_0 + \frac{2}{3}P_3 \right ) + t^3 P_3$

There’s nothing magical going on here. We’re simply giving the Bezier curve with control points $P_0, \frac{2}{3}P_0 + \frac{1}{3}P_3, \frac{1}{3}P_0 + \frac{2}{3}P_3, P_3$. One should think about this as points which are a 0, 1/3, 2/3, and 1 fraction of the way from $P_0$ to $P_3$ on a straight line.

Then we define the function $d(t) = \left \| \mathbf{B}(t) - \mathbf{S}(t) \right \|$ to be the distance between the two curves at the same time $t$. The flatness value of $\mathbf{B}$ is the maximum of $d$ over all values of $t$. If this flatness value is below a certain tolerance level, then we call the curve flat.

With a bit of algebra we can simplify this expression. First, the value of $t$ for which the distance is maximized is the same as when its square is maximized, so we can omit the square root computation at the end and take that into account when choosing a flatness tolerance.

Now lets actually write out the difference as a single polynomial. First, we can cancel the 3’s in $\mathbf{S}(t)$ and write the polynomial as

$\displaystyle \mathbf{S}(t) = (1-t)^3 P_0 + (1-t)^2t (2P_0 + P_3) + (1-t)t^2 (P_0 + 2P_3) + t^3 P_3$

and so $\mathbf{B}(t) - \mathbf{S}(t)$ is (by collecting coefficients of the like terms $(1-t)^it^j$)

$\displaystyle (1-t)^2t (3 P_1 - 2P_0 - P_3) + (1-t)t^2 (3P_2 - P_0 - 2P_3)$

Factoring out the $(1-t)t$ from both terms and setting $a = 3P_1 - 2P_0 - P_3$, $b = 3P_2 - P_0 - 2P_3$, we get

$\displaystyle d^2(t) = \left \| (1-t)t ((1-t)a + tb) \right \|^2 = (1-t)^2t^2 \left \| (1-t)a + tb \right \|^2$

Since the maximum of a product is at most the product of the maxima, we can bound the above quantity by the product of the two maxes. The reason we want to do this is because we can easily compute the two maxes separately. It wouldn’t be hard to compute the maximum without splitting things up, but this way ends up with fewer computational steps for our final algorithm, and the visual result is equally good.

Using some elementary single-variable calculus, the maximum value of $(1-t)^2t^2$ for $0 \leq t \leq 1$ turns out to be $1/16$. And the norm of a vector is just the sum of squares of its components. If $a = (a_x, a_y)$ and $b = (b_x, b_y)$, then the norm above is exactly

$\displaystyle ((1-t)a_x + tb_x)^2 + ((1-t)a_y + tb_y)^2$

And notice: for any real numbers $z, w$ the quantity $(1-t)z + tw$ is exactly the straight line from $z$ to $w$ we know so well. The maximum over all $t$ between zero and one is obviously the maximum of the endpoints $z, w$. So the max of our distance function $d^2(t)$ is bounded by

$\displaystyle \frac{1}{16} (\textup{max}(a_x^2, b_x^2) + \textup{max}(a_y^2, b_y^2))$

And so our condition for being flat is that this bound is smaller than some allowable tolerance. We may safely factor the 1/16 into this tolerance bound, and so this is enough to write a function.

function isFlat(curve) {
var tol = 10; // anything below 50 is roughly good-looking

var ax = 3.0*curve[1][0] - 2.0*curve[0][0] - curve[3][0]; ax *= ax;
var ay = 3.0*curve[1][1] - 2.0*curve[0][1] - curve[3][1]; ay *= ay;
var bx = 3.0*curve[2][0] - curve[0][0] - 2.0*curve[3][0]; bx *= bx;
var by = 3.0*curve[2][1] - curve[0][1] - 2.0*curve[3][1]; by *= by;

return (Math.max(ax, bx) + Math.max(ay, by) <= tol);
}


And there we have it. We write a simple HTML page to access a canvas element and a few extra helper functions to draw the line segments when the curve is flat enough, and present the final result in this interactive demonstration (you can perturb the control points).

The picture you see on that page (given below) is my rendition of Picasso’s “Dog” drawing as a sequence of nine Bezier curves. I think the resemblance is uncanny 🙂

Picasso’s “Dog,” redesigned as a sequence of nine bezier curves.

While we didn’t invent the drawing itself (and hence shouldn’t attach our signature to it), we did come up with the representation as a sequence of Bezier curves. It only seems fitting to present that as the work of art. Here we’ve distilled the representation down to a single file: the first line is the dimension of the canvas, and each subsequent line represents a cubic Bezier curve. Comments are included for readability.

“Dog” Jeremy Kun, 2013. Click to enlarge.

Because standardizing things seems important, we define a new filetype “.bezier”, which has the format given above:

int int
(int) curve
(int) curve
...

Where the first two ints specify the size of the canvas, the first (optional) int on each line specifies the width of the stroke, and a “curve” has the form

[int,int] [int,int] ... [int,int]

If an int is omitted at the beginning of a line, this specifies a width of three pixels.

In a general .bezier file we allow a curve to have arbitrarily many control points, though the code we gave above does not draw them that generally. As an exercise, write a program which accepts as input a .bezier file and produces as output an image of the drawing. This will require an extension of the algorithm above for drawing arbitrary Bezier curves, which loops its computation of the midpoints and keeps track of which end up in the resulting subdivision. Alternatively, one could write a program which accepts as input a .bezier file with only cubic Bezier curves, and produces as output an SVG file of the drawing (SVG only supports cubic Bezier curves). So a .bezier file is a simplification (fewer features) and an extension (Bezier curves of arbitrary degree) of an SVG file.

We didn’t go as deep into the theory of Bezier curves as we could have. If the reader is itching for more (and a more calculus-based approach), see this lengthy primer. It contains practically everything one could want to know about Bezier curves, with nice interactive demos written in Processing.

Low-Complexity Art

There are some philosophical implications of what we’ve done today with Picasso’s “Dog.” Previously on this blog we’ve investigated the idea of low-complexity art, and it’s quite relevant here. The thesis is that “beautiful” art has a small description length, and more formally the “complexity” of some object (represented by text) is the length of the shortest program that outputs that object given no inputs. More on that in our primer on Kolmogorov complexity. The fact that we can describe Picasso’s line drawings with a small number of Bezier curves (and a relatively short program to output the bezier curves) is supposed to be a deep statement about the beauty of the art itself. Obviously this is very subjective, but not without its proponents.

There has been a bit of recent interest in computers generating art. For instance, this recent programming competition (in Dutch) gave the task of generating art similar to the work of Piet Mondrian. The idea is that the more elegant the algorithm, the higher it would be scored. The winner used MD5 hashes to generate Mondrian pieces, and there were many many other impressive examples (the link above has a gallery of submissions).

In our earlier post on low-complexity art, we explored the possibility of representing all images within a coordinate system involving circles with shaded interiors. But it’s obvious that such a coordinate system wouldn’t be able to represent “Dog” with very low complexity. It seems that Bezier curves are a much more natural system of coordinates. Some of the advantages include that length of lines and slight perturbations don’t affect the resulting complexity. A cubic Bezier curve can be described by any set of four points, and more “intricate” (higher complexity) descriptions of curves require a larger number of points. Bezier curves can be scaled up arbitrarily, and this doesn’t significantly change the complexity of the curve (although scaling many orders of magnitude will introduce a logarithmic factor complexity increase, this is quite small). Curves with larger stroke are slightly more complex than those with smaller stroke, and representing many small sharp bends require more curves than long, smooth arcs.

On the downside, it’s not so easy to represent a circle as a Bezier curve. In fact, it is impossible to do so exactly. Despite the simplicity of this object (it’s even defined as a single polynomial, albeit in two variables), the best one can do is approximate it. The same goes for ellipses. There are actually ways to overcome this (the concept of rational Bezier curves which are quotients of polynomials), but they add to the inherent complexity of the drawing algorithm and the approximations using regular Bezier curves are good enough.

And so we define the complexity of a drawing to be the number of bits in its .bezier file representation. Comments are ignored in this calculation.

The real prize, and what we’ll explore next time, is to find a way to generate art automatically. That is to do one of two things:

1. Given some sort of “seed,” write a program that produces a pseudo-random line drawing.
2. Given an image, produce a .bezier image which accurately depicts the image as a line drawing.

We will attempt to explore these possibilities in the follow-up to this post. Depending on how things go, this may involve some local search algorithms, genetic algorithms, or other methods.

Until then!

Addendum: want to buy a framed print of the source code for “Dog”? Head over to our page on Society6.

Seam Carving for Content-Aware Image Scaling

The Problem with Cropping

Every programmer or graphic designer with some web development experience can attest to the fact that finding good images that have an exactly specified size is a pain. Since the dimensions of the sought picture are usually inflexible, an uncomfortable compromise can come in the form of cropping a large image down to size or scaling the image to have appropriate dimensions.

Both of these solutions are undesirable. In the example below, the caterpillar looks distorted in the scaled versions (top right and bottom left), and in the cropped version (bottom right) it’s more difficult to tell that the caterpillar is on a leaf; we have lost the surrounding context.

In this post we’ll look at a nice heuristic method for rescaling images called seam-carving, which pays attention to the contents of the image as it resacles. In particular, it only removes or adds pixels to the image that the viewer is least-likely to notice. In all but the most extreme cases it will avoid the ugly artifacts introduced by cropping and scaling, and with a bit of additional scaffolding it becomes a very useful addition to a graphic designer’s repertoire. At first we will focus on scaling an image down, and then we will see that the same technique can be used to enlarge an image.

Before we begin, we should motivate the reader with some examples of its use.

It’s clear that the caterpillar is far less distorted in all versions, and even in the harshly rescaled version, parts of the green background are preserved. Although the leaf is warped a little, it is still present, and it’s not obvious that the image was manipulated.

Now that the reader’s appetite has been whet, let’s jump into the mathematics of it. This method was pioneered by Avidan and Shamir, and the impatient reader can jump straight to their paper (which contains many more examples). In this post we hope to fill in the background and show a working implementation.

Images as Functions

One common way to view an image is as an approximation to a function of two real variables. Suppose we have an $n \times m$-pixel image ($n$ rows and $m$ columns of pixels). For simplicity (during the next few paragraphs), we will also assume that the pixel values of an image are grayscale intensity values between 0 and 255. Then we can imagine the pixel values as known integer values of a function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$. That is, if we take two integers $0 \leq x < n$ and $0 \leq y < m$ then we know the value $f(x,y)$; it’s just the intensity value at the corresponding pixel. For values outside these ranges, we can impose arbitrary values for $I$ (we don’t care what’s happening outside the image).

Moreover, it makes sense to assume that $f$ is a well-behaved function in between the pixels (i.e. it is differentiable). And so we can make reasonable guessed as to the true derivative of $f$ by looking at the differences between adjacent pixels. There are many ways to get a good approximation of the derivative of an image function, but we should pause a moment to realize why this is important to nail down for the purpose of resizing images.

A good rule of thumb with images is that regions of an image which are most important to the viewer are those which contain drastic changes in intensity or color. For instance, consider this portrait of Albert Einstein.

Which parts of this image first catch the eye? The unkempt hair, the wrinkled eyes, the bushy mustache? Certainly not the misty background, or the subtle shadows on his chin.

Indeed, one could even claim that an image having a large derivative at a certain pixel corresponds to high information content there (of course this is not true of all images, but perhaps it’s reasonable to claim this for photographs). And if we want to scale an image down in size, we are interested in eliminating those regions which have the smallest information content. Of course we cannot avoid losing some information: the image after resizing is smaller than the original, and a reasonable algorithm should not add any new information. But we can minimize the damage by intelligently picking which parts to remove; our naive assumption is that a small derivative at a pixel implies a small amount of information.

Of course we can’t just remove “regions” of an image to change its proportions. We have to remove the same number of pixels in each row or column to reduce the corresponding dimension (width or height, resp.). Before we get to that, though, let’s write a program to compute the gradient. For this program and the rest of the post we will use the Processing programming language, and our demonstrations will use the Javascript cross-compiler processing.js. The nice thing about Processing is that if you know Java then you know processing. All the basic language features are the same, and it’s just got an extra few native types and libraries to make graphics rendering and image displaying easier. As usual, all of the code used in this blog post is available on this blog’s Github page.

Let’s compute the gradient of this picture, and call the picture $I$:

A very nice picture whose gradient we can compute. It was taken by the artist Ria Czichotzki.

Since this is a color image, we will call it a function $I: \mathbb{R}^2 \to \mathbb{R}^3$, in the sense that the input is a plane coordinate $(x,y)$, and the output $I(x,y) = (r,g,b)$ is a triple of color intensity values. We will approximate the image’s partial derivative $\left \langle \partial I / \partial x, \partial I / \partial y \right \rangle$ at $(x,y)$ by inspecting values of $I$ in a neighborhood of the point:

$I(x-1,y), I(x+1, y), I(x,y-1), I(x,y+1)$.

For each pixel we call the value $|I(x+1,y) - I(x-1,y)| / 2$ the partial derivative in the $x$ direction, and $|I(x,y+1) - I(x,y-1)| / 2$ the partial in the $y$ direction. Note that the values $I(x,y)$ are vectors, so the norm signs here are really computing the distance between the two values of $I$.

There are two ways to see why this makes sense as an approximation. The first is analytic: by definition, the partial derivative $\partial I / \partial x$ is a limit:

$\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x,y)|}{h}$

It turns out that this limit is equivalent to

$\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x-h,y)|}{2h}$

And the closer $h$ gets to zero the better the approximation of the limit is. Since the closest we can make $h$ is $h=1$ (we don’t know any other values of $I$ with nonzero $h$), we plug in the corresponding values for neighboring pixels. The partial $\partial I / \partial y$ is similar.

The second way to view it is geometric.

The slope of the blue secant line is not a bad approximation to the derivative at x, provided the resolution is fine enough.

The salient fact here is that a nicely-behaved curve at $x$ will have a derivative close to the secant line between the points $(x-1, f(x-1))$ and $(x+1, f(x+1))$. Indeed, this idea inspires the original definition of the derivative. The slope of the secant line is just $(f(x+1) - f(x-1)) / 2$. As we saw in our post on numerical integration, we can do much better than a linear guess (specifically, we can use do any order of polynomial interpolation we wish), but for the purposes of displaying the concept of seam-carving, a linear guess will suffice.

And so with this intuitive understanding of how to approximate the gradient, the algorithm to actually do it is a straightforward loop. Here we compute the horizontal gradient (that is, the derivative $\partial I / \partial x$).

PImage horizontalGradient(PImage img) {
color left, right;
int center;
PImage newImage = createImage(img.width, img.height, RGB);

for (int x = 0; x < img.width; x++) {
for (int y = 0; y < img.height; y++) {
center = x + y*img.width;

left = x == 0 ? img.pixels[center] : img.pixels[(x-1) + y*img.width];
right = x == img.width-1 ? img.pixels[center] : img.pixels[(x+1) + y*img.width];

newImage.pixels[center] = color(colorDistance(left, right));
}
}

return newImage;
}


The details are a bit nit-picky, but the idea is simple. If we’re inspecting a non-edge pixel, then we can use the formula directly and compute the values of the neighboring left and right pixels. Otherwise, the “left” pixel or the “right” pixel will be outside the bounds of the image, and so we replace it with the pixel we’re inspecting. Mathematically, we’d be computing the difference $|I(x, y) - I(x+1, y)|$ and $|I(x-1,y) - I(x, y)|$. Additionally, since we’ll later only be interested in the relative sizes of the gradient, we can ignore the factor of 1/2 in the formula we derived.

The parts of this code that are specific to Processing also deserve some attention. Specifically, we use the built-in types PImage and color, for representing images and colors, respectively. The “createImage” function creates an empty image of the specified size. And peculiarly, the pixels of a PImage are stored as a one-dimensional array. So as we’re iterating through the rows and columns, we must compute the correct location of the sought pixel in the pixel array (this is why we have a variable called “center”). Finally, as in Java, the ternary if notation is used to keep the syntax short, and those two lines simply check for the boundary conditions we stated above.

The last unexplained bit of the above code is the “colorDistance” function. As our image function $I(x,y)$ has triples of numbers as values, we need to compute the distance between two values via the standard distance formula. We have encapsulated this in a separate function. Note that because (in this section of the blog) we are displaying the results in an image, we have to convert to an integer at the end.

int colorDistance(color c1, color c2) {
float r = red(c1) - red(c2);
float g = green(c1) - green(c2);
float b = blue(c1) - blue(c2);
return (int)sqrt(r*r + g*g + b*b);
}


Let’s see this in action on the picture we introduced earlier.

The reader who is interested in comparing the two more closely may visit this interactive page. Note that we only compute the horizontal gradient, so certain locations in the image have a large derivative but are still dark in this image. For instance, the top of the door in the background and the wooden bars supporting the bottom of the chair are dark despite the vertical color variations.

The vertical gradient computation is entirely analogous, and is left as an exercise to the reader.

Since we want to inspect both vertical and horizontal gradients, we will call the total gradient matrix $G$ the matrix whose entries $g_{i,j}$ are the sums of the magnitudes of the horizontal and vertical gradients at $i,j$:

$\displaystyle g_{i,j} = \left | \frac{\partial I}{\partial x} (i,j) \right | + \left | \frac{\partial I}{\partial y} (i,j) \right |$

The function $e(x,y) = g_{x,y}$ is often called an energy function for $I$. We will mention now that there are other energy functions one can consider, and use this energy function for the remainder of this post.

Seams, and Dynamic Programming

Back to the problem of resizing, we want a way to remove only those regions of an image that have low total gradient across all of the pixels in the region removed. But of course when resizing an image we must maintain the rectangular shape, and so we have to add or remove the same number of pixels in each column or row.

For the purpose of scaling an image down in width (and the other cases are similar), we have a few options. We could find the pixel in each row with minimal total gradient and remove it. More conservatively, we could remove those columns with minimal gradient (as a sum of the total gradient of each pixel in the column). More brashly, we could just remove pixels of lowest gradient willy-nilly from the image, and slide the rows left.

If none of these ideas sound like they would work, it’s because they don’t. We encourage the unpersuaded reader to try out each possibility on a variety of images to see just how poorly they perform. But of these options, removing an entire column happens to distort the image less than the others. Indeed, the idea of a “seam” in an image is just a slight generalization of a column. Intuitively, a seam $s_i$ is a trail of pixels traversing the image from the bottom to the top, and at each step the pixel trail can veer to the right or left by at most one pixel.

Definition: Let $I$ be an $n \times m$ image with nonnegative integer coordinates indexed from zero. A vertical seam in $I$ is a list of coordinates $s_i = (x_i, y_i)$ with the following properties:

• $y_0 = 0$ is at the bottom of the image.
• $y_{n-1} = n-1$ is at the top of the image.
• $y_i$ is strictly increasing.
• $|x_i - x_{i+1}| \leq 1$ for all $0 \leq i < n-1$.

These conditions simply formalize what we mean by a seam. The first and second impose that the seam traverses from top to bottom. The third requires the seam to always “go up,” so that there is only one pixel in each row. The last requires the seam to be “connected” in the sense that it doesn’t veer too far at any given step.

Here are some examples of some vertical seams. One can easily define horizontal seams by swapping the placement of $x, y$ in the above list of conditions.

So the goal is now to remove the seams of lowest total gradient. Here the total gradient of a seam is just the sum of the energy values of the pixels in the seam.

Unfortunately there are many more seams to choose from than columns (or even individual pixels). It might seem difficult at first to find the seam with the minimal total gradient. Luckily, if we’re only interested in minima, we can use dynamic programming to compute the minimal seam ending at any given pixel in linear time.

We point the reader unfamiliar with dynamic programming to our Python primer on this topic. In this case, the sub-problem we’re working with is the minimal total gradient value of all seams from the bottom of the image to a fixed pixel. Let’s call this value $v(a,b)$. If we know $v(a,b)$ for all pixels below, say, row $i$, then we can compute the $v(i+1,b)$ for the entire row $i+1$ by taking pixel $(i+1,j)$, and adding its gradient value to the minimum of the values of possible predecessors in a seam, $v(i,j-1), v(i,j), v(i,j+1)$ (respecting the appropriate boundary conditions).

Once we’ve computed $v(a,b)$ for the entire matrix, we can look at the minimal value at the top of the image $\min_j v(n,j)$, and work backwards down the image to compute which seam gave us this minimum.

Let’s make this concrete and compute the function $v$ as a two-dimensional array called “seamFitness.”

void computeVerticalSeams() {
seamFitness = new float[img.width][img.height];
for (int i = 0; i < img.width; i++) {
}

for (int y = 1; y < img.height; y++) {
for (int x = 0; x < img.width; x++) {

if (x == 0) {
seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x+1][y-1]);
} else if (x == img.width-1) {
seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x-1][y-1]);
} else {
seamFitness[x][y] += min(seamFitness[x-1][y-1], seamFitness[x][y-1], seamFitness[x+1][y-1]);
}
}
}
}


We have two global variables at work here (global is bad, I know, but it’s Processing; it’s made for prototyping). The seamFitness array, and the gradientMagnitude array. We assume at the start of this function that the gradientMagnitude array is filled with sensible values.

Here we first initialize the zero’th row of the seamFitness array to have the same values as the gradient of the image. This is simply because a seam of length 1 has only one gradient value. Note here the coordinates are a bit backwards: the first coordinate represents the choice of a column, and the second represents the choice of a row. We can think of the coordinate axes of our image function having the origin in the bottom-left, the same as we might do mathematically.

Then we iterate over the rows in the matrix, and in each column we compute the fitness based on the fitness of the previous row. That’s it 🙂

To actually remove a seam, we need to create a new image of the right size, and shift the pixels to the right (or left) of the image into place. The details are technically important, but tedious to describe fully. So we leave the inspection of the code as an exercise to the reader. We provide the Processing code on this blog’s Github page, and show an example of its use below. Note each the image resizes every time the user clicks within the image.

Photograph by Raphael Goetter.

It’s interesting (and indeed the goal) to see how at first nothing is warped, and then the lines on the walls curve around the woman’s foot, and then finally the woman’s body is distorted before she gets smushed into a tiny box by the oppressive mouse.

As a quick side note, we attempted to provide an interactive version of this Processing program online in the same way we did for the gradient computation example. Processing is quite nice in that any Processing program (which doesn’t use any fancy Java libraries) can be cross-compiled to Javascript via the processing.js library. This is what we did for the gradient example. But in doing so for the (admittedly inefficient and memory-leaky) seam-carving program, it appeared to run an order of magnitude slower in the browser than locally. This was this author’s first time using Processing, so the reason for the drastic jump in runtime is unclear. If any readers are familiar with processing.js, a clarification would be very welcome in the comments.

Inserting Seams, Removing Objects, and Videos

In addition to removing seams to scale an image down, one can just as easily insert seams to make an image larger. To insert a seam, just double each pixel in the seam and push the rest of the pixels on the row to the right. The process is not hard, but it requires avoiding one pitfall: if we just add a single seam at a time, then the seam with minimum total energy will never change! So we’ll just add the same seam over and over again. Instead, if we want to add $k$ seams, one should compute the minimum $k$ seams and insert them all. If the desired resize is too large, then the programmer should pick an appropriate batch size and add seams in batches.

Another nice technique that comes from the seam-carving algorithm is to intelligently protect or destroy specific regions in the image. To do this requires a minor modification of the gradient computation, but the rest of the algorithm is identical. To protect a region, provide some way of user input specifying which pixels in the image are important, and give those pixels an artificially large gradient value (e.g., the maximum value of an integer). If the down-scaling is not too extreme, the seam computations will be guaranteed not to use any of those pixels, and inserted seams will never repeat those pixels. To remove a region, we just give the desired pixels an arbitrarily low gradient value. Then these pixels will be guaranteed to occur in the minimal seams, and will be removed from the picture.

The technique of seam-carving is a very nice tool, and as we just saw it can be extended to a variety of other techniques. In fact, seam-carving and its applications to object removal and image resizing are implemented in all of the recent versions of Photoshop. The techniques are used to adapt applications to environments with limited screen space, such as a mobile phone or tablet. Seam carving can even be adapted for use in videos. This involves an extension of the dynamic program to work across multiple frames, formally finding a minimal graph cut between two frames so that each piece of the cut is a seam in the corresponding frame. Of course there is a lot more detail to it (and the paper linked above uses this detail to improve the basic image-resizing algorithm), but that’s the rough idea.

We’ve done precious little on this blog with images, but we’d like to get more into graphics programming. There’s a wealth of linear algebra, computational geometry, and artificial intelligence hiding behind most of the computer games we like to play, and it would be fun to dive deeper into these topics. Of course, with every new post this author suggests ten new directions for this blog to go. It’s a curse and a blessing.

Until next time!

Complete Sequences and Magic Tricks

Numberphile posted a video today describing a neat trick based on complete sequences:

The mathematics here is pretty simple, but I noticed at the end of the video that Dr. Grime was constructing the cards by hand, when really this is a job for a computer program. I thought it would be a nice warmup exercise (and a treat to all of the Numberphile viewers) to write a program to construct the cards for any complete sequence.

For the sake of bringing some definitions into it, let’s review the idea of the card trick.

Definition: A sequence of integers $a_n$ is complete if every natural number $k$ can be represented as a sum of numbers in $a_n$.

The examples used in the video are the binary numbers and the fibonacci numbers. The latter is a famous theorem Zeckendorf’s Theorem. Other famous complete sequences include the prime numbers (if you add 1) and the lazy caterer’s sequence.

Now the question is, given a complete sequence, how do we generate the cards from the video? One might recall our post on dynamic programming, in which we wrote a program to compute the optimal coin change for a given amount of money. Indeed, this method works, but we have some added structure in this problem: we know that any number can only be used once. In that case, it is easy to see that there is no need for dynamic programming. Instead, we just use a greedy algorithm.

That is, we can determine which card contains a number $k$ (which number in the sequence occurs in the representation of $k$) as follows. Start with the largest card smaller than $k$, call it $n$, and we know that $n$ shows up in the representation of $k$. To find the next number in the representation, simply repeat the process with the difference $k-n$. The next number to appear in the representation of $k$ is hence the largest card less than $k-n$. We can repeat this process until we run out of cards or the difference becomes zero.

One interesting fact about this algorithm is that it will always produce the smallest representation. That is, in performing the card trick, one is guaranteed the least amount of work in remembering which cards contained the hidden number, and the least amount of work in adding those numbers up.

To implement this algorithm in code, we used Javascript. We chose Javascript so that the reader can run the program and view its source. But we extract the most important bit of the code here:

// compute each card
var j;
for (i = 1; i <= usersMax; i++) {
var remainder = i;
for (j = numbers.length-1; j >= 0; j--) {
if (numbers[j] <= remainder) {
cards[j].push(i);
remainder -= numbers[j];
}
}
}

In words, the “numbers” array contains the complete sequence in question, the cards array is an array of arrays, where each element in the array represents one card. We then loop over all numbers (the $i$ variable), and for each number we check the sequence in decreasing order to look for membership of $i$ on a card.

For the sake of brevity, we omit all of the code to deal with input and output, which comprises all of the remaining code in this program.

So to use the program, for example with prime numbers as the complete sequence, one would type “1,2,3,5,7,13,17” into the first text box, and whatever the maximum allowable guess is in the second box, and click the “Generate cards” button.

Enjoy!