# Optimism in the Face of Uncertainty: the UCB1 Algorithm

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

## Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of $K$ actions labeled by the integers $\left \{ 1, 2, \dots, K \right \}$. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action $i$ be drawn from a fixed distribution $D_i$ (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm $A$ and a set of actions $\left \{1, 2, \dots, K \right \}$, the cumulative regret of $A$ in rounds $1, \dots, T$ is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of $A$ for the first $T$ rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

## Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers $\left \{ 1, \dots, K \right \}$. The reward of action $i$ is a $[0,1]$-valued random variable $X_i$ distributed according to an unknown distribution and possessing an unknown expected value $\mu_i$. The game progresses in rounds $t = 1, 2, \dots$ so that in each round we have different random variables $X_{i,t}$ for the reward of a single action $i$ in round $t$. The $X_{i,t}$ are independent as both $t$ and $i$ vary, although when $i$ varies the distribution changes.

So if we were to play action 2 over and over for $T$ rounds, then the total payoff would be the random variable $G_2(T) = \sum_{t=1}^T X_{2,t}$. But by independence across rounds and the linearity of expectation, the expected payoff is just $\mu_2 T$. So we can describe the best action as the action with the highest expected payoff. Define

$\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i$

We call the action which achieves the maximum $i^*$.

A policy is a randomized algorithm $A$ which picks an action in each round based on the history of chosen actions and observed rewards so far. Define $I_t$ to be the action played by $A$ in round $t$ and $P_i(n)$ to be the number of times we’ve played action $i$ in rounds $1 \leq t \leq n$. These are both random variables. Then the cumulative payoff for the algorithm $A$ over the first $T$ rounds, denoted $G_A(T)$, is just

$\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}$

and its expected value is simply

$\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T))$.

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy $A$ after the first $T$ steps, denoted $R_A(T)$ can be written as

$\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)$

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

$\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T))$.

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than $\Omega(\sqrt{KT})$. It is also known that there are algorithms that guarantee no worse than $O(\sqrt{KT})$ expected regret. The algorithm we’ll see in the next section, however, only guarantees $O(\sqrt{KT \log T})$. We present it on this blog because of its simplicity and ubiquity in the field.

## The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action $\mu_i$ is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose $Y_1, \dots, Y_n$ are independent random variables whose values lie in $[0,1]$ and whose expected values are $\mu_i$. Call $Y = \frac{1}{n}\sum_{i}Y_i$ and $\mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i$. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of $Y$ deviates from its mean. Specifically,

$\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}$

For us, the $Y_i$ will be the payoff variables for a single action $j$ in the rounds for which we choose action $j$. Then the variable $Y$ is just the empirical average payoff for action $j$ over all the times we’ve tried it. Moreover, $a$ is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for $a$ to find an upper bound big enough to be confident that we’re within $a$ of the true mean.

Indeed, if we call $n_j$ the number of times we played action $j$ thus far, then $n = n_j$ in the equation above, and using $a = a(j,T) = \sqrt{2 \log(T) / n_j}$ we get that $\textup{P}(Y > \mu + a) \leq T^{-4}$, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action $j$, its upper bound $a$ grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound $a(j, T)$ is getting bigger for actions we haven’t played; any round in which we play an action $j$, it must be that $a(j, T+1) = a(j,T)$, although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the $K$ actions once, giving initial values for empirical mean payoffs $\overline{x}_i$ of each action $i$.
For each round $t = K, K+1, \dots$:

Let $n_j$ represent the number of times action $j$ was played so far.
Play the action $j$ maximizing $\overline{x}_j + \sqrt{2 \log t / n_j}$.
Observe the reward $X_{j,t}$ and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means $x_j$ change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with $K$ actions, each of whose reward distribution $X_{i,t}$ has values in [0,1]. Then its expected cumulative regret after $T$ rounds is at most $O(\sqrt{KT \log T})$.

Actually, we’ll prove a more specific theorem. Let $\Delta_i$ be the difference $\mu^* - \mu_i$, where $\mu^*$ is the expected payoff of the best action, and let $\Delta$ be the minimal nonzero $\Delta_i$. That is, $\Delta_i$ represents how suboptimal an action is and $\Delta$ is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret $\mathbb{E}(R_{\textup{UCB1}}(T))$ is at most

$\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )$

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if $\Delta_i$ is small we will require more tries to know that action $i$ is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the $1 + \pi^2 / 3$ part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the $O(\sqrt{KT \log T})$ bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function $R(\Delta_1, \dots, \Delta_K)$ (excluding of course $\Delta^*$), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

$\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$

and it’s zero if and only if for each $i$, $\Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T})$. However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the $\Delta_i = O(\sqrt{\log T})$ (which are all the same) gives a total bound of $O(K \sqrt{\log T})$. If we look at the only possible endpoint (the $\Delta_i = 1$), then we get a local maximum of $O(K \sqrt{\log T})$. But this isn’t the $O(\sqrt{KT \log T})$ we promised, what gives? Well, this upper bound grows arbitrarily large as the $\Delta_i$ go to zero. But at the same time, if all the $\Delta_i$ are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the $\Delta_i = \Delta$ are the same, then another trivial regret bound is $\Delta T$ (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at $\Delta = O(\sqrt{(K \log T) / T})$. Hence the regret bound at the switch is $\Delta T = O(\sqrt{KT \log T})$, as desired.

## Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on $P_i(T)$, the expected number of times UCB chooses an action up to round $T$. Using the $\Delta$ notation, the regret is then just $\sum_i \Delta_i \mathbb{E}(P_i(T))$, and bounding the $P_i$‘s will bound the regret.

Recall the notation for our upper bound $a(j, T) = \sqrt{2 \log T / P_j(T)}$ and let’s loosen it a bit to $a(y, T) = \sqrt{2 \log T / y}$ so that we’re allowed to “pretend” a action has been played $y$ times. Recall further that the random variable $I_t$ has as its value the index of the machine chosen. We denote by $\chi(E)$ the indicator random variable for the event $E$. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., $\overline{x}^*$ is the empirical mean of the optimal action).

Indeed for any action $i$, the only way we know how to write down $P_i(T)$ is as

$\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)$

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action $i$. Now we’re just going to pull some number $m-1$ of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than $m$ times overall, this requires an inequality.

$P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)$

These indicator functions should be read as sentences: we’re just saying that we’re picking action $i$ in round $t$ and we’ve already played $i$ at least $m$ times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action $i$ in round $t$ means that the upper bound for action $i$ exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and $i$ might coincide with the best action, but that’s fine). In notation this event is

$\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)$

Denote the upper bound $\overline{x}_i + a(i,t)$ for action $i$ in round $t$ by $U_i(t)$. Since this event must occur every time we pick action $i$ (though not necessarily vice versa), we have

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)$

We’ll do this process again but with a slightly more complicated event. If the upper bound of action $i$ exceeds that of the optimal machine, it is also the case that the maximum upper bound for action $i$ we’ve seen after the first $m$ trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round $t$ we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine $i$ (except that it’s more than $m$). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by $\overline{x}_{i,s}$ the random variable for the empirical mean after playing action $i$ a total of $s$ times, and $\overline{x}^*_s$ the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )$

Indeed, at each $t$ for which the max is greater than the min, there will be at least one pair $s,s'$ for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs $s, s'$ for which it happens. That is, we can expand the event above into the double sum which is at least as large:

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )$

We can make one other odd inequality by increasing the sum to go from $t=1$ to $\infty$. This will become clear later, but it means we can replace $t-1$ with $t$ and thus have

$\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )$

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that $\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)$. Then what can we say? Well, consider the following three events:

(1) $\displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)$
(2) $\displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)$
(3) $\displaystyle \mu^* < \mu_i + 2a(s, t)$

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability $t^{-4}$. Likewise, (2) is the event that the empirical mean payoff of action $i$ is larger than the upper confidence bound, which also occurs with probability $t^{-4}$. We will see momentarily that (3) is impossible for a well-chosen $m$ (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

$\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}$

and

$\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}$

But putting these two inequalities together gives us precisely that (3) is true:

$\mu^* < \mu_i + 2a(s,t)$

This proves the claim.

By the union bound, the probability that at least one of these events happens is $2t^{-4}$ plus whatever the probability of (3) being true is. But as we said, we’ll pick $m$ to make (3) always false. Indeed $m$ depends on which action $i$ is being played, and if $s \geq m > 8 \log T / \Delta_i^2$ then $2a(s,t) \leq \Delta_i$, and by the definition of $\Delta_i$ we have

$\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0$.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

$\square$

## Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
return math.sqrt(2 * math.log(step + 1) / numPlays)


And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
payoffSums = [0] * numActions
numPlays = [1] * numActions
ucbs = [0] * numActions

# initialize empirical sums
for t in range(numActions):
payoffSums[t] = reward(t,t)
yield t, payoffSums[t], ucbs

t = numActions

while True:
ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
action = max(range(numActions), key=lambda i: ucbs[i])
theReward = reward(action, t)
numPlays[action] += 1
payoffSums[action] += theReward

yield action, theReward, ucbs
t = t + 1


The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from $[0,1]$, biased by $1/k$ for some $5 \leq k \leq 15$. The regret and theoretical regret bound are given in the graph below.

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

## Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

# Seam Carving for Content-Aware Image Scaling

## The Problem with Cropping

Every programmer or graphic designer with some web development experience can attest to the fact that finding good images that have an exactly specified size is a pain. Since the dimensions of the sought picture are usually inflexible, an uncomfortable compromise can come in the form of cropping a large image down to size or scaling the image to have appropriate dimensions.

Both of these solutions are undesirable. In the example below, the caterpillar looks distorted in the scaled versions (top right and bottom left), and in the cropped version (bottom right) it’s more difficult to tell that the caterpillar is on a leaf; we have lost the surrounding context.

In this post we’ll look at a nice heuristic method for rescaling images called seam-carving, which pays attention to the contents of the image as it resacles. In particular, it only removes or adds pixels to the image that the viewer is least-likely to notice. In all but the most extreme cases it will avoid the ugly artifacts introduced by cropping and scaling, and with a bit of additional scaffolding it becomes a very useful addition to a graphic designer’s repertoire. At first we will focus on scaling an image down, and then we will see that the same technique can be used to enlarge an image.

Before we begin, we should motivate the reader with some examples of its use.

It’s clear that the caterpillar is far less distorted in all versions, and even in the harshly rescaled version, parts of the green background are preserved. Although the leaf is warped a little, it is still present, and it’s not obvious that the image was manipulated.

Now that the reader’s appetite has been whet, let’s jump into the mathematics of it. This method was pioneered by Avidan and Shamir, and the impatient reader can jump straight to their paper (which contains many more examples). In this post we hope to fill in the background and show a working implementation.

## Images as Functions

One common way to view an image is as an approximation to a function of two real variables. Suppose we have an $n \times m$-pixel image ($n$ rows and $m$ columns of pixels). For simplicity (during the next few paragraphs), we will also assume that the pixel values of an image are grayscale intensity values between 0 and 255. Then we can imagine the pixel values as known integer values of a function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$. That is, if we take two integers $0 \leq x < n$ and $0 \leq y < m$ then we know the value $f(x,y)$; it’s just the intensity value at the corresponding pixel. For values outside these ranges, we can impose arbitrary values for $I$ (we don’t care what’s happening outside the image).

Moreover, it makes sense to assume that $f$ is a well-behaved function in between the pixels (i.e. it is differentiable). And so we can make reasonable guessed as to the true derivative of $f$ by looking at the differences between adjacent pixels. There are many ways to get a good approximation of the derivative of an image function, but we should pause a moment to realize why this is important to nail down for the purpose of resizing images.

A good rule of thumb with images is that regions of an image which are most important to the viewer are those which contain drastic changes in intensity or color. For instance, consider this portrait of Albert Einstein.

Which parts of this image first catch the eye? The unkempt hair, the wrinkled eyes, the bushy mustache? Certainly not the misty background, or the subtle shadows on his chin.

Indeed, one could even claim that an image having a large derivative at a certain pixel corresponds to high information content there (of course this is not true of all images, but perhaps it’s reasonable to claim this for photographs). And if we want to scale an image down in size, we are interested in eliminating those regions which have the smallest information content. Of course we cannot avoid losing some information: the image after resizing is smaller than the original, and a reasonable algorithm should not add any new information. But we can minimize the damage by intelligently picking which parts to remove; our naive assumption is that a small derivative at a pixel implies a small amount of information.

Of course we can’t just remove “regions” of an image to change its proportions. We have to remove the same number of pixels in each row or column to reduce the corresponding dimension (width or height, resp.). Before we get to that, though, let’s write a program to compute the gradient. For this program and the rest of the post we will use the Processing programming language, and our demonstrations will use the Javascript cross-compiler processing.js. The nice thing about Processing is that if you know Java then you know processing. All the basic language features are the same, and it’s just got an extra few native types and libraries to make graphics rendering and image displaying easier. As usual, all of the code used in this blog post is available on this blog’s Github page.

Let’s compute the gradient of this picture, and call the picture $I$:

A very nice picture whose gradient we can compute. It was taken by the artist Ria Czichotzki.

Since this is a color image, we will call it a function $I: \mathbb{R}^2 \to \mathbb{R}^3$, in the sense that the input is a plane coordinate $(x,y)$, and the output $I(x,y) = (r,g,b)$ is a triple of color intensity values. We will approximate the image’s partial derivative $\left \langle \partial I / \partial x, \partial I / \partial y \right \rangle$ at $(x,y)$ by inspecting values of $I$ in a neighborhood of the point:

$I(x-1,y), I(x+1, y), I(x,y-1), I(x,y+1)$.

For each pixel we call the value $|I(x+1,y) - I(x-1,y)| / 2$ the partial derivative in the $x$ direction, and $|I(x,y+1) - I(x,y-1)| / 2$ the partial in the $y$ direction. Note that the values $I(x,y)$ are vectors, so the norm signs here are really computing the distance between the two values of $I$.

There are two ways to see why this makes sense as an approximation. The first is analytic: by definition, the partial derivative $\partial I / \partial x$ is a limit:

$\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x,y)|}{h}$

It turns out that this limit is equivalent to

$\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x-h,y)|}{2h}$

And the closer $h$ gets to zero the better the approximation of the limit is. Since the closest we can make $h$ is $h=1$ (we don’t know any other values of $I$ with nonzero $h$), we plug in the corresponding values for neighboring pixels. The partial $\partial I / \partial y$ is similar.

The second way to view it is geometric.

The slope of the blue secant line is not a bad approximation to the derivative at x, provided the resolution is fine enough.

The salient fact here is that a nicely-behaved curve at $x$ will have a derivative close to the secant line between the points $(x-1, f(x-1))$ and $(x+1, f(x+1))$. Indeed, this idea inspires the original definition of the derivative. The slope of the secant line is just $(f(x+1) - f(x-1)) / 2$. As we saw in our post on numerical integration, we can do much better than a linear guess (specifically, we can use do any order of polynomial interpolation we wish), but for the purposes of displaying the concept of seam-carving, a linear guess will suffice.

And so with this intuitive understanding of how to approximate the gradient, the algorithm to actually do it is a straightforward loop. Here we compute the horizontal gradient (that is, the derivative $\partial I / \partial x$).

PImage horizontalGradient(PImage img) {
color left, right;
int center;
PImage newImage = createImage(img.width, img.height, RGB);

for (int x = 0; x < img.width; x++) {
for (int y = 0; y < img.height; y++) {
center = x + y*img.width;

left = x == 0 ? img.pixels[center] : img.pixels[(x-1) + y*img.width];
right = x == img.width-1 ? img.pixels[center] : img.pixels[(x+1) + y*img.width];

newImage.pixels[center] = color(colorDistance(left, right));
}
}

return newImage;
}


The details are a bit nit-picky, but the idea is simple. If we’re inspecting a non-edge pixel, then we can use the formula directly and compute the values of the neighboring left and right pixels. Otherwise, the “left” pixel or the “right” pixel will be outside the bounds of the image, and so we replace it with the pixel we’re inspecting. Mathematically, we’d be computing the difference $|I(x, y) - I(x+1, y)|$ and $|I(x-1,y) - I(x, y)|$. Additionally, since we’ll later only be interested in the relative sizes of the gradient, we can ignore the factor of 1/2 in the formula we derived.

The parts of this code that are specific to Processing also deserve some attention. Specifically, we use the built-in types PImage and color, for representing images and colors, respectively. The “createImage” function creates an empty image of the specified size. And peculiarly, the pixels of a PImage are stored as a one-dimensional array. So as we’re iterating through the rows and columns, we must compute the correct location of the sought pixel in the pixel array (this is why we have a variable called “center”). Finally, as in Java, the ternary if notation is used to keep the syntax short, and those two lines simply check for the boundary conditions we stated above.

The last unexplained bit of the above code is the “colorDistance” function. As our image function $I(x,y)$ has triples of numbers as values, we need to compute the distance between two values via the standard distance formula. We have encapsulated this in a separate function. Note that because (in this section of the blog) we are displaying the results in an image, we have to convert to an integer at the end.

int colorDistance(color c1, color c2) {
float r = red(c1) - red(c2);
float g = green(c1) - green(c2);
float b = blue(c1) - blue(c2);
return (int)sqrt(r*r + g*g + b*b);
}


Let’s see this in action on the picture we introduced earlier.

The reader who is interested in comparing the two more closely may visit this interactive page. Note that we only compute the horizontal gradient, so certain locations in the image have a large derivative but are still dark in this image. For instance, the top of the door in the background and the wooden bars supporting the bottom of the chair are dark despite the vertical color variations.

The vertical gradient computation is entirely analogous, and is left as an exercise to the reader.

Since we want to inspect both vertical and horizontal gradients, we will call the total gradient matrix $G$ the matrix whose entries $g_{i,j}$ are the sums of the magnitudes of the horizontal and vertical gradients at $i,j$:

$\displaystyle g_{i,j} = \left | \frac{\partial I}{\partial x} (i,j) \right | + \left | \frac{\partial I}{\partial y} (i,j) \right |$

The function $e(x,y) = g_{x,y}$ is often called an energy function for $I$. We will mention now that there are other energy functions one can consider, and use this energy function for the remainder of this post.

## Seams, and Dynamic Programming

Back to the problem of resizing, we want a way to remove only those regions of an image that have low total gradient across all of the pixels in the region removed. But of course when resizing an image we must maintain the rectangular shape, and so we have to add or remove the same number of pixels in each column or row.

For the purpose of scaling an image down in width (and the other cases are similar), we have a few options. We could find the pixel in each row with minimal total gradient and remove it. More conservatively, we could remove those columns with minimal gradient (as a sum of the total gradient of each pixel in the column). More brashly, we could just remove pixels of lowest gradient willy-nilly from the image, and slide the rows left.

If none of these ideas sound like they would work, it’s because they don’t. We encourage the unpersuaded reader to try out each possibility on a variety of images to see just how poorly they perform. But of these options, removing an entire column happens to distort the image less than the others. Indeed, the idea of a “seam” in an image is just a slight generalization of a column. Intuitively, a seam $s_i$ is a trail of pixels traversing the image from the bottom to the top, and at each step the pixel trail can veer to the right or left by at most one pixel.

Definition: Let $I$ be an $n \times m$ image with nonnegative integer coordinates indexed from zero. A vertical seam in $I$ is a list of coordinates $s_i = (x_i, y_i)$ with the following properties:

• $y_0 = 0$ is at the bottom of the image.
• $y_{n-1} = n-1$ is at the top of the image.
• $y_i$ is strictly increasing.
• $|x_i - x_{i+1}| \leq 1$ for all $0 \leq i < n-1$.

These conditions simply formalize what we mean by a seam. The first and second impose that the seam traverses from top to bottom. The third requires the seam to always “go up,” so that there is only one pixel in each row. The last requires the seam to be “connected” in the sense that it doesn’t veer too far at any given step.

Here are some examples of some vertical seams. One can easily define horizontal seams by swapping the placement of $x, y$ in the above list of conditions.

So the goal is now to remove the seams of lowest total gradient. Here the total gradient of a seam is just the sum of the energy values of the pixels in the seam.

Unfortunately there are many more seams to choose from than columns (or even individual pixels). It might seem difficult at first to find the seam with the minimal total gradient. Luckily, if we’re only interested in minima, we can use dynamic programming to compute the minimal seam ending at any given pixel in linear time.

We point the reader unfamiliar with dynamic programming to our Python primer on this topic. In this case, the sub-problem we’re working with is the minimal total gradient value of all seams from the bottom of the image to a fixed pixel. Let’s call this value $v(a,b)$. If we know $v(a,b)$ for all pixels below, say, row $i$, then we can compute the $v(i+1,b)$ for the entire row $i+1$ by taking pixel $(i+1,j)$, and adding its gradient value to the minimum of the values of possible predecessors in a seam, $v(i,j-1), v(i,j), v(i,j+1)$ (respecting the appropriate boundary conditions).

Once we’ve computed $v(a,b)$ for the entire matrix, we can look at the minimal value at the top of the image $\min_j v(n,j)$, and work backwards down the image to compute which seam gave us this minimum.

Let’s make this concrete and compute the function $v$ as a two-dimensional array called “seamFitness.”

void computeVerticalSeams() {
seamFitness = new float[img.width][img.height];
for (int i = 0; i < img.width; i++) {
}

for (int y = 1; y < img.height; y++) {
for (int x = 0; x < img.width; x++) {

if (x == 0) {
seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x+1][y-1]);
} else if (x == img.width-1) {
seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x-1][y-1]);
} else {
seamFitness[x][y] += min(seamFitness[x-1][y-1], seamFitness[x][y-1], seamFitness[x+1][y-1]);
}
}
}
}


We have two global variables at work here (global is bad, I know, but it’s Processing; it’s made for prototyping). The seamFitness array, and the gradientMagnitude array. We assume at the start of this function that the gradientMagnitude array is filled with sensible values.

Here we first initialize the zero’th row of the seamFitness array to have the same values as the gradient of the image. This is simply because a seam of length 1 has only one gradient value. Note here the coordinates are a bit backwards: the first coordinate represents the choice of a column, and the second represents the choice of a row. We can think of the coordinate axes of our image function having the origin in the bottom-left, the same as we might do mathematically.

Then we iterate over the rows in the matrix, and in each column we compute the fitness based on the fitness of the previous row. That’s it

To actually remove a seam, we need to create a new image of the right size, and shift the pixels to the right (or left) of the image into place. The details are technically important, but tedious to describe fully. So we leave the inspection of the code as an exercise to the reader. We provide the Processing code on this blog’s Github page, and show an example of its use below. Note each the image resizes every time the user clicks within the image.

Photograph by Raphael Goetter.

It’s interesting (and indeed the goal) to see how at first nothing is warped, and then the lines on the walls curve around the woman’s foot, and then finally the woman’s body is distorted before she gets smushed into a tiny box by the oppressive mouse.

As a quick side note, we attempted to provide an interactive version of this Processing program online in the same way we did for the gradient computation example. Processing is quite nice in that any Processing program (which doesn’t use any fancy Java libraries) can be cross-compiled to Javascript via the processing.js library. This is what we did for the gradient example. But in doing so for the (admittedly inefficient and memory-leaky) seam-carving program, it appeared to run an order of magnitude slower in the browser than locally. This was this author’s first time using Processing, so the reason for the drastic jump in runtime is unclear. If any readers are familiar with processing.js, a clarification would be very welcome in the comments.

## Inserting Seams, Removing Objects, and Videos

In addition to removing seams to scale an image down, one can just as easily insert seams to make an image larger. To insert a seam, just double each pixel in the seam and push the rest of the pixels on the row to the right. The process is not hard, but it requires avoiding one pitfall: if we just add a single seam at a time, then the seam with minimum total energy will never change! So we’ll just add the same seam over and over again. Instead, if we want to add $k$ seams, one should compute the minimum $k$ seams and insert them all. If the desired resize is too large, then the programmer should pick an appropriate batch size and add seams in batches.

Another nice technique that comes from the seam-carving algorithm is to intelligently protect or destroy specific regions in the image. To do this requires a minor modification of the gradient computation, but the rest of the algorithm is identical. To protect a region, provide some way of user input specifying which pixels in the image are important, and give those pixels an artificially large gradient value (e.g., the maximum value of an integer). If the down-scaling is not too extreme, the seam computations will be guaranteed not to use any of those pixels, and inserted seams will never repeat those pixels. To remove a region, we just give the desired pixels an arbitrarily low gradient value. Then these pixels will be guaranteed to occur in the minimal seams, and will be removed from the picture.

The technique of seam-carving is a very nice tool, and as we just saw it can be extended to a variety of other techniques. In fact, seam-carving and its applications to object removal and image resizing are implemented in all of the recent versions of Photoshop. The techniques are used to adapt applications to environments with limited screen space, such as a mobile phone or tablet. Seam carving can even be adapted for use in videos. This involves an extension of the dynamic program to work across multiple frames, formally finding a minimal graph cut between two frames so that each piece of the cut is a seam in the corresponding frame. Of course there is a lot more detail to it (and the paper linked above uses this detail to improve the basic image-resizing algorithm), but that’s the rough idea.

We’ve done precious little on this blog with images, but we’d like to get more into graphics programming. There’s a wealth of linear algebra, computational geometry, and artificial intelligence hiding behind most of the computer games we like to play, and it would be fun to dive deeper into these topics. Of course, with every new post this author suggests ten new directions for this blog to go. It’s a curse and a blessing.

Until next time!

# Learning Programming — Finger-Painting and Killing Zombies

Zmob, my first (and only) original game.

By the end, the breadth and depth of our collective knowledge was far beyond what anyone could expect from any high school course in any subject.

## Education Versus Exploration

I’m a lab TA for an introductory Python programming course this semester, and it’s been…depressing. I remember my early days of programming, when the possibilities seemed endless and adding new features to my programs was exciting and gratifying, and I brimmed with pride at every detail, and I boasted to my friends of the amazing things I did, and I felt powerful. The world was literally at my fingertips. I could give substance to any idea I cared to entertain and any facet of life I wanted to explore. I had developed an insatiable thirst for programming that has lasted to this very day.

My younger self, if programming were more noodley.

The ironic thing is that today I look back on the programs I wrote and cringe with unending embarrassment. My old code was the artistic equivalent of a finger-painting made by a kindergartener. Sure, it might look kind of like a horse, but only because the kid has no idea what he’s doing. The programs I wrote were bug-ridden, hard to read, poorly organized, and a veritable spaghetti-slop of logic.

But I can’t deny how much fun I had, and how much I learned. I will describe all of that in more detail below, but first I’d like to contrast my experience with the students I’m teaching this semester. Their labs are no more than mindlessly shuffling around data, and implementing dry, boring functions like Horner’s method for evaluating polynomials. And (amazingly) their projects are even less interesting. I think the biggest difference is that my students don’t have to actually solve any problems by writing programs. And so their experience is boiled down to following directions and pretending to be a computer in order to follow their own program’s logic.

I’m certainly not saying that following directions and simulating a computer in your head aren’t important things to be a good programmer. What I’m saying is that my students get no gratification from their work. Their results are just as dry as the problem, and the majority of the joy I see among them is when they finish a problem and don’t have to think about it anymore (even if their solution is completely wrong).

The course has other problems with it. For instance, the professor teaches the students C paradigms instead of Python paradigms (I don’t think he ever learned the right way to do things in Python), and he confuses them with talk of stack frames and registers and all sorts of irrelevant architectural details. Remember, these students have never programmed before, and some started the course just barely competent with a computer. I didn’t know what a stack frame was until I had three years of programming under my belt (two of those years were the early, experimental years).

All of this has gotten me thinking pretty regularly about how I might teach my own course, if I might ever have one. This post will roughly be an outline of how my own computer science education began. I’ll distill the most important aspects of it: the things that made me want to keep programming and the things that taught me deep ideas in natural contexts.

## My First Time was with Java

My high school (Campolindo High, in Moraga, CA) was blessed with a computer science course. With my early exposure to computers (at 3 years old, by my parents’ accounts), my love of video games, and my basic grasp of HTML, it seemed inevitable that I belonged in the class. In retrospect, it was perhaps the most beneficial course I ever took, followed closely by Honors/AP English, German, and public policy. Not only did it provide me with the aforementioned thirst for programming, but it planted a mathematical seed in my mind that would flourish years later like a giant bean stalk, which I’m still in the process of climbing today.

Right off the bat the course was different. The teacher’s name was Mr. Maters, and by the end of the first week we ceased to have lectures. Mr. Maters showed us barely enough to get a simple program running with input and output, and then we were left to our own devices.

There were roughly two options for getting credit. The first was to follow an outline of exercises and small projects from a book on GUI programming in Java. Most students followed these exercises for the first two months or so, and I did at least until I had made a stupid little pizza shop program that let you order pizzas.

The second option was wide open. The students could do whatever they wanted, and Mr. Maters jokingly told us, “At the end of each quarter I’ll judge your worth, and if I deem you deserve an A, you’ll get an A, but if I deem otherwise… you’ll get an F!”

Of course, Mr. Maters was the nicest guy you ever met. He would calmly sit at his computer in the front of the lab, maintaining a help queue of students with questions. He would quietly and calmly listen to a student’s question, and then shed some insight into their problem. Mr. Maters would get a better idea of a student’s progress by how frequent and how poignant their questions were, and more often than not students who were waiting in the queue would solve their own problems before getting to the front.

Most of the students in the class chose the “wide open” route, and that meant designing games. I’m not talking about good games, mind you, I’m talking about games made by high schoolers. I made the ugliest Whack-a-Mole you ever saw, a lifeless AI for a Battleship game, and a video poker game that featured Mr. Maters’s face on the back of every card (and replacing the faces of the kings, queens, and jacks). For the last one, I even collaborated with other students who made card games by jointly designing the Maters-themed deck. Collaboration would become a bigger theme the second year I took the course (yes, I took the same course twice), but before we get there, there were some other indispensable features I want to mention.

First, the lab room was set up so that Mr. Maters could remotely control any computer in the room from his desk. The program he used was dubbed the reverent name, “Vision,” and the slackers feared its power. Vision allowed Mr. Maters to look at our code while we were asking him questions at the front, and also helped him monitor students’ progress. Second, we were allowed a shared drive on the school’s network so that we could instantly pass files back and forth between lab computers. This had a few direct learning benefits, like sharing code examples, sprites, and sound files we used in our programs. But more importantly it gave a sense of culture to the class. We occasionally had contests where every student in the class submitted a picture of Maters’s face photoshopped into some ridiculous and funny scene (really, MS-Painted into a scene). Recall, this was the early days of internet memes, and naturally we youngsters were at the forefront of it.

Third, we were given “exploration” days every so often, on which we were freed from any obligation to work. We could play games, browse around online, or just sit and talk. More often than not we ended up playing LAN Unreal Tournament, but by the end of my second year I chose to spend those days working on my programs too; my games turned out to be more fun to work on than others were to play.

All of these instilled a sense of relaxation in the course. We weren’t being taught to the midterms or the AP exam (although I did take the AP eventually, and I scored quite well considering I didn’t study at all for it). We weren’t even being told to work on some days. The course was more of a community, and even as we teased Mr. Maters we revered him as a mentor.

As I did, most students who took a first year in the course stuck around for a second year. And that was when the amazing projects started to come.

## Zmob

The second year in the computer science class was all games all the time. Moreover, we started by looking at real-time games, the kinds of side-scrolling platformers we loved to play ourselves (yeah, Super Mario Brothers and Donkey Kong Country). I tried my hand at one, but before long I was lost in figuring out how to make the collisions work. Making the levels and animating the character and making the screen scroll were all challenging but not beyond my reach.

One of my early side-scrollers based on the Starfox series.

But once I got fed up with getting him to jump on blocks, I found a better project: Zmob (short for Zombie Mob). It was inspired by collaboration. I helped a friend nail down how to draw two circles in a special way: one was large and fixed, and the other was smaller, always touching the first, and the line between their two centers went through the position of the mouse. In other words, the smaller circle represented the “orientation” of the pair of circles, and it was always facing toward the mouse. It was a basic bit of trigonometry, but once I figured out how to do it I decided a top-down zombie shooting game would be fun to work on. So I started to it. Here’s the opening screen of an early version (typos and errors are completely preserved):

The intro screen to Zmob 1.0

In the game you control the black circle, and the grey circle is supposed to represent the gun. Zombies (blue circles) are spawned regularly at random positions and they travel at varying speeds directly toward your character. You can run around and if you hold down Shift you can run faster than them for a time. And of course, you shoot them until they die, and the game ends when you die. The number of zombies spawned increases as you go on, and your ammunition is limited (although you can pick up more ammo after you get a certain number of kills) so you will eventually die. To goal is to get a high score.

The game plays more like reverse-shepherding than a shooter, and while it might be hard, I don’t think anyone but me would play it for more than ten minutes at a time.

The important part was that I had a lot of ideas, and I needed to figure out how to make those ideas a reality. I wanted the zombies to not be able to overlap each other. I wanted a gun that poisoned zombies and when a poisoned zombie touched a healthy zombie, the healthy one became poisoned. I wanted all sorts of things to happen, and the solutions naturally became language features of Java that I ended up using.

The poison gun. White zombies are poisoned, while blue zombies are healthy.

For instance, at first I just represented the zombies as circles. There was no information that made any two zombies different, so I could store them as a list of x,y coordinates. Once I wanted to give them a health bar, and give them variable speeds, and poison them, I was forced to design a zombie class, so that I could give each zombie an internal state (poisoned or not, fast or slow, etc.). I followed this up by making a player class, an item class, and a bullet class.

And the bullets turned out to be my favorite part. I wanted every bullet on the screen to be updated just by me calling an “update()” function. It turns out this was the equivalent of making a bullet into an interface which each specialized bullet class inherited from. Already I saw the need and elegance behind object oriented programming, something that was totally lost on me when I made those stupid “Shape” interfaces they have you do in basic tutorials. I was solving a problem I actually needed to solve, and an understanding of inheritance was forever branded into my mind.

And the bullet logic was a joy in itself. The first three guns I made were boring: a pistol, a machine gun, and a shotgun. Each sprayed little black circles in the expected way. I wanted to break out and make a cool gun. I called my first idea the wave beam.

The wave beam: sinusoidal bullets.

The idea behind the wave beam is  that the bullets travel along a sinusoidal curve in the direction they were shot. This left me with a huge difficulty, though: how does one rotate a sine wave by an arbitrary angle? I had x and y coordinates for the bullets, but all of the convoluted formulas I randomly tried using sines and cosines and tangents ended up failing miserably. The best I could get was a sort of awkwardly-stretched sideways sine.

After about a week of trying with no success, I finally went to my statistics teacher at the time (whom I still keep in touch with) and I asked him if he knew of any sort of witchcraft mathemagic that would do what I wanted.

After a moment’s thought, he pulled out a textbook and showed me a page on rotation matrices. To my seventeen-year-old eyes, the formula was as mysterious as an ancient rune:

My particular code ended up looking like:

x += frame*Math.cos(angle) + Math.sin(frame)*Math.sin(angle)
y += frame*Math.sin(angle) + Math.sin(frame)*Math.cos(angle)

When I ran the code, it worked so perfectly I shouted out loud. After my week of struggle and botched attempts to figure this out, this solution was elegant and beautiful and miraculous. After that, I turned to calculus to make jumping look more natural in my Fox side-scroller. I experimented with other matrix operations like shearing and stretching. By the end of that year, I had a better understanding of a “change of basis” (though I didn’t know the words for it) than most of the students I took linear algebra with in college! It was just a different coordinate system for space; there were rotated coordinates, fat and stretchy coordinates, along with skinny and backward coordinates. I tried all sorts of things in search of fun gameplay.

And it wasn’t just mathematics that I learned ahead of my time. By the end of the year I had “finished” the game. I designed a chain gun that set off chain reactions when it hit zombies, I had given it a face lift with new graphics for the player and zombies. I even designed a smart tile-layout system to measure the size of the screen and display the background appropriately. I had gotten tired of measuring the sizes by hand, so I wrote a program to measure it for me. That sounds trivial, but it’s really the heart of problem solving in computer science.

The whole class “beta tested” it, i.e., spent a few days of class just playing it to have fun and find bugs. And they found lots of bugs. Overt ones (divide by zero errors making bullets go crazy) and subtler ones (if you time everything right, the zombies can’t get close enough to hurt you, and just keep bumping into each other).

One pretty important issue that came up was speed. Once I added images, I decided to use a Java library to rotate the images on every frame so they were pointing in the right direction. Now some people say Java is slow, but this part was really slow, especially when it got up to a hundred or more zombies. My solution, it just so happened, was a paradigm in computer science called caching. You pre-compute all of the rotations you’ll ever need ahead of time, and then store them somewhere. In fact, what I really did was called lazy-loading, a slightly more sophisticated technique that involved only storing the computed rotations once they were needed.

And I never learned the name of this technique until I got to a third-year college course in dynamic web programming when we discussed the Hibernate object-relational mapping for databases! Just like with linear algebra, my personalized problems resulted in me reinventing or rediscovering important concepts far earlier than I would have learned them otherwise. I was giving myself a deep understanding of the concepts and what sorts of problems they could solve.  This is distinctly different from the sort of studying that goes on in college: students memorize the name of a concept and what it means, but only the top students get a feel for why it’s important and when to use it.

## An Honest Evaluation in Retrospect

I’ll admit it, I was more dedicated to my work than the average kid. A small portion of the class was only engaged in the silly stuff. Some students didn’t have a goal in mind, or they did but didn’t pursue the issue with my kind of vigor. We didn’t have access to many good examples outside of our own web browsing and the mediocre quality of the books Mr. Maters had on hand. The choice of Java was perhaps a steep learning curve for some, but I think in the end it did more good than harm.

But on the other hand, those of us that did work well got to explore, and absorb the material at our own pace. We got to struggle with problems we actually wanted to solve, leading to new insights. One of my classmates made a chat client and a networked version of Tron, while others made role-playing games, musical applications, encryption algorithms, painting programs, and much more. By the end, the breadth and depth of our collective knowledge was far beyond what anyone could expect from any high school course in any subject. I don’t say that lightly; I spent a lot of time analyzing literature and debating contemporary issues and memorizing German vocabulary and fine-tuning essays and doing biology experiments, but programming was different. It was engaging and artistic and technical and logical and visceral. Moreover, it was a skill that makes me marketable. I could drop out of graduate school today and find a comfortable job as a software engineer in any major city and probably in any industry that makes software. That class was truly what set me on the path to where I am today.

And worst of all, it absolutely breaks my heart  to hear my students say “I didn’t think programming would be like this. I’m just not cut out for it.” The best response I can muster is “Don’t judge programming by this class. It can be fun, truly.”

## What They Need

It’s become woefully clear to me that to keep students interested in programming, they need a couple of things:

1. Instant gratification

My students spend way too much time confused about their code. They need have some way to make a change and see the effects immediately. They need teaching tools designed by Bret Victor (skip ahead to 10:30 in the video to see what I mean, but the whole thing is worth watching). And they need to work on visual programs. Drawing programs and games and music. Programs whose effects they can experience in a non-intellectual way, as opposed to checking whether they’re computing polynomial derivatives correctly.

2. Projects that are relevant, or at least fun.

Just like when I was learning, student need to be able to explore. Let them work on their own projects, and have enough knowledge as a teacher to instruct them when they get stuck (or better yet, brainstorm with them). If everyone having a customized project is out of the question, at least have them work on something relevant. The last two projects in the class I teach were regrettably based on file input/output and matrix sums. Why not let them work on a video game, or a search engine (it might sound complicated, but that’s the introductory course over at udacity), or some drawing/animation, a chat client, solve Sudoku puzzles, or even show them how to get data from Facebook with the Graph API. All of these things can be sufficiently abstracted so that a student at any level can handle it, and each requires the ability to use certain constructs (basic networking for a chat client, matrix work for a sudoku, file I/O in parts of a search engine, etc.). Despite the wealth of interesting things they could have students do, it seems to me that the teachers just don’t want to come up with interesting projects, so they just have their students compute matrix sums over and over and over again.

3. The ability to read others’ code.

This is an integral part of learning. Not only should students’ be able to write code, but they must be able to read foreign code. They have to be able to look at examples and extract the important parts to use in their own original work. They have to be able to collaborate with their classmates, work on a shared project, and brainstorm new ideas while discussing bugs. They have to be able to criticize code as they might criticize a movie or a restaurant. Students should be very opinionated about software, and they should strive to find the right way to do things, openly lampooning pieces of code that are bloated or disorganized (okay, maybe not too harshly, but they should be mentally aware).

These three things lie at the heart of computer science and software development, and all of the other crap (the stack frames and lazy-loading and linux shells) can wait until students are already hooked and eager to learn more. Moreover, it can wait until they have a choice to pursue the area that requires knowledge of the linux shell or web frameworks or networking security or graphics processing. I learned all of the basics and then some without ever touching a linux terminal or knowing what a bit was. I don’t doubt my current students could do the same.

And once students get neck deep in code (after spending a year or two writing spaghetti code programs like I did), they can start to see beauty in the elegant ways one can organize things, and the expressive power one has to write useful programs. In some sense programming is like architecture: a good program has beauty in form and function. That’s the time when they should start thinking about systems programming and networking, because that’s the time when they can weigh the new paradigms against their own. They can criticize and discuss and innovate, or at least appreciate how nice it is and apply the ideas to whatever zombie-related project they might be working on at the time.

I hold the contention that every computer science curriculum should have multiple courses that function as blank canvases, and they should have one early on in the pipeline (if not for the very first course). I think that the reason classes aren’t taught this way is the same reason that mathematics education is what it is: teaching things right is hard work! As sad as it sounds, professors (especially at a research institution) don’t have time to design elaborate projects for their students.

And as long as I’m in the business of teaching, I’ll work to change that. I’ll design courses to be fun, and help my future coworkers who fail to do so. Even in highly structured courses, I’ll give students an open-ended project.

So add that onto my wish list as a high school teacher: next to “Math Soup for the Teenage Soul,” a class called “Finger-paint Programming.” (or “Programming on a Canvas”? “How to Kill Zombies”? Other suggested titles are welcome in the comments :))

# Numerical Integration

## Rectangles, Trapezoids, and Simpson’s

I just wrapped up a semester of calculus TA duties, and I thought it would be fun to revisit the problem of integration from a numerical standpoint. In other words, the goal of this article is to figure out how fast we can approximate the definite integral of a function $f:\mathbb{R} \to \mathbb{R}$. Intuitively, a definite integral is a segment of the area between a curve $f$ and the $x$-axis, where we allow area to be negative when $f(x) < 0$.

If any of my former students are reading this, you should note that we will touch on quite a number of important ideas from future calculus courses, but until you see them in class, the proofs may seem quite mystical.

As usual, the source code for this post is available on this blog’s Github page.

## The Baby Integral

Let’s quickly recall the definition of an integral $\int_a^b f(x) dx$:

Definition: Let $P = \left \{ [a_i, b_i] \right \}_{i = 1}^n$ be a partition of an interval $[a,b]$ into $n$ sub-intervals, let $\Delta x_i = b_i - a_i$ be the length of each interval, and let $\widehat{x_i}$ be a chosen point inside $[a_i,b_i]$ (which we call a tag). Then a Riemann sum of $f$ from $a$ to $b$ is a sum

$\displaystyle R(f, P) = \sum_{i = 1}^n f(\widehat{x_i}) \Delta x_i$.

Geometrically, a Riemann sum approximates the area under the curve $f$ by using sufficiently small rectangles whose heights are determined by the tagged points. The terms in the sum above correspond to the areas of the approximating rectangles.

We note that the intervals in question need not have the same lengths, and the points $\widehat{x_i}$ may be chosen in any haphazard way one wishes. Of course, as we come up with approximations, we will pick the partition and tags very deliberately.

Definition: The integral $\displaystyle \int_a^b f(x) dx$ is the limit of Riemann sums as the maximum length of any sub-interval in the partition goes to zero. In other words, for a fixed partition $P$ let $\delta_P = \max_i(\Delta x_i)$. Then

$\displaystyle \int_a^b f(x) dx = \lim_{\delta_P \to 0}R(f, P)$

Another way to put this definition is that if you have any sequence of partitions $P_k$ so that $\delta_{P_k} \to 0$ as $k \to \infty$, then the integral is just the limit of Riemann sums for this particular sequence of partitions.

Our first and most naive attempt at computing a definite integral is to interpret the definition quite literally. The official name for it is a left Riemann sum. We constrain our partitions $P$ so that each sub-interval has the same length, namely $\Delta x = (b-a)/n$. We choose our tags to be the leftmost points in each interval, so that if we name each interval $[a_i, b_i]$, we have $\widehat{x_i} = a_i$. Then we simply use a large enough value of $n$, and we have a good approximation of the integral.

For this post we used Mathematica (gotta love those animations!), but the code to implement this is quite short in any language:

LeftRiemannSum[f_, n_, a_, b_] :=
Module[{width = (b-a)/n},
N[width * Sum[f[a + i*width], {i, 0, n-1}]]
];

Note that we may factor the constant “width” out of the sum, since here it does not depend on the interval. The only other detail is that Mathematica leaves all expressions as exact numbers, unless they are wrapped within a call to N[ ], which stands for “numerical” output. In most general languages numerical approximations are the default.

The computational complexity should be relatively obvious, as we require “one” computation per interval, and hence $n$ computations for $n$ intervals. (Really, it depends on the cost of evaluating $f(x)$, but for the sake of complexity we can assume computing each term is constant.) And so this algorithm is $O(n)$.

However, we should note that our concern is not necessarily computational complexity, but how fast the sum converges. In other words, we want to know how large $n$ needs to be before we get a certain number of decimal places of accuracy.

For all of our tests and visualizations, we will use the following arbitrarily chosen, but sufficiently complicated function on $[0, \pi]$:

$\displaystyle f(x) = 5 \cos(x) \sin^10(x) + \frac{1}{5} \cos^9(x) \exp{\sqrt{x}}$

Our test function.

For n = 15, we have the following left Riemann sum:

A left Riemann Sum with n = 15

Here’s an animation of the sum as $n \to 100$:

An animation of a left Riemann sum where n goes from 2 to 100

Unfortunately, it seems that on the regions where $f$ is increasing, that portion of the Riemann sum is an underestimate, and where $f$ is decreasing, the Riemann sum is an overestimate. Eventually this error will get small enough, but we note that even for $n = 10,000$, the sum requires almost 7 seconds to compute, and only achieves 3 decimal places of accuracy! From a numerical standpoint, left Riemann sums converge slower than paint dries and are effectively useless. We can certainly do better.

## More Geometric Intuition

Continuing with the geometric ideas, we could conceivably pick a better tag in each sub-interval. Instead of picking the left (or right) endpoint, why not pick the midpoint of each sub-interval? Then the rectangles will be neither overestimates nor underestimates, and hence the sums will be inherently more accurate. The change from a left Riemann sum to a midpoint Riemann sum is trivial enough to be an exercise for the reader (remember, the source code for this post is available on this blog’s Github page). We leave it as such, and turn to more interesting methods.

Instead of finding the area of a rectangle under the curve, let’s use a trapezoid whose endpoints are both on the curve. (Recall the area of a trapezoid, if necessary) We call this a trapezoid sum, and a first attempt at the code is not much different from the left Riemann sum:

TrapezoidSum[f_, n_, a_, b_] :=
Module[{width = (b-a)/n},
N[width * Sum[1/2 (f[a + i*width] + f[a + (i+1)*width]),
{i, 0, n-1}]]
];

Here is a picture for $n = 15$:

A trapezoid sum for n = 15

And an animation as $n \to 100$:

An animation of the trapezoid sum as n goes to 100

The animation hints that this method converges much faster than left Riemann sums, and indeed we note that for $n = 100$, the sum requires a mere .16 seconds, yet achieves the three decimal places of accuracy for which the left Riemann sum required $n = 10,000$. This method appears to be a drastic improvement, and indeed plotting the accuracy of left Riemann sums against trapezoid sums gives a nice indication:

Errors of the left Riemann sums (blue, positive) and the trapezoid sums (red, negative) for increasing values of n.

Now that is quite an improvement!

Going back to the code, the computational complexity is again $O(n)$, but we note that at a micro-efficiency level, we are being a bit wasteful. We call the function $f$ twice for each trapezoid, even when adjacent trapezoids share edges and hence base heights. If a call to $f$ is relatively expensive (and it just so happens that calls to Sin, Cos, Exp are somewhat expensive), then this becomes a significant issue. We leave it as an exercise to the adventurous reader to optimize the above code, so that no superfluous calls to $f$ are made (hint: surround the function $f$ with a cache, so that you can reuse old computations).

Before we move on to our final method for this post, we will take a no-so-short aside to give a proof of how accurate the trapezoid rule is. In fact, we will give an upper bound on the error of trapezoid sums based solely on $n$ and easy properties of $f$ to compute. In it’s raw form, we have the following theorem:

Theorem: Supposing $f: [a,b] \to \mathbb{R}$ is thrice differentiable, let $h = (b-a)/n$, let $a_i$ be $a + (i-1)h$, and let $T_n(f)$ be a trapezoidal approximation of $f$ with $n$ trapezoids, as above. Then

$\displaystyle T_n(f) - \int_{a}^b f(x) dx = \frac{b-a}{n} h^2 f''(c)$ for some $c \in [a,b]$

Proof. Let $\varphi_i : [0,h] \to \mathbb{R}$ be defined by

$\displaystyle \varphi_i(t) = \frac{t}{2}(f(a_i) + f(a_i + t)) - \int_{a_i}^{a_i + t} f(x) dx$

We claim that $\sum \limits_{i=1}^n \varphi_i(h) = T_n(f) - \int_a^b f(x) dx$, and one can see this by simply expanding the sum according to the definition of $\varphi_i$. Now we turn to the question of bounding $\varphi_i$ on $[0,h]$.

We note $\varphi_i(0) = 0$, and by the fundamental theorem of calculus:

$\displaystyle \varphi_i'(t) = \frac{1}{2}(f(a_i) - f(a_i + t)) + \frac{t}{2}f'(a_i + t)$

Furthermore, $\varphi_i'(0) = 0$ as is evident by the above equation, and differentiating again gives us

$\displaystyle \varphi_i''(t) = \frac{t}{2}f''(a_i + t)$

With again $\varphi_i''(0) = 0$.

As $f''$ is continuous, the extreme-value theorem says there exist bounds for $f$ on $[a,b]$. We call the lower one $m$ and the upper $M$, so that

$\displaystyle \frac{1}{2}mt \leq \varphi_i''(t) \leq \frac{1}{2}Mt$

Taking definite integrals twice on $[0,t]$, we get

$\displaystyle \frac{1}{12}mt^3 \leq \varphi_i(t) \leq \frac{1}{12}Mt^3$

Then the sum of all the $\varphi_i(h)$ may be bounded by

$\displaystyle \frac{n}{12}mh^3 \leq \sum \limits_{i=1}^n \varphi_i(h) \leq \frac{n}{12}Mh^3$

The definition of $h$ and some simplification gives

$\displaystyle \frac{b-a}{12}mh^2 \leq \sum \limits_{i=1}^n \varphi_i(h) \leq \frac{(b-a)}{12}Mh^2$

And from here we note that by continuity, $f''(x)$ obtains every value between its bounds $m, M$, so that for some $c \in [a,b], f''(c)$ obtains the value needed to make the middle term equal to $\frac{b-a}{12}h^2 f''(c)$, as desired.

As a corollary, we can bound the magnitude of the error by using the larger of $|m|, |M|$, to obtain a fixed value $B$ such that

$\displaystyle \left | T_n(f) - \int_a^b f(x) dx \right | \leq \frac{(b-a)}{12}h^2 B$

$\square$

So what we’ve found is that the error of our trapezoidal approximation (in absolute value) is proportional to the function $1/n^2$. There is a similar theorem about the bounds of a left Riemann sum, and we leave it as an exercise to the reader to find and prove it (hint: use a similar argument, or look at the Taylor expansion of $f$ at the left endpoint).

## Interpolating Polynomials and Simpson’s Rule

One way to interpret the left Riemann sum is that we estimate the integral by integrating a step function which is close to the actual function. For the trapezoidal rule we estimate the function by piecewise lines, and integrate that. Naturally, our next best approximation would be estimating the function by piecewise quadratic polynomials. To pick the right ones, we should investigate the idea of an interpolating polynomial.

Following the same pattern, given one point there is a unique constant function (degree zero polynomial) passing through that point. Given two points there is a unique line (degree one) which contains both. We might expect that given $n+1$ points in the plane (in general position), there is a unique degree $n$ polynomial passing through them.

For three points, $(x_1, y_1), (x_2, y_2), (x_3, y_3)$ we may concoct a working curve as follows (remember $x$ is the variable here)

$\displaystyle \frac{(x-x_1)(x-x_2)}{(x_3-x_1)(x_3-x_2)}y_3 + \frac{(x-x_1)(x-x_3)}{(x_2-x_1)(x_2-x_3)}y_2 + \frac{(x-x_2)(x-x_3)}{(x_1-x_2)(x_1-x_3)}y_1$

Notice that each of the terms are 2nd degree, and plugging in any one of the three given $x$ values annihilates two of the terms, and gives us precisely the right $y$ value in the remaining term. We may extend this in the obvious way to establish the interpolating polynomial for a given set of points. A proof of uniqueness is quite short, as if $p, q$ are two such interpolating polynomials, then $p-q$ has $n+1$ roots, but is at most degree $n$. It follows from the fundamental theorem of algebra that $p-q$ must be the zero polynomial.

If we wish to approximate $f$ with a number of these quadratics, we can simply integrate the polynomials which interpolate $(a_i, f(a_i)), (a_{i+1}, f(a_{i+1})), (a_{i+2}, f(a_{i+2}))$, and do this for $i = 1, 3, \dots, n-2$. This is called Simpson’s Rule, and it gives the next level of accuracy for numerical integration.

With a bit of algebra, we may write the integrals of the interpolating polynomials in terms of the points themselves. Without loss of generality, assume the three points are centered at 0, i.e. the points are $(- \delta, y_1), (0, y_2), (\delta, y_3)$. This is fine because shifting the function left or right does not change the integral. Then the interpolating polynomial is (as above),

$\displaystyle p(x) = \frac{(x+\delta)(x)}{2 \delta^2}y_3 + \frac{(x+\delta)(x-\delta)}{-\delta^2}y_2 + \frac{(x)(x-\delta)}{2 \delta^2}y_1$.

Integrating this over $[-\delta, \delta]$ and simplifying gives the quantity

$\displaystyle \int_{-\delta}^{\delta} p(x) dx = \frac{\delta}{3}(y_1 + 4y_2 + y_3)$.

And so we may apply this to each pair of sub-intervals (or work with the midpoints of our $n$ sub-intervals), to get our quadratic approximation of the entire interval. Note, though that as we range over all sub-intervals, the two endpoints will be counted twice, so our entire approximation is

$f(a) + 2f(a+\delta) + 4f(a+ 2\delta) + 2f(a + 3\delta) + \dots + 2f(a + (2n-1) \delta) + f(b)$

Translating this into code is as straightforward as one could hope:

SimpsonsSum[f_, n_, a_, b_] :=
Module[{width = (b-a)/n, coefficient},
coefficient[i_?EvenQ] = 2;
coefficient[i_?OddQ] = 4;
N[width/3 * (f[a] + f[b] +
Sum[coefficient[i] * f[a + i*width], {i, 1, n-1}])]
];

As usual, here is a picture for $n = 8$:

Simpson’s Rule for n=8 (4 interpolated quadratics)

And an animation showing the convergence:

An animation showing the convergence of Simpson’s Rule

There is a similar bound for Simpson’s Rule as there was for the trapezoid sums. Here it is:

Theorem: Supposing $f$ is five times differentiable, and let $B$ be the maximum of the values $|f^{(4)}(c)|$ for $c$ in $[a,b]$. Then the magnitude of the difference between the true integral $\int_a^b f(x) dx$ and the Simpson’s Rule approximation with $n$ interpolated polynomials is at most

$\displaystyle \frac{(b-a)^5}{180 n^4}B$

The proof is similar in nature to the proof for trapezoid sums, but requires an annoying amount of detail. It’s quite difficult to find a complete proof of the error estimate for reference. This is probably because it is a special case of a family of Newton-Cotes formulas. We leave the proof as a test of gumption, and provide a reference to a slightly more advanced treatment by Louis A. Talman.

As an easier cop-out, we show a plot of the convergence of Simpson’s Rule versus the trapezoid sums:

The error convergence of Simpson’s Rule (red, above), versus the trapezoid sums (blue, below) for increasing values of n.

Judging by the graph, the improvement from trapezoid sums to Simpson’s rule is about as drastic as the improvement from Riemann sums to trapezoid sums.

## Final Thoughts, and Future Plans

There are a host of other integral approximations which fall in the same category as what we’ve done here. Each is increasingly accurate, but requires a bit more computation in each step, and the constants involved in the error bound are based on larger and larger derivatives. Unfortunately, in practice it may hard to bound large derivatives of an arbitrary function, so confidence in the error bounds for simpler methods might be worth the loss of efficiency for some cases.

Furthermore, we always assumed that the length of each sub-interval was uniform across the entire partition. It stands to reason that some functions are wildly misbehaved in small intervals, but well-behaved elsewhere. Consider, for example, $\sin(1/x)$ on $(0, \pi)$. It logically follows that we would not need small intervals for the sub-intervals close to $\pi$, but we would need increasingly small intervals close to 0. We will investigate such methods next time.

We will also investigate the trials and tribulations of multidimensional integrals. If we require 100 evenly spaced points to get a good approximation of a one-dimensional integral, then we would require $100^2$ evenly spaced points for a two-dimensional integral, and once we start working on interesting problems in 36,000-dimensional space, integrals will require $100^{36,000}$ evenly spaced points, which is far greater than the number of atoms in the universe (i.e., far exceeds the number of bits available in all computers, and hence cannot be computed). We will investigate alternative methods for evaluating higher-dimensional integrals, at least one of which will be based on random sampling.

Before we close, we note that even today the question of how to approximate integrals is considered important research. Within the last twenty years there have been papers generalizing these rules to arbitrary spaces, and significant (“clever”) applications to the biological sciences. Here are two examples: Trapezoidal rules in Sobolev spaces, and  Trapezoidal rule and Simpson’s rule for “rough” continuous functions. Of course, as we alluded to, when dimension goes up integrals become exponentially harder to compute. As our world is increasingly filled with high-dimensional data, rapid methods for approximating  integrals in arbitrary dimension is worth quite a bit of money and fame.

Until next time!