Simulating a Biased Coin with a Fair Coin

This is a guest post by my friend and colleague Adam Lelkes. Adam’s interests are in algebra and theoretical computer science. This gem came up because Adam gave a talk on probabilistic computation in which he discussed this technique.

Problem: simulate a biased coin using a fair coin.

Solution: (in Python)

def biasedCoin(binaryDigitStream, fairCoin):
   for d in binaryDigitStream:
      if fairCoin() != d:
         return d

Discussion: This function takes two arguments, an iterator representing the binary expansion of the intended probability of getting 1 (let us denote it as p) and another function that returns 1 or 0 with equal probability. At first glance this might seem like an overcomplicated way of solving this problem: why can’t the probability be a floating point number?

The point is that p can have infinite precision! Assuming that fairCoin() gives us a perfectly random stream of 1’s and 0’s (independently and with probability 1/2) and we can read each bit of the binary expansion of p, this function returns 1 with probability exactly p even if p is irrational or a fraction with infinite decimal expansion. If we used floating point arithmetic there would be a small chance we get unlucky and exhaust the precision available. We would only get an approximation of the true bias at best.

Now let us explain why this algorithm works. We keep tossing our fair coins to get a sequence of random bits, until one of our random bits is different from the corresponding bit in the binary expansion of p. If we stop after i steps, that means that the first i-1 bits in the two binary sequences were the same, which happens with probability \frac{1}{2^{i-1}}. Given that this happens, in the ith step we will return the ith bit of p; let us denote this bit by p_i. Then the probability of returning 1 is \sum_{i=1}^\infty \frac{p_i}{2^{i-1}}, which is the binary expansion of p.

This algorithm is also efficient. By efficient here we mean that the expected running time is constant. Of course, to show this we need to make some assumption about the computational complexity of calculating the bits of p. If we assume that the bits of p are efficiently computable in the sense that the time required to compute p_i is bounded by a polynomial in i, then this algorithm does run in constant expected time.

Indeed, the expected running time is \sum_{i=0}^\infty \frac{i^n}{2^i}. Showing that this sum is a constant is an easy calculus exercise: using the ratio test we get that

\displaystyle \textup{limsup}_{i \to \infty} \left | \frac{\frac{(i+1)^n}{2^{i+1}}}{\frac{i^n}{2^i}} \right | = \limsup_{i\to\infty} \frac{\left(\frac{i+1}{i}\right)^n}{2} = \frac{1}{2} < 1,

thus the series is convergent.

Now that we proved that our algorithm works, it’s time to try it! Let’s say that we want to simulate a coin which gives “heads” with probability 1/3.
We need to construct our binary digit stream. Since 1/3 is 0.010101… in binary, we could use the following simple generator:

def oneThird():
   while True:
      yield 0
      yield 1

However, we might want to have a more general generator that gives us the binary representation of any number. The following function, which takes a number between 0 and 1 as its argument, does the job:

def binaryDigits(fraction):
   while True:
      fraction *= 2
      yield int(fraction)
      fraction = fraction % 1

We also need a fair coin simulator. For this simulation, let’s just use Python’s built-in pseudo-random number generator:

def fairCoin():
   return random.choice([0,1])

Let us toss our biased coin 10000 times and take the sum. We expect the sum to be around 3333. Indeed, when I tried

>>> sum(biasedCoin(oneThird(), fairCoin) for i in range(10000))
3330

It might be worth noting oneThird() is approximately ten times faster than binaryDigits(fractions.Fraction(1,3)), so when a large number of biased coins is needed, you can hardwire the binary representation of p into the program.

About these ads

The Two-Dimensional Fourier Transform and Digital Watermarking

We’ve studied the Fourier transform quite a bit on this blog: with four primers and the Fast Fourier Transform algorithm under our belt, it’s about time we opened up our eyes to higher dimensions.

Indeed, in the decades since Cooley & Tukey’s landmark paper, the most interesting applications of the discrete Fourier transform have occurred in dimensions greater than 1. But for all our work we haven’t yet discussed what it means to take an “n-dimensional” Fourier transform. Our past toiling and troubling will pay off, though, because the higher Fourier transform and its 1-dimensional cousin are quite similar. Indeed, the shortest way to describe the n-dimensional transform is as the 1-dimensional transform with inner products of vector variables replacing regular products of variables.

In this post we’ll flush out these details. We’ll define the multivariable Fourier transform and it’s discrete partner, implement an algorithm to compute it (FFT-style), and then apply the transform to the problem of digitally watermarking images.

As usual, all the code, images, and examples used in this post are available on this blog’s Github page.

Sweeping Some Details Under the Rug

We spent our first and second primers on Fourier analysis describing the Fourier series in one variable, and taking a limit of the period to get the Fourier transform in one variable. By all accounts, it was a downright mess of notation and symbol manipulation that culminated in the realization that the Fourier series looks a lot like a Riemann sum. So it was in one dimension, it is in arbitrary dimension, but to save our stamina for the applications we’re going to treat the n-dimensional transform differently. We’ll use the 1-dimensional transform as a model, and magically generalize it to operate on a vector-valued variable. Then the reader will take it on faith that we could achieve the same end as a limit of some kind of multidimensional Fourier series (and all that nonsense with Schwarz functions and tempered distributions is left to the analysts), or if not we’ll provide external notes with the full details.

So we start with a real-valued (or complex-valued) function f : \mathbb{R}^n \to \mathbb{R}, and we write the variable as x = (x_1, \dots, x_n), so that we can stick to using the notation f(x). Rather than think of the components of x as “time variables” as we did in the one-dimensional case, we’ll usually think of x as representing physical space. And so the periodic behavior of the function f represents periodicity in space. On the other hand our transformed variables will be “frequency” in space, and this will correspond to a vector variable \xi = (\xi_1, \dots, \xi_n). We’ll come back to what the heck “periodicity in space” means momentarily.

Remember that in one dimension the Fourier transform was defined by

\displaystyle \mathscr{F}f(s) = \int_{-\infty}^\infty e^{-2\pi ist}f(t) dt.

And it’s inverse transform was

\displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^\infty e^{2\pi ist}f(s) ds.

Indeed, with the vector x replacing t and \xi replacing s, we have to figure out how to make an analogous definition. The obvious thing to do is to take the place where st is multiplied and replace it with the inner product of x and \xi, which for this post I’ll write x \cdot \xi (usually I write \left \langle x, \xi \right \rangle). This gives us the n-dimensional transform

\displaystyle \mathscr{F}f(\xi) = \int_{\mathbb{R}^n} e^{-2\pi i x \cdot \xi}f(x) dx,

and its inverse

\displaystyle \mathscr{F}^{-1}g(t) = \int_{\mathbb{R}^n} e^{2\pi i x \cdot \xi}f( \xi ) d \xi

Note that the integral is over all of \mathbb{R}^n. To give a clarifying example, if we are in two dimensions we can write everything out in coordinates: x = (x_1, x_2), \xi = (\xi_1, \xi_2), and the formula for the transform becomes

\displaystyle \mathscr{F}f(\xi_1, \xi_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-2 \pi i (x_1 \xi_1 + x_2 \xi_2)} f(\xi_1, \xi_2) dx_1 dx_2.

Now that’s a nasty integral if I’ve ever seen one. But for our purposes in this post, this will be as nasty as it gets, for we’re primarily concerned with image analysis. So representing things as vectors of arbitrary dimension is more compact, and we don’t lose anything for it.

Periodicity in Space? It’s All Mostly the Same

Because arithmetic with vectors and arithmetic with numbers is so similar, it turns out that most of the properties of the 1-dimensional Fourier transform hold in arbitrary dimension. For example, the duality of the Fourier transform and its inverse holds, because for vectors e^{-2 \pi i x \cdot (-\xi)} = e^{2 \pi i x \cdot \xi}. So just like in on dimension, we have

\mathscr{F}f(-\xi) = \mathscr{F}^{-1}f(\xi)

And again we have correspondences between algebraic operations: convolution in the spatial domain corresponds to convolution in the frequency domain, the spectrum is symmetric about the origin, etc.

At a more geometric level, though, the Fourier transform does the same sort of thing as it did in the one-dimensional case. Again the complex exponentials form the building blocks of any function we want, and performing a Fourier transform on an n-dimensional function decomposes that function into its frequency components. So a function that is perfectly periodic corresponds to a Fourier spectrum that’s perfectly concentrated at a point.

But what the hell, the reader might ask, is ‘periodicity in space’? Since we’re talking about images anyway, the variables we care about (the coordinates of a pixel) are spatial variables. You could, if you were so inclined, have a function of multiple time variables, and to mathematicians a physical interpretation of dimension is just that, an interpretation. But as confusing as it might sound, it’s actually not so hard to understand the Fourier transform when it’s specialized to image analysis. The idea is that complex exponentials e^{\pm 2 \pi i s \cdot \xi} oscillate in the x variable for a fixed \xi (and since \mathscr{F} has \xi as its input, we do want to fix \xi). The brief mathematical analysis goes like this: if we fix \xi then the complex exponential is periodic with magnitudinal peaks along parallel lines spaced out at a distance of 1/ \left \| \xi \right \| apart. In particular any image is a sum of a bunch of these “complex exponential with a fixed \xi” images that look like stripes with varying widths and orientations (what you see here is just the real part of a particular complex exponential).

Any image can be made from a sum of a whole lot of images like this one. This corresponds to a single point in the Fourier spectrum.

Any image can be made from a sum of a whole lot of images like the ones on top. They correspond to single points in the Fourier spectrum (and their symmetries), as on bottom.

What you see on top is an image, and on bottom its Fourier spectrum. That is, each brightly colored pixel corresponds to a point [x_1, x_2] with a large magnitude for that frequency component |\mathscr{F}f[x_1, x_2]|.

It might be a bit surprising that every image can be constructed as a sum of stripey things, but so was it that any sound can be constructed as a sum of sines and cosines. It’s really just a statement about a basis of some vector space of functions. The long version of this story is laid out beautifully in pages 4 – 7 of these notes. The whole set of notes is wonderful, but this section is mathematically tidy and needs no background; the remainder of the notes outline the details about multidimensional Fourier series mentioned earlier, as well as a lot of other things. In higher dimensions the “parallel lines” idea is much the same, but with lines replaced by hyperplanes normal to the given vector.

Discretizing the Transform

Recall that for a continuous function f of one variable, we spent a bit of time figuring out how to find a good discrete approximation of f, how to find a good discrete approximation of the Fourier transform \mathscr{F}f, and how to find a quick way to transition between the two. In brief: f was approximated by a vector of samples (f[0], f[1], \dots, f[N]), reconstructed the original function (which was only correct at the sampled points) and computed the Fourier transform of that, calling it the discrete Fourier transform, or DFT. We got to this definition, using square brackets to denote list indexing (or vector indexing, whatever):

Definition: Let f = (f[1], \dots f[N]) be a vector in \mathbb{R}^N. Then the discrete Fourier transform of f is defined by the vector (\mathscr{F}f[1], \dots, \mathscr{F}f[N]), where

\displaystyle \mathscr{F}f[j] = \sum_{k=0}^{N-1} f[k]e^{-2 \pi i jk/N}

Just as with the one-dimensional case, we can do the same analysis and arrive at a discrete approximation of an n-dimensional function. Instead of a vector it would be an N \times N \times \dots \times N matrix, where there are n terms in the matrix, one for each variable. In two dimensions, this means the discrete approximation of a function is a matrix of samples taken at evenly-spaced intervals in both directions.

Sticking with two dimensions, the Fourier transform is then a linear operator taking matrices to matrices (which is called a tensor if you want to scare people). It has its own representation like the one above, where each term is a double sum. In terms of image analysis, we can imagine that each term in the sum requires us to look at every pixel of the original image

Definition: Let f = (f[s,t]) be a vector in \mathbb{R}^N \times \mathbb{R}^M, where s ranges from 0, \dots, N-1 and t from 0, \dots, M-1. Then the discrete Fourier transform of f is defined by the vector (\mathscr{F}f[s,t]), where each entry is given by

\displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s, t] e^{-2 \pi i (s x_1 / N + t x_2 / M)}

In the one-dimensional case the inverse transform had a sign change in the exponent and an extra 1/N normalization factor. Similarly, in two dimensions the inverse transform has a normalization factor of 1/NM (1 over the total number of samples). Again we use a capital F to denote the transformed version of f. The higher dimensional transforms are analogous: you get n sums, one for each component, and the normalization factor is the inverse of the total number of samples.

\displaystyle \mathscr{F}^{-1}F[x_1, x_2] = \frac{1}{NM} \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s,t] e^{2 \pi i (sx_1 / N + tx_2 / M)}

Unfortunately, the world of the DFT disagrees a lot on the choice of normalization factor. It turns out that all that really matters is that the exponent is negated in the inverse, and that the product of the constant terms on both the transform and its inverse is 1/NM. So some people will normalize both the Fourier transform and its inverse by 1/ \sqrt{NM}. The reason for this is that it makes the transform and its inverse more similar-looking (it’s just that, cosmetic). The choice of normalization isn’t particularly important for us, but beware: non-canonical choices are out there, and they do affect formulas by adding multiplicative constants.

The Fast Fourier Transform, Revisited

Now one might expect that there is another clever algorithm to drastically reduce the runtime of the 2-dimensional DFT, akin to the fast Fourier transform algorithm (FFT). But actually there is almost no additional insight required to understand the “fast” higher dimensional Fourier transform algorithm, because all the work was done for us in the one dimensional case.

All that we do is realize that each of the inner summations is a 1-dimensional DFT. That is, if we write the inner-most sum as a function of two parameters

\displaystyle g(s, x_2) = \sum_{t=0}^{M-1} f(s,t) e^{-2 \pi i (tx_2 / M)}

then the 2-dimensional FFT is simply

\displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} g(s, x_2) e^{-2 \pi i (sx_1/N)}

But now notice, that we can forget that g(s,x_2) was ever a separate, two-dimensional function. Indeed, since it only depends on the x_2 parameter from out of the sum this is precisely the formula for a 1-dimensional DFT! And so if we want to compute the 2-dimensional DFT using the 1-dimensional FFT algorithm, we can compute the matrix of 1-dimensional DFT entries for all choices of s, x_2 by fixing each value of s in turn and running FFT on the resulting “column” of values. If you followed the program from our last FFT post, then the only difficulty is in understanding how the data is shuffled around and which variables are fixed during the computation of the sub-DFT’s.

To remedy the confusion, we give an example. Say we have the following 3×3 matrix whose DFT we want to compute. Remember, these values are the sampled values of a 2-variable function.

\displaystyle \begin{pmatrix} f[0,0] & f[0,1] & f[0,2] \\ f[1,0] & f[1,1] & f[1,2] \\ f[2,0] & f[2,1] & f[2,2] \end{pmatrix}

The first step in the algorithm is to fix a choice of row, s, and compute the DFT of the resulting row. So let’s fix s = 0, and then we have the resulting row

\displaystyle f_0 = (f[0,0], f[0,1], f[0,2])

It’s DFT is computed (intentionally using the same notation as the inner summation above), as

\displaystyle g[0,x_2] = (\mathscr{F}f_0)[x_2] = \sum_{t=0}^{M-1} f_0[t] e^{- 2 \pi i (t x_2 / M)}

Note that f_0[t] = f[s,t] for our fixed choice of s=0. And so if we do this for all N rows (all 3 rows, in this example), we’ll have performed N FFT’s of size M to get a matrix of values

\displaystyle \begin{pmatrix} g[0,0] & g[0,1] & g[0,2] \\ g[1,0] & g[1,1] & g[1,2] \\ g[2,0] & g[2,1] & g[2,2] \end{pmatrix}

Now we want to compute the rest of the 2-dimensional DFT to the end, and it’s easy: now each column consists of the terms in the outermost sum above (since s is the iterating variable). So if we fix a value of x_2, say x_2 = 1, we get the resulting column

\displaystyle g_1 = (g[0, 1], g[1,1], g[2,1])

and computing a DFT on this row gives

\displaystyle \mathscr{F}f[x_1, 1] = \sum_{s=0}^{N-1} g_1[s] e^{-2 \pi i sx_1 / N}.

Expanding the definition of g as a DFT gets us back to the original formula for the 2-dimensional DFT, so we know we did it right. In the end we get a matrix of the computed DFT values for all x_1, x_2.

Let’s analyze the runtime of this algorithm: in the first round of DFT’s we computed N DFT’s of size M, requiring a total of O(N M \log M), since we know FFT takes time O(M \log M) for a list of length M. In the second round we did it the other way around, computing M DFT’s of size N each, giving a total of

O(NM \log M + NM \log N) = O(NM (\log N + \log M)) = O(NM \log (NM))

In other words, if the size of the image is n = NM, then we are achieving an O(n \log n)-time algorithm, which was precisely the speedup that the FFT algorithm gave us for one-dimension. We also know a lower bound on this problem: we can’t do better than NM since we have to look at every pixel at least once. So we know that we’re only a logarithmic factor away from a trivial lower bound. And indeed, all other known DFT algorithms have the same runtime. Without any assumptions on the input data (or any parallelization), nobody knows of a faster algorithm.

Now let’s turn to the code. If we use our FFT algorithm from last time, the pure Python one (read: very slow), then we can implement the 2D Fourier transform in just two lines of Python code. Full disclosure: we left out some numpy stuff in this code for readability. You can view the entire source file on this blog’s Github page.

def fft2d(matrix):
   fftRows = [fft(row) for row in matrix]
   return transpose([fft(row) for row in transpose(fftRows)])

And we can test it on a simple matrix with one nonzero value in it:

A = [[0,0,0,0], [0,1,0,0], [0,0,0,0], [0,0,0,0]]
for row in fft2d(A):
   print(', '.join(['%.3f + %.3fi' % (x.real, x.imag) for x in row]))

The output is (reformatted in LaTeX, obviously):

\displaystyle \begin{pmatrix} 1 & -i & -1 & i \\ -i & -1 & i & 1 \\ -1 & i & 1 & -i \\ i & 1 & -i & -1 \end{pmatrix}

The reader can verify by hand that this is correct (there’s only one nonzero term in the double sum, so it just boils down to figuring out the complex exponential e^{2 \pi i (x_1 + x_2 / 4)}). We leave it as an additional exercise to the reader to implement the inverse transform, as well as to generalize this algorithm to higher dimensional DFTs.

Some Experiments and Animations

As we did with the 1-dimensional FFT, we’re now going to switch to using an industry-strength FFT algorithm for the applications. We’ll be using the numpy library and its “fft2″ function, along with scipy’s ndimage module for image manipulation. Getting all of this set up was a nightmare (thank goodness for people who guide users like me through this stuff, but even then the headache seemed unending!). As usual, all of the code and images used in the making of this post is available on this blog’s Github page.

And so we can start playing with a sample image, a still from one of my favorite television shows:

sherlock

The Fourier transform of this image (after we convert it to grayscale) can be computed in python:

def fourierSpectrumExample(filename):
   A = ndimage.imread(filename, flatten=True)
   unshiftedfft = numpy.fft.fft2(A)
   spectrum = numpy.log10(numpy.absolute(unshiftedfft) + numpy.ones(A.shape))
   misc.imsave("%s-spectrum-unshifted.png" % (filename.split('.')[0]), spectrum)

With the result:

sherlock-spectrum-unshifted

The Fourier spectrum of Sherlock and Watson (and London).

A few notes: we use the ndimage library to load the image and flatten the colors to grayscale. Then, after we compute the spectrum, we shift and take a logarithm. This is because the raw spectrum values are too massive; plotting them without modification makes the image contrast too high.

Something is odd, though, because the brightest regions are on the edges of the image, where we might expect the highest-frequency elements to be. Actually, it turns out that a raw DFT (as computed by numpy, anyhow) is “shifted.” That is, the indices are much like they were in our original FFT post, so that the “center” of the spectrum (the lowest frequency component) is actually in the corner of the image array.

The numpy folks have a special function designed to alleviate this called fftshift. Applying it before we plot the image gives the following spectrum:

sherlock-spectrum

Now that’s more like it. For more details on what’s going on with shifting and how to use the shifting functions, see this matlab thread. (As a side note, the “smudges” in this image are interesting. We wonder what property of the original image contributes to the smudges)

Shifted or unshifted, this image represents the frequency spectrum of the image. In other words, we could take the inverse DFT of each pixel (and its symmetric partner) of this image separately, add them all together, and get back to our original image! We did just that using a different image (one of size 266 x 189, requiring a mere 25137 frequency components), to produce this video of the process:

Many thanks to James Hance for his relentlessly cheerful art (I have a reddish version of this particular masterpiece on my bedroom wall).

For the interested reader, I followed this youtube video’s recommended workflow to make the time-lapsed movie, along with some additional steps to make the videos play side by side. It took quite a while to generate and process the images, and the frames take up a lot of space. So instead of storing all the frames, the interested reader may find the script used to generate the frames on this blog’s Github page (along with all of the rest of the code used in this blog post).

Digital Watermarking

Now we turn to the main application of Fourier transforms to this post, the task of adding an invisible digital watermark to an image. Just in case the reader lives in a cave, a watermark is a security device used to protect the ownership or authenticity of a particular good. Usually they’re used on money to prevent counterfeits, but they’re often applied to high-resolution images on the web to protect copyrights. But perhaps more than just protect existing copyrights, watermarks as they’re used today are ugly, and mostly prevent people from taking the image (paid for or not) in the first place. Here’s an example from a big proponent of ugly watermarks, Shutterstock.com.

stock-photo

Now if you were the business of copyright litigation, you’d make a lot of money by suing people who took your clients’ images without permission. So rather than prevent people from stealing in the first place, you could put in an invisible watermark into all of your images and then crawl the web looking for stolen images with your watermark. It would be easy enough to automate (Google already did most of the work for you, if you just want to use Google’s search by image feature).

Now I’m more on the side of Fair Use For All, so I wouldn’t hope for a company to actually implement this and make using the internet that much scarier of a place. But the idea makes for an interesting thought experiment and blog post. The idea is simply to modify the spectrum of an image by adding in small, artificial frequency components. That is, the watermarked image will look identical to the original image to a human, but the Fourier spectrum will contain suspicious entries that we can extract if we know where to look.

Implementing the watermarking feature is quite easy, so let’s do that first. Let’s work again with James Hance’s fine artwork.

hance-up-sw-bw

Let’s call our image’s pixel matrix A and say we’re working with grayscale images for simplicity (for color, we just do the same thing to all three color channels). Then we can define a watermark matrix W by the following procedure:

  1. Pick a radius r, a length L, a watermark strength \alpha, and a secret key k.
  2. Using k as a seed to a random number generator, define a random binary vector v of length L.
  3. Pick a subset S of the circle of coordinates centered at the image’s center of radius r, chosen or rejected based on the entries of v.
  4. Let W be the matrix of all zeros (of the same dimension as A with 1’s in the entries of S.
  5. Compute the watermarked image as \mathscr{F}^{-1}(\mathscr{F}(A) + \alpha W). That is, compute the DFT of A, add \alpha W to it, and then compute the inverse Fourier transform of the result.

The code for this is simple enough. To create a random vector:

import random
def randomVector(seed, length):
   random.seed(secretKey)
   return [random.choice([0,1]) for _ in range(length)]

To make the watermark (and flush out all of the technical details of how it’s done:

def makeWatermark(imageShape, radius, secretKey, vectorLength=50):
   watermark = numpy.zeros(imageShape)
   center = (int(imageShape[0] / 2) + 1, int(imageShape[1] / 2) + 1)

   vector = randomVector(secretKey, vectorLength)

   x = lambda t: center[0] + int(radius * math.cos(t * 2 * math.pi / vectorLength))
   y = lambda t: center[1] + int(radius * math.sin(t * 2 * math.pi / vectorLength))
   indices = [(x(t), y(t)) for t in range(vectorLength)]

   for i,location in enumerate(indices):
      watermark[location] = vector[i]

   return watermark

We use the usual parameterization of the circle as t \mapsto (\cos(2 \pi t / n), \sin(2 \pi t / n) scaled to the appropriate radius. Here’s what the watermark looks like as a spectrum:

watermark-spectrum

It’s hard to see the individual pixels, so click it to enlarge.

And then applying a given watermark to an image is super simple.

def applyWatermark(imageMatrix, watermarkMatrix, alpha):
   shiftedDFT = fftshift(fft2(imageMatrix))
   watermarkedDFT = shiftedDFT + alpha * watermarkMatrix
   watermarkedImage = ifft2(ifftshift(watermarkedDFT))

   return watermarkedImage

And that’s all there is to it! One might wonder how the choice of \alpha affects the intensity of the watermark, and indeed here we show a few example values of this method applied to Hance’s piece:

Click to enlarge. It appears that it's not until alpha becomes egregiously large that we notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background).

Click to enlarge. The effects are most visible in the rightmost image where alpha = 1,000,000

It appears that it’s not until \alpha becomes egregiously large (over 10,000) that we visibly notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background). But it’s good to keep in mind the range of acceptable values when designing a decoding mechanism.

Indeed, a decoding mechanism is conceptually much messier; it’s the art to the encoding mechanism’s science. This paper details one possible way to do it, which is essentially to scale everything up or down to 512×512 pixels and try circles of every possible radius until you find one (or don’t) which is statistically similar to the your random vector. And note that since we have the secret key we can generate the exact same random vector. So what the author of that paper suggests is to extract each circle of pixels from the Fourier spectrum, treating it as a single vector with first entry at angle 0. Then you do some statistical magic (compute cross-correlation or some other similarity measure) between the extracted pixels and your secret-key-generated random vector. If they’re sufficiently similar, then you’ve found your watermark, and otherwise there’s no watermark present.

The code required to do this only requires a few extra lines that aren’t present in the code we’re already presented in this article (numpy does cross-correlation for you), so we leave it as an exercise to the reader: write a program that determines if an image contains our watermark, and test the algorithm on various \alpha and with modifications of the image like rotation, scaling, cropping, and jpeg compression. Part of the benefit of Fourier-based techniques is the resilience of the spectrum to mild applications of these transformations.

Next time we’ll use the Fourier transform to do other cool things to images, like designing filters and combining images in interesting ways.

Until then!

Optimism in the Face of Uncertainty: the UCB1 Algorithm

startups

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of K actions labeled by the integers \left \{ 1, 2, \dots, K \right \}. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

Exploit away, you lucky ladies.

Exploit away, you lucky ladies.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action i be drawn from a fixed distribution D_i (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm A and a set of actions \left \{1, 2, \dots, K \right \}, the cumulative regret of A in rounds 1, \dots, T is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of A for the first T rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers \left \{ 1, \dots, K \right \}. The reward of action i is a [0,1]-valued random variable X_i distributed according to an unknown distribution and possessing an unknown expected value \mu_i. The game progresses in rounds t = 1, 2, \dots so that in each round we have different random variables X_{i,t} for the reward of action i in round t (in particular, X_{i,t} and X_{i,s} are identically distributed). The X_{i,t} are independent as both t and i vary, although when i varies the distribution changes.

So if we were to play action 2 over and over for T rounds, then the total payoff would be the random variable G_2(T) = \sum_{t=1}^T X_{2,t}. But by independence across rounds and the linearity of expectation, the expected payoff is just \mu_2 T. So we can describe the best action as the action with the highest expected payoff. Define

\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i

We call the action which achieves the maximum i^*.

A policy is a randomized algorithm A which picks an action in each round based on the history of chosen actions and observed rewards so far. Define I_t to be the action played by A in round t and P_i(n) to be the number of times we’ve played action i in rounds 1 \leq t \leq n. These are both random variables. Then the cumulative payoff for the algorithm A over the first T rounds, denoted G_A(T), is just

\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}

and its expected value is simply

\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T)).

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy A after the first T steps, denoted R_A(T) can be written as

\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T)).

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than \Omega(\sqrt{KT}). It is also known that there are algorithms that guarantee no worse than O(\sqrt{KT}) expected regret. The algorithm we’ll see in the next section, however, only guarantees O(\sqrt{KT \log T}). We present it on this blog because of its simplicity and ubiquity in the field.

The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action \mu_i is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose Y_1, \dots, Y_n are independent random variables whose values lie in [0,1] and whose expected values are \mu_i. Call Y = \frac{1}{n}\sum_{i}Y_i and \mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of Y deviates from its mean. Specifically,

\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}

For us, the Y_i will be the payoff variables for a single action j in the rounds for which we choose action j. Then the variable Y is just the empirical average payoff for action j over all the times we’ve tried it. Moreover, a is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for a to find an upper bound big enough to be confident that we’re within a of the true mean.

Indeed, if we call n_j the number of times we played action j thus far, then n = n_j in the equation above, and using a = a(j,T) = \sqrt{2 \log(T) / n_j} we get that \textup{P}(Y > \mu + a) \leq T^{-4}, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action j, its upper bound a grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound a(j, T) is getting bigger for actions we haven’t played; any round in which we play an action j, it must be that a(j, T+1) = a(j,T), although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the K actions once, giving initial values for empirical mean payoffs \overline{x}_i of each action i.
For each round t = K, K+1, \dots:

Let n_j represent the number of times action j was played so far.
Play the action j maximizing \overline{x}_j + \sqrt{2 \log t / n_j}.
Observe the reward X_{j,t} and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means x_j change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with K actions, each of whose reward distribution X_{i,t} has values in [0,1]. Then its expected cumulative regret after T rounds is at most O(\sqrt{KT \log T}).

Actually, we’ll prove a more specific theorem. Let \Delta_i be the difference \mu^* - \mu_i, where \mu^* is the expected payoff of the best action, and let \Delta be the minimal nonzero \Delta_i. That is, \Delta_i represents how suboptimal an action is and \Delta is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret \mathbb{E}(R_{\textup{UCB1}}(T)) is at most

\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if \Delta_i is small we will require more tries to know that action i is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the 1 + \pi^2 / 3 part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the O(\sqrt{KT \log T}) bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function R(\Delta_1, \dots, \Delta_K) (excluding of course \Delta^*), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}

and it’s zero if and only if for each i, \Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T}). However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the \Delta_i = O(\sqrt{\log T}) (which are all the same) gives a total bound of O(K \sqrt{\log T}). If we look at the only possible endpoint (the \Delta_i = 1), then we get a local maximum of O(K \sqrt{\log T}). But this isn’t the O(\sqrt{KT \log T}) we promised, what gives? Well, this upper bound grows arbitrarily large as the \Delta_i go to zero. But at the same time, if all the \Delta_i are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the \Delta_i = \Delta are the same, then another trivial regret bound is \Delta T (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at \Delta = O(\sqrt{(K \log T) / T}). Hence the regret bound at the switch is \Delta T = O(\sqrt{KT \log T}), as desired.

Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on P_i(T), the expected number of times UCB chooses an action up to round T. Using the \Delta notation, the regret is then just \sum_i \Delta_i \mathbb{E}(P_i(T)), and bounding the P_i‘s will bound the regret.

Recall the notation for our upper bound a(j, T) = \sqrt{2 \log T / P_j(T)} and let’s loosen it a bit to a(y, T) = \sqrt{2 \log T / y} so that we’re allowed to “pretend” a action has been played y times. Recall further that the random variable I_t has as its value the index of the machine chosen. We denote by \chi(E) the indicator random variable for the event E. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., \overline{x}^* is the empirical mean of the optimal action).

Indeed for any action i, the only way we know how to write down P_i(T) is as

\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action i. Now we’re just going to pull some number m-1 of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than m times overall, this requires an inequality.

P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)

These indicator functions should be read as sentences: we’re just saying that we’re picking action i in round t and we’ve already played i at least m times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action i in round t means that the upper bound for action i exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and i might coincide with the best action, but that’s fine). In notation this event is

\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)

Denote the upper bound \overline{x}_i + a(i,t) for action i in round t by U_i(t). Since this event must occur every time we pick action i (though not necessarily vice versa), we have

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)

We’ll do this process again but with a slightly more complicated event. If the upper bound of action i exceeds that of the optimal machine, it is also the case that the maximum upper bound for action i we’ve seen after the first m trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round t we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine i (except that it’s more than m). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by \overline{x}_{i,s} the random variable for the empirical mean after playing action i a total of s times, and \overline{x}^*_s the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )

Indeed, at each t for which the max is greater than the min, there will be at least one pair s,s' for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs s, s' for which it happens. That is, we can expand the event above into the double sum which is at least as large:

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )

We can make one other odd inequality by increasing the sum to go from t=1 to \infty. This will become clear later, but it means we can replace t-1 with t and thus have

\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t). Then what can we say? Well, consider the following three events:

(1) \displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)
(2) \displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)
(3) \displaystyle \mu^* < \mu_i + 2a(s, t)

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability t^{-4}. Likewise, (2) is the event that the empirical mean payoff of action i is larger than the upper confidence bound, which also occurs with probability t^{-4}. We will see momentarily that (3) is impossible for a well-chosen m (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}

and

\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}

But putting these two inequalities together gives us precisely that (3) is true:

\mu^* < \mu_i + 2a(s,t)

This proves the claim.

By the union bound, the probability that at least one of these events happens is 2t^{-4} plus whatever the probability of (3) being true is. But as we said, we’ll pick m to make (3) always false. Indeed m depends on which action i is being played, and if s \geq m > 8 \log T / \Delta_i^2 then 2a(s,t) \leq \Delta_i, and by the definition of \Delta_i we have

\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

\square

Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
   return math.sqrt(2 * math.log(step + 1) / numPlays)

And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
   payoffSums = [0] * numActions
   numPlays = [1] * numActions
   ucbs = [0] * numActions

   # initialize empirical sums
   for t in range(numActions):
      payoffSums[t] = reward(t,t)
      yield t, payoffSums[t], ucbs

   t = numActions

   while True:
      ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
      action = max(range(numActions), key=lambda i: ucbs[i])
      theReward = reward(action, t)
      numPlays[action] += 1
      payoffSums[action] += theReward

      yield action, theReward, ucbs
      t = t + 1

The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from [0,1], biased by 1/k for some 5 \leq k \leq 15. The regret and theoretical regret bound are given in the graph below.

ucb1-simple-example

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

Seam Carving for Content-Aware Image Scaling

The Problem with Cropping

Every programmer or graphic designer with some web development experience can attest to the fact that finding good images that have an exactly specified size is a pain. Since the dimensions of the sought picture are usually inflexible, an uncomfortable compromise can come in the form of cropping a large image down to size or scaling the image to have appropriate dimensions.

Both of these solutions are undesirable. In the example below, the caterpillar looks distorted in the scaled versions (top right and bottom left), and in the cropped version (bottom right) it’s more difficult to tell that the caterpillar is on a leaf; we have lost the surrounding context.

scaling-gone-wrong

In this post we’ll look at a nice heuristic method for rescaling images called seam-carving, which pays attention to the contents of the image as it resacles. In particular, it only removes or adds pixels to the image that the viewer is least-likely to notice. In all but the most extreme cases it will avoid the ugly artifacts introduced by cropping and scaling, and with a bit of additional scaffolding it becomes a very useful addition to a graphic designer’s repertoire. At first we will focus on scaling an image down, and then we will see that the same technique can be used to enlarge an image.

Before we begin, we should motivate the reader with some examples of its use.

example-seam-carving

It’s clear that the caterpillar is far less distorted in all versions, and even in the harshly rescaled version, parts of the green background are preserved. Although the leaf is warped a little, it is still present, and it’s not obvious that the image was manipulated.

Now that the reader’s appetite has been whet, let’s jump into the mathematics of it. This method was pioneered by Avidan and Shamir, and the impatient reader can jump straight to their paper (which contains many more examples). In this post we hope to fill in the background and show a working implementation.

Images as Functions

One common way to view an image is as an approximation to a function of two real variables. Suppose we have an n \times m-pixel image (n rows and m columns of pixels). For simplicity (during the next few paragraphs), we will also assume that the pixel values of an image are grayscale intensity values between 0 and 255. Then we can imagine the pixel values as known integer values of a function f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}. That is, if we take two integers 0 \leq x < n and 0 \leq y < m then we know the value f(x,y); it’s just the intensity value at the corresponding pixel. For values outside these ranges, we can impose arbitrary values for I (we don’t care what’s happening outside the image).

Moreover, it makes sense to assume that f is a well-behaved function in between the pixels (i.e. it is differentiable). And so we can make reasonable guessed as to the true derivative of f by looking at the differences between adjacent pixels. There are many ways to get a good approximation of the derivative of an image function, but we should pause a moment to realize why this is important to nail down for the purpose of resizing images.

A good rule of thumb with images is that regions of an image which are most important to the viewer are those which contain drastic changes in intensity or color. For instance, consider this portrait of Albert Einstein.

Which parts of this image first catch the eye? The unkempt hair, the wrinkled eyes, the bushy mustache? Certainly not the misty background, or the subtle shadows on his chin.

Indeed, one could even claim that an image having a large derivative at a certain pixel corresponds to high information content there (of course this is not true of all images, but perhaps it’s reasonable to claim this for photographs). And if we want to scale an image down in size, we are interested in eliminating those regions which have the smallest information content. Of course we cannot avoid losing some information: the image after resizing is smaller than the original, and a reasonable algorithm should not add any new information. But we can minimize the damage by intelligently picking which parts to remove; our naive assumption is that a small derivative at a pixel implies a small amount of information.

Of course we can’t just remove “regions” of an image to change its proportions. We have to remove the same number of pixels in each row or column to reduce the corresponding dimension (width or height, resp.). Before we get to that, though, let’s write a program to compute the gradient. For this program and the rest of the post we will use the Processing programming language, and our demonstrations will use the Javascript cross-compiler processing.js. The nice thing about Processing is that if you know Java then you know processing. All the basic language features are the same, and it’s just got an extra few native types and libraries to make graphics rendering and image displaying easier. As usual, all of the code used in this blog post is available on this blog’s Github page.

Let’s compute the gradient of this picture, and call the picture I:

A very nice picture whose gradient we can compute. It was taken by the artist Ria Czichotzki.

Since this is a color image, we will call it a function I: \mathbb{R}^2 \to \mathbb{R}^3, in the sense that the input is a plane coordinate (x,y), and the output I(x,y) = (r,g,b) is a triple of color intensity values. We will approximate the image’s partial derivative \left \langle \partial I / \partial x, \partial I / \partial y \right \rangle at (x,y) by inspecting values of I in a neighborhood of the point:

I(x-1,y), I(x+1, y), I(x,y-1), I(x,y+1).

For each pixel we call the value |I(x+1,y) - I(x-1,y)| / 2 the partial derivative in the x direction, and |I(x,y+1) - I(x,y-1)| / 2 the partial in the y direction. Note that the values I(x,y) are vectors, so the norm signs here are really computing the distance between the two values of I.

There are two ways to see why this makes sense as an approximation. The first is analytic: by definition, the partial derivative \partial I / \partial x is a limit:

\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x,y)|}{h}

It turns out that this limit is equivalent to

\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x-h,y)|}{2h}

And the closer h gets to zero the better the approximation of the limit is. Since the closest we can make h is h=1 (we don’t know any other values of I with nonzero h), we plug in the corresponding values for neighboring pixels. The partial \partial I / \partial y is similar.

The second way to view it is geometric.

The slope of the blue secant line is not a bad approximation to the derivative at x, provided the resolution is fine enough.

The slope of the blue secant line is not a bad approximation to the derivative at x, provided the resolution is fine enough.

The salient fact here is that a nicely-behaved curve at x will have a derivative close to the secant line between the points (x-1, f(x-1)) and (x+1, f(x+1)). Indeed, this idea inspires the original definition of the derivative. The slope of the secant line is just (f(x+1) - f(x-1)) / 2. As we saw in our post on numerical integration, we can do much better than a linear guess (specifically, we can use do any order of polynomial interpolation we wish), but for the purposes of displaying the concept of seam-carving, a linear guess will suffice.

And so with this intuitive understanding of how to approximate the gradient, the algorithm to actually do it is a straightforward loop. Here we compute the horizontal gradient (that is, the derivative \partial I / \partial x).

PImage horizontalGradient(PImage img) {
   color left, right;
   int center;
   PImage newImage = createImage(img.width, img.height, RGB);

   for (int x = 0; x < img.width; x++) {
      for (int y = 0; y < img.height; y++) {
         center = x + y*img.width;

         left = x == 0 ? img.pixels[center] : img.pixels[(x-1) + y*img.width];
         right = x == img.width-1 ? img.pixels[center] : img.pixels[(x+1) + y*img.width];

         newImage.pixels[center] = color(colorDistance(left, right));
      }
   }

   return newImage;
}

The details are a bit nit-picky, but the idea is simple. If we’re inspecting a non-edge pixel, then we can use the formula directly and compute the values of the neighboring left and right pixels. Otherwise, the “left” pixel or the “right” pixel will be outside the bounds of the image, and so we replace it with the pixel we’re inspecting. Mathematically, we’d be computing the difference |I(x, y) - I(x+1, y)| and |I(x-1,y) - I(x, y)|. Additionally, since we’ll later only be interested in the relative sizes of the gradient, we can ignore the factor of 1/2 in the formula we derived.

The parts of this code that are specific to Processing also deserve some attention. Specifically, we use the built-in types PImage and color, for representing images and colors, respectively. The “createImage” function creates an empty image of the specified size. And peculiarly, the pixels of a PImage are stored as a one-dimensional array. So as we’re iterating through the rows and columns, we must compute the correct location of the sought pixel in the pixel array (this is why we have a variable called “center”). Finally, as in Java, the ternary if notation is used to keep the syntax short, and those two lines simply check for the boundary conditions we stated above.

The last unexplained bit of the above code is the “colorDistance” function. As our image function I(x,y) has triples of numbers as values, we need to compute the distance between two values via the standard distance formula. We have encapsulated this in a separate function. Note that because (in this section of the blog) we are displaying the results in an image, we have to convert to an integer at the end.

int colorDistance(color c1, color c2) {
   float r = red(c1) - red(c2);
   float g = green(c1) - green(c2);
   float b = blue(c1) - blue(c2);
   return (int)sqrt(r*r + g*g + b*b);
}

Let’s see this in action on the picture we introduced earlier.

gradient-girlThe reader who is interested in comparing the two more closely may visit this interactive page. Note that we only compute the horizontal gradient, so certain locations in the image have a large derivative but are still dark in this image. For instance, the top of the door in the background and the wooden bars supporting the bottom of the chair are dark despite the vertical color variations.

The vertical gradient computation is entirely analogous, and is left as an exercise to the reader.

Since we want to inspect both vertical and horizontal gradients, we will call the total gradient matrix G the matrix whose entries g_{i,j} are the sums of the magnitudes of the horizontal and vertical gradients at i,j:

\displaystyle g_{i,j} = \left | \frac{\partial I}{\partial x} (i,j) \right | + \left | \frac{\partial I}{\partial y} (i,j) \right |

The function e(x,y) = g_{x,y} is often called an energy function for I. We will mention now that there are other energy functions one can consider, and use this energy function for the remainder of this post.

Seams, and Dynamic Programming

Back to the problem of resizing, we want a way to remove only those regions of an image that have low total gradient across all of the pixels in the region removed. But of course when resizing an image we must maintain the rectangular shape, and so we have to add or remove the same number of pixels in each column or row.

For the purpose of scaling an image down in width (and the other cases are similar), we have a few options. We could find the pixel in each row with minimal total gradient and remove it. More conservatively, we could remove those columns with minimal gradient (as a sum of the total gradient of each pixel in the column). More brashly, we could just remove pixels of lowest gradient willy-nilly from the image, and slide the rows left.

If none of these ideas sound like they would work, it’s because they don’t. We encourage the unpersuaded reader to try out each possibility on a variety of images to see just how poorly they perform. But of these options, removing an entire column happens to distort the image less than the others. Indeed, the idea of a “seam” in an image is just a slight generalization of a column. Intuitively, a seam s_i is a trail of pixels traversing the image from the bottom to the top, and at each step the pixel trail can veer to the right or left by at most one pixel.

Definition: Let I be an n \times m image with nonnegative integer coordinates indexed from zero. A vertical seam in I is a list of coordinates s_i = (x_i, y_i) with the following properties:

  • y_0 = 0 is at the bottom of the image.
  • y_{n-1} = n-1 is at the top of the image.
  • y_i is strictly increasing.
  • |x_i - x_{i+1}| \leq 1 for all 0 \leq i < n-1.

These conditions simply formalize what we mean by a seam. The first and second impose that the seam traverses from top to bottom. The third requires the seam to always “go up,” so that there is only one pixel in each row. The last requires the seam to be “connected” in the sense that it doesn’t veer too far at any given step.

Here are some examples of some vertical seams. One can easily define horizontal seams by swapping the placement of x, y in the above list of conditions.

glacier_canyon_h_shr_seams

So the goal is now to remove the seams of lowest total gradient. Here the total gradient of a seam is just the sum of the energy values of the pixels in the seam.

Unfortunately there are many more seams to choose from than columns (or even individual pixels). It might seem difficult at first to find the seam with the minimal total gradient. Luckily, if we’re only interested in minima, we can use dynamic programming to compute the minimal seam ending at any given pixel in linear time.

We point the reader unfamiliar with dynamic programming to our Python primer on this topic. In this case, the sub-problem we’re working with is the minimal total gradient value of all seams from the bottom of the image to a fixed pixel. Let’s call this value v(a,b). If we know v(a,b) for all pixels below, say, row i, then we can compute the v(i+1,b) for the entire row i+1 by taking pixel (i+1,j), and adding its gradient value to the minimum of the values of possible predecessors in a seam, v(i,j-1), v(i,j), v(i,j+1) (respecting the appropriate boundary conditions).

Once we’ve computed v(a,b) for the entire matrix, we can look at the minimal value at the top of the image \min_j v(n,j), and work backwards down the image to compute which seam gave us this minimum.

Let’s make this concrete and compute the function v as a two-dimensional array called “seamFitness.”

void computeVerticalSeams() {
   seamFitness = new float[img.width][img.height];
   for (int i = 0; i < img.width; i++) {
      seamFitness[i][0] = gradientMagnitude[i][0];
   }

   for (int y = 1; y < img.height; y++) {
      for (int x = 0; x < img.width; x++) {
         seamFitness[x][y] = gradientMagnitude[x][y];

         if (x == 0) {
            seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x+1][y-1]);
         } else if (x == img.width-1) {
            seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x-1][y-1]);
         } else {
            seamFitness[x][y] += min(seamFitness[x-1][y-1], seamFitness[x][y-1], seamFitness[x+1][y-1]);
         }
      }
   }
}

We have two global variables at work here (global is bad, I know, but it’s Processing; it’s made for prototyping). The seamFitness array, and the gradientMagnitude array. We assume at the start of this function that the gradientMagnitude array is filled with sensible values.

Here we first initialize the zero’th row of the seamFitness array to have the same values as the gradient of the image. This is simply because a seam of length 1 has only one gradient value. Note here the coordinates are a bit backwards: the first coordinate represents the choice of a column, and the second represents the choice of a row. We can think of the coordinate axes of our image function having the origin in the bottom-left, the same as we might do mathematically.

Then we iterate over the rows in the matrix, and in each column we compute the fitness based on the fitness of the previous row. That’s it :)

To actually remove a seam, we need to create a new image of the right size, and shift the pixels to the right (or left) of the image into place. The details are technically important, but tedious to describe fully. So we leave the inspection of the code as an exercise to the reader. We provide the Processing code on this blog’s Github page, and show an example of its use below. Note each the image resizes every time the user clicks within the image.

seam-carving-demo

Photograph by Raphael Goetter.

It’s interesting (and indeed the goal) to see how at first nothing is warped, and then the lines on the walls curve around the woman’s foot, and then finally the woman’s body is distorted before she gets smushed into a tiny box by the oppressive mouse.

As a quick side note, we attempted to provide an interactive version of this Processing program online in the same way we did for the gradient computation example. Processing is quite nice in that any Processing program (which doesn’t use any fancy Java libraries) can be cross-compiled to Javascript via the processing.js library. This is what we did for the gradient example. But in doing so for the (admittedly inefficient and memory-leaky) seam-carving program, it appeared to run an order of magnitude slower in the browser than locally. This was this author’s first time using Processing, so the reason for the drastic jump in runtime is unclear. If any readers are familiar with processing.js, a clarification would be very welcome in the comments.

Inserting Seams, Removing Objects, and Videos

In addition to removing seams to scale an image down, one can just as easily insert seams to make an image larger. To insert a seam, just double each pixel in the seam and push the rest of the pixels on the row to the right. The process is not hard, but it requires avoiding one pitfall: if we just add a single seam at a time, then the seam with minimum total energy will never change! So we’ll just add the same seam over and over again. Instead, if we want to add k seams, one should compute the minimum k seams and insert them all. If the desired resize is too large, then the programmer should pick an appropriate batch size and add seams in batches.

Another nice technique that comes from the seam-carving algorithm is to intelligently protect or destroy specific regions in the image. To do this requires a minor modification of the gradient computation, but the rest of the algorithm is identical. To protect a region, provide some way of user input specifying which pixels in the image are important, and give those pixels an artificially large gradient value (e.g., the maximum value of an integer). If the down-scaling is not too extreme, the seam computations will be guaranteed not to use any of those pixels, and inserted seams will never repeat those pixels. To remove a region, we just give the desired pixels an arbitrarily low gradient value. Then these pixels will be guaranteed to occur in the minimal seams, and will be removed from the picture.

The technique of seam-carving is a very nice tool, and as we just saw it can be extended to a variety of other techniques. In fact, seam-carving and its applications to object removal and image resizing are implemented in all of the recent versions of Photoshop. The techniques are used to adapt applications to environments with limited screen space, such as a mobile phone or tablet. Seam carving can even be adapted for use in videos. This involves an extension of the dynamic program to work across multiple frames, formally finding a minimal graph cut between two frames so that each piece of the cut is a seam in the corresponding frame. Of course there is a lot more detail to it (and the paper linked above uses this detail to improve the basic image-resizing algorithm), but that’s the rough idea.

We’ve done precious little on this blog with images, but we’d like to get more into graphics programming. There’s a wealth of linear algebra, computational geometry, and artificial intelligence hiding behind most of the computer games we like to play, and it would be fun to dive deeper into these topics. Of course, with every new post this author suggests ten new directions for this blog to go. It’s a curse and a blessing.

Until next time!