# The Two-Dimensional Fourier Transform and Digital Watermarking

We’ve studied the Fourier transform quite a bit on this blog: with four primers and the Fast Fourier Transform algorithm under our belt, it’s about time we opened up our eyes to higher dimensions.

Indeed, in the decades since Cooley & Tukey’s landmark paper, the most interesting applications of the discrete Fourier transform have occurred in dimensions greater than 1. But for all our work we haven’t yet discussed what it means to take an “n-dimensional” Fourier transform. Our past toiling and troubling will pay off, though, because the higher Fourier transform and its 1-dimensional cousin are quite similar. Indeed, the shortest way to describe the $n$-dimensional transform is as the 1-dimensional transform with inner products of vector variables replacing regular products of variables.

In this post we’ll flesh out these details. We’ll define the multivariable Fourier transform and its discrete partner, implement an algorithm to compute it (FFT-style), and then apply the transform to the problem of digitally watermarking images.

As usual, all the code, images, and examples used in this post are available on this blog’s Github page.

## Sweeping Some Details Under the Rug

We spent our first and second primers on Fourier analysis describing the Fourier series in one variable, and taking a limit of the period to get the Fourier transform in one variable. By all accounts, it was a downright mess of notation and symbol manipulation that culminated in the realization that the Fourier series looks a lot like a Riemann sum. So it was in one dimension, it is in arbitrary dimension, but to save our stamina for the applications we’re going to treat the $n$-dimensional transform differently. We’ll use the 1-dimensional transform as a model, and magically generalize it to operate on a vector-valued variable. Then the reader will take it on faith that we could achieve the same end as a limit of some kind of multidimensional Fourier series (and all that nonsense with Schwartz functions and tempered distributions is left to the analysts), or if not we’ll provide external notes with the full details.

So we start with a real-valued (or complex-valued) function $f : \mathbb{R}^n \to \mathbb{R}$, and we write the variable as $x = (x_1, \dots, x_n)$, so that we can stick to using the notation $f(x)$. Rather than think of the components of $x$ as “time variables” as we did in the one-dimensional case, we’ll usually think of $x$ as representing physical space. And so the periodic behavior of the function $f$ represents periodicity in space. On the other hand our transformed variables will be “frequency” in space, and this will correspond to a vector variable $\xi = (\xi_1, \dots, \xi_n)$. We’ll come back to what the heck “periodicity in space” means momentarily.

Remember that in one dimension the Fourier transform was defined by

$\displaystyle \mathscr{F}f(s) = \int_{-\infty}^\infty e^{-2\pi ist}f(t) dt$.

And its inverse transform was

$\displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^\infty e^{2\pi ist}g(s) ds$.

Indeed, with the vector $x$ replacing $t$ and $\xi$ replacing $s$, we have to figure out how to make an analogous definition. The obvious thing to do is to take the place where $st$ is multiplied and replace it with the inner product of $x$ and $\xi$, which for this post I’ll write $x \cdot \xi$ (usually I write $\left \langle x, \xi \right \rangle$). This gives us the $n$-dimensional transform

$\displaystyle \mathscr{F}f(\xi) = \int_{\mathbb{R}^n} e^{-2\pi i x \cdot \xi}f(x) dx$,

and its inverse

$\displaystyle \mathscr{F}^{-1}g(x) = \int_{\mathbb{R}^n} e^{2\pi i x \cdot \xi}g( \xi ) d \xi$.

Note that the integral is over all of $\mathbb{R}^n$. To give a clarifying example, if we are in two dimensions we can write everything out in coordinates: $x = (x_1, x_2), \xi = (\xi_1, \xi_2)$, and the formula for the transform becomes

$\displaystyle \mathscr{F}f(\xi_1, \xi_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-2 \pi i (x_1 \xi_1 + x_2 \xi_2)} f(x_1, x_2) dx_1 dx_2$.

Now that’s a nasty integral if I’ve ever seen one. But for our purposes in this post, this will be as nasty as it gets, for we’re primarily concerned with image analysis. So representing things as vectors of arbitrary dimension is more compact, and we don’t lose anything for it.

## Periodicity in Space? It’s All Mostly the Same

Because arithmetic with vectors and arithmetic with numbers are so similar, it turns out that most of the properties of the 1-dimensional Fourier transform hold in arbitrary dimension. For example, the duality of the Fourier transform and its inverse holds, because for vectors $e^{-2 \pi i x \cdot (-\xi)} = e^{2 \pi i x \cdot \xi}$. So just like in one dimension, we have

$\mathscr{F}f(-\xi) = \mathscr{F}^{-1}f(\xi)$

And again we have correspondences between algebraic operations: convolution in the spatial domain corresponds to multiplication in the frequency domain, the spectrum of a real-valued function is symmetric about the origin (up to complex conjugation), etc.
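The convolution correspondence is easy to check numerically in its discrete, two-dimensional form (using the discrete transform defined later in this post). The following is a minimal sketch, assuming numpy is available; `circular_convolve2d` is a throwaway helper written only for this check, and the array sizes are arbitrary.

```python
import numpy as np

def circular_convolve2d(a, b):
    """Direct 2D circular convolution, O((NM)^2); only meant for checking."""
    N, M = a.shape
    out = np.zeros((N, M))
    for x in range(N):
        for y in range(M):
            for s in range(N):
                for t in range(M):
                    # indices wrap around, matching the periodicity of the DFT
                    out[x, y] += a[s, t] * b[(x - s) % N, (y - t) % M]
    return out

rng = np.random.default_rng(4)
a, b = rng.random((4, 5)), rng.random((4, 5))

lhs = np.fft.fft2(circular_convolve2d(a, b))   # transform of the convolution
rhs = np.fft.fft2(a) * np.fft.fft2(b)          # entrywise product of the transforms
print(np.allclose(lhs, rhs))
```

The two sides agree up to floating-point error, which is the discrete version of "convolution in space is multiplication in frequency."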

At a more geometric level, though, the Fourier transform does the same sort of thing as it did in the one-dimensional case. Again the complex exponentials form the building blocks of any function we want, and performing a Fourier transform on an $n$-dimensional function decomposes that function into its frequency components. So a function that is perfectly periodic corresponds to a Fourier spectrum that’s perfectly concentrated at a point.

But what the hell, the reader might ask, is ‘periodicity in space’? Since we’re talking about images anyway, the variables we care about (the coordinates of a pixel) are spatial variables. You could, if you were so inclined, have a function of multiple time variables, and to mathematicians a physical interpretation of dimension is just that, an interpretation. But as confusing as it might sound, it’s actually not so hard to understand the Fourier transform when it’s specialized to image analysis. The idea is that complex exponentials $e^{\pm 2 \pi i x \cdot \xi}$ oscillate in the $x$ variable for a fixed $\xi$ (and since $\mathscr{F}$ has $\xi$ as its input, we do want to fix $\xi$). The brief mathematical analysis goes like this: if we fix $\xi$ then the complex exponential is periodic with magnitudinal peaks along parallel lines spaced out at a distance of $1/ \left \| \xi \right \|$ apart. In particular any image is a sum of a bunch of these “complex exponential with a fixed $\xi$” images that look like stripes with varying widths and orientations (what you see here is just the real part of a particular complex exponential).

Any image can be made from a sum of a whole lot of images like the ones on top. They correspond to single points in the Fourier spectrum (and their symmetries), as on bottom.

What you see on top is an image, and on bottom its Fourier spectrum. That is, each brightly colored pixel corresponds to a point $[x_1, x_2]$ with a large magnitude for that frequency component $|\mathscr{F}f[x_1, x_2]|$.

It might be a bit surprising that every image can be constructed as a sum of stripey things, but so was it that any sound can be constructed as a sum of sines and cosines. It’s really just a statement about a basis of some vector space of functions. The long version of this story is laid out beautifully in pages 4 – 7 of these notes. The whole set of notes is wonderful, but this section is mathematically tidy and needs no background; the remainder of the notes outline the details about multidimensional Fourier series mentioned earlier, as well as a lot of other things. In higher dimensions the “parallel lines” idea is much the same, but with lines replaced by hyperplanes normal to the given vector.
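To make the stripes concrete, here is a small sketch (assuming numpy; the image size and the frequency vector are arbitrary choices for illustration, and `stripe_image` is a hypothetical helper). It builds the real part of a single complex exponential and confirms that its discrete spectrum is concentrated at one point and its mirror image.

```python
import numpy as np

def stripe_image(shape, freq):
    """Real part of the complex exponential with integer frequency vector freq."""
    rows, cols = shape
    s = np.arange(rows).reshape(-1, 1)   # row coordinate
    t = np.arange(cols).reshape(1, -1)   # column coordinate
    return np.cos(2 * np.pi * (freq[0] * s / rows + freq[1] * t / cols))

shape, freq = (64, 48), (3, 5)
stripes = stripe_image(shape, freq)
magnitude = np.abs(np.fft.fft2(stripes))

# Keep only the entries that are not numerically zero.
peaks = sorted((int(a), int(b))
               for a, b in np.argwhere(magnitude > 1e-6 * magnitude.max()))
print(peaks)   # [(3, 5), (61, 43)]
```

The second peak sits at $(64 - 3, 48 - 5)$, the "symmetric partner" of the first; it appears because a cosine is the average of two complex exponentials with opposite frequencies.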

## Discretizing the Transform

Recall that for a continuous function $f$ of one variable, we spent a bit of time figuring out how to find a good discrete approximation of $f$, how to find a good discrete approximation of the Fourier transform $\mathscr{F}f$, and how to find a quick way to transition between the two. In brief: we approximated $f$ by a vector of samples $(f[0], f[1], \dots, f[N-1])$, reconstructed the original function from them (a reconstruction that was only correct at the sampled points), and computed the Fourier transform of that, calling it the discrete Fourier transform, or DFT. We got to this definition, using square brackets to denote list indexing (or vector indexing, whatever):

Definition: Let $f = (f[0], \dots, f[N-1])$ be a vector in $\mathbb{R}^N$. Then the discrete Fourier transform of $f$ is defined by the vector $(\mathscr{F}f[0], \dots, \mathscr{F}f[N-1])$, where

$\displaystyle \mathscr{F}f[j] = \sum_{k=0}^{N-1} f[k]e^{-2 \pi i jk/N}$

Just as with the one-dimensional case, we can do the same analysis and arrive at a discrete approximation of an $n$-dimensional function. Instead of a vector it would be an $N \times N \times \dots \times N$ array, where there are $n$ factors in the product, one for each variable. In two dimensions, this means the discrete approximation of a function is a matrix of samples taken at evenly-spaced intervals in both directions.

Sticking with two dimensions, the Fourier transform is then a linear operator taking matrices to matrices (which is called a tensor if you want to scare people). It has its own representation like the one above, where each term is a double sum. In terms of image analysis, we can imagine that each term in the sum requires us to look at every pixel of the original image.

Definition: Let $f = (f[s,t])$ be a matrix in $\mathbb{R}^{N \times M}$, where $s$ ranges over $0, \dots, N-1$ and $t$ over $0, \dots, M-1$. Then the discrete Fourier transform of $f$ is defined by the matrix $(\mathscr{F}f[x_1,x_2])$ of the same dimensions, where each entry is given by

$\displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s, t] e^{-2 \pi i (s x_1 / N + t x_2 / M)}$

In the one-dimensional case the inverse transform had a sign change in the exponent and an extra $1/N$ normalization factor. Similarly, in two dimensions the inverse transform has a normalization factor of $1/NM$ (1 over the total number of samples). Again we use a capital $F$ to denote the transformed version of $f$. The higher dimensional transforms are analogous: you get $n$ sums, one for each component, and the normalization factor is the inverse of the total number of samples.

$\displaystyle \mathscr{F}^{-1}F[x_1, x_2] = \frac{1}{NM} \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} F[s,t] e^{2 \pi i (sx_1 / N + tx_2 / M)}$

Unfortunately, the world of the DFT disagrees a lot on the choice of normalization factor. It turns out that all that really matters is that the exponent is negated in the inverse, and that the product of the constant terms on both the transform and its inverse is $1/NM$. So some people will normalize both the Fourier transform and its inverse by $1/ \sqrt{NM}$. The reason for this is that it makes the transform and its inverse more similar-looking (it’s just that, cosmetic). The choice of normalization isn’t particularly important for us, but beware: non-canonical choices are out there, and they do affect formulas by adding multiplicative constants.
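As a sanity check on the definition and on the normalization conventions, here is a deliberately slow sketch that evaluates the double sum directly and compares it with numpy. The function name `dft2_naive` and the test array are our own illustration; the `norm="ortho"` option is numpy's name for the symmetric $1/\sqrt{NM}$ convention.

```python
import numpy as np

def dft2_naive(f):
    """2D DFT straight from the double sum; O((NM)^2), only meant for checking."""
    N, M = f.shape
    F = np.zeros((N, M), dtype=complex)
    for x1 in range(N):
        for x2 in range(M):
            for s in range(N):
                for t in range(M):
                    F[x1, x2] += f[s, t] * np.exp(-2j * np.pi * (s * x1 / N + t * x2 / M))
    return F

rng = np.random.default_rng(0)
f = rng.random((4, 6))
F = dft2_naive(f)

# numpy's default convention: no factor on the transform, 1/(NM) on the inverse.
assert np.allclose(F, np.fft.fft2(f))
assert np.allclose(np.fft.ifft2(F), f)

# The symmetric convention puts 1/sqrt(NM) on both the transform and its inverse.
assert np.allclose(np.fft.fft2(f, norm="ortho"), F / np.sqrt(f.size))
```

Either way the round trip returns the original samples; only the intermediate spectrum differs by a constant factor.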

## The Fast Fourier Transform, Revisited

Now one might expect that there is another clever algorithm to drastically reduce the runtime of the 2-dimensional DFT, akin to the fast Fourier transform algorithm (FFT). But actually there is almost no additional insight required to understand the “fast” higher dimensional Fourier transform algorithm, because all the work was done for us in the one dimensional case.

All that we do is realize that each of the inner summations is a 1-dimensional DFT. That is, if we write the inner-most sum as a function of two parameters

$\displaystyle g(s, x_2) = \sum_{t=0}^{M-1} f[s,t] e^{-2 \pi i (tx_2 / M)}$

then the 2-dimensional DFT is simply

$\displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} g(s, x_2) e^{-2 \pi i (sx_1/N)}$

But now notice that we can forget that $g(s,x_2)$ was ever a separate, two-dimensional function. Indeed, for a fixed value of $x_2$ the outer sum is precisely the formula for a 1-dimensional DFT of the list $s \mapsto g(s, x_2)$, and for a fixed value of $s$ the inner sum defining $g$ is a 1-dimensional DFT as well! And so if we want to compute the 2-dimensional DFT using the 1-dimensional FFT algorithm, we can compute the matrix of 1-dimensional DFT entries for all choices of $s, x_2$ by fixing each value of $s$ in turn and running FFT on the resulting row of values. If you followed the program from our last FFT post, then the only difficulty is in understanding how the data is shuffled around and which variables are fixed during the computation of the sub-DFT’s.

To remedy the confusion, we give an example. Say we have the following 3×3 matrix whose DFT we want to compute. Remember, these values are the sampled values of a 2-variable function.

$\displaystyle \begin{pmatrix} f[0,0] & f[0,1] & f[0,2] \\ f[1,0] & f[1,1] & f[1,2] \\ f[2,0] & f[2,1] & f[2,2] \end{pmatrix}$

The first step in the algorithm is to fix a choice of row, $s$, and compute the DFT of the resulting row. So let’s fix $s = 0$, and then we have the resulting row

$\displaystyle f_0 = (f[0,0], f[0,1], f[0,2])$

Its DFT is computed (intentionally using the same notation as the inner summation above) as

$\displaystyle g[0,x_2] = (\mathscr{F}f_0)[x_2] = \sum_{t=0}^{M-1} f_0[t] e^{- 2 \pi i (t x_2 / M)}$

Note that $f_0[t] = f[s,t]$ for our fixed choice of $s=0$. And so if we do this for all $N$ rows (all 3 rows, in this example), we’ll have performed $N$ FFT’s of size $M$ to get a matrix of values

$\displaystyle \begin{pmatrix} g[0,0] & g[0,1] & g[0,2] \\ g[1,0] & g[1,1] & g[1,2] \\ g[2,0] & g[2,1] & g[2,2] \end{pmatrix}$

Now we want to compute the rest of the 2-dimensional DFT to the end, and it’s easy: now each column consists of the terms in the outermost sum above (since $s$ is the iterating variable). So if we fix a value of $x_2$, say $x_2 = 1$, we get the resulting column

$\displaystyle g_1 = (g[0, 1], g[1,1], g[2,1])$

and computing a DFT on this column gives

$\displaystyle \mathscr{F}f[x_1, 1] = \sum_{s=0}^{N-1} g_1[s] e^{-2 \pi i sx_1 / N}$.

Expanding the definition of $g$ as a DFT gets us back to the original formula for the 2-dimensional DFT, so we know we did it right. In the end we get a matrix of the computed DFT values for all $x_1, x_2$.
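The two rounds can be written compactly with numpy's 1-dimensional FFT applied along one axis at a time. This is only a sketch to confirm the bookkeeping (the function name is ours, and numpy stands in for the pure Python FFT of the previous post):

```python
import numpy as np

def fft2_rows_then_columns(f):
    g = np.fft.fft(f, axis=1)      # round one: N transforms of length M, one per row s
    return np.fft.fft(g, axis=0)   # round two: M transforms of length N, one per column x2

rng = np.random.default_rng(1)
f = rng.random((3, 5))
result = fft2_rows_then_columns(f)
print(np.allclose(result, np.fft.fft2(f)))
```

Swapping the order of the two rounds gives the same answer, since the double sum can be evaluated in either order.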

Let’s analyze the runtime of this algorithm: in the first round of DFT’s we computed $N$ DFT’s of size $M$, requiring a total of $O(N M \log M)$ time, since we know FFT takes time $O(M \log M)$ for a list of length $M$. In the second round we did it the other way around, computing $M$ DFT’s of size $N$ each, giving a total of

$O(NM \log M + NM \log N) = O(NM (\log N + \log M)) = O(NM \log (NM))$

In other words, if the size of the image is $n = NM$, then we are achieving an $O(n \log n)$-time algorithm, which was precisely the speedup that the FFT algorithm gave us for one-dimension. We also know a lower bound on this problem: we can’t do better than $NM$ since we have to look at every pixel at least once. So we know that we’re only a logarithmic factor away from a trivial lower bound. And indeed, all other known DFT algorithms have the same runtime. Without any assumptions on the input data (or any parallelization), nobody knows of a faster algorithm.

Now let’s turn to the code. If we use our FFT algorithm from last time, the pure Python one (read: very slow), then we can implement the 2D Fourier transform in just two lines of Python code. Full disclosure: we left out some numpy stuff in this code for readability. You can view the entire source file on this blog’s Github page.

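def fft2d(matrix):
    # fft is the 1-dimensional FFT from the previous post; transpose can be numpy.transpose
    fftRows = [fft(row) for row in matrix]                       # 1D FFT of every row
    return transpose([fft(row) for row in transpose(fftRows)])   # then of every column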


And we can test it on a simple matrix with one nonzero value in it:

A = [[0,0,0,0], [0,1,0,0], [0,0,0,0], [0,0,0,0]]
for row in fft2d(A):
    print(', '.join(['%.3f + %.3fi' % (x.real, x.imag) for x in row]))


The output is (reformatted in LaTeX, obviously):

$\displaystyle \begin{pmatrix} 1 & -i & -1 & i \\ -i & -1 & i & 1 \\ -1 & i & 1 & -i \\ i & 1 & -i & -1 \end{pmatrix}$

The reader can verify by hand that this is correct (there’s only one nonzero term in the double sum, so it just boils down to figuring out the complex exponential $e^{-2 \pi i (x_1 + x_2) / 4}$). We leave it as an additional exercise to the reader to implement the inverse transform, as well as to generalize this algorithm to higher dimensional DFTs.
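For readers who would rather not verify by hand, the same check can be scripted. This sketch uses numpy's `fft2` in place of our pure Python `fft2d`, and compares against the single surviving term of the double sum, which for the nonzero entry at $(s, t) = (1, 1)$ is $e^{-2 \pi i (x_1 + x_2)/4} = (-i)^{x_1 + x_2}$.

```python
import numpy as np

A = np.zeros((4, 4))
A[1, 1] = 1

x1, x2 = np.indices(A.shape)                      # row and column index of every entry
expected = np.exp(-2j * np.pi * (x1 + x2) / 4)    # equals (-i)^(x1 + x2)

print(np.allclose(np.fft.fft2(A), expected))
print(np.round(expected[0], 3))                   # first row: 1, -i, -1, i
```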

## Some Experiments and Animations

As we did with the 1-dimensional FFT, we’re now going to switch to using an industry-strength FFT algorithm for the applications. We’ll be using the numpy library and its “fft2” function, along with scipy’s ndimage module for image manipulation. Getting all of this set up was a nightmare (thank goodness for people who guide users like me through this stuff, but even then the headache seemed unending!). As usual, all of the code and images used in the making of this post are available on this blog’s Github page.

And so we can start playing with a sample image, a still from one of my favorite television shows:

The Fourier transform of this image (after we convert it to grayscale) can be computed in python:

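import numpy
from scipy import ndimage, misc   # imread and imsave exist only in older SciPy releases
def fourierSpectrumExample(filename):
    A = ndimage.imread(filename, flatten=True)   # load the image and flatten it to grayscale
    unshiftedfft = numpy.fft.fft2(A)
    spectrum = numpy.log10(numpy.absolute(unshiftedfft) + numpy.ones(A.shape))   # add 1 so log is defined at 0
    misc.imsave("%s-spectrum-unshifted.png" % (filename.split('.')[0]), spectrum)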


With the result:

The Fourier spectrum of Sherlock and Watson (and London).

A few notes: we use the ndimage library to load the image and flatten the colors to grayscale. Then, after we compute the spectrum, we add 1 to every magnitude (so that the logarithm is defined even where the magnitude is zero) and take a logarithm. This is because the raw spectrum values span too many orders of magnitude; plotting them without modification makes the image contrast too high.

Something is odd, though, because the brightest regions are on the edges of the image, where we might expect the highest-frequency elements to be. Actually, it turns out that a raw DFT (as computed by numpy, anyhow) is “shifted.” That is, the indices are much like they were in our original FFT post, so that the “center” of the spectrum (the lowest frequency component) is actually in the corner of the image array.

The numpy folks have a special function designed to alleviate this called fftshift. Applying it before we plot the image gives the following spectrum:

Now that’s more like it. For more details on what’s going on with shifting and how to use the shifting functions, see this matlab thread. (As a side note, the “smudges” in this image are interesting. We wonder what property of the original image contributes to the smudges.)
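A tiny example shows what fftshift does to the indices. Here we take a constant image, whose only frequency component is the zero frequency, and look at where that single nonzero entry lands before and after shifting (the array size is arbitrary).

```python
import numpy as np

image = np.ones((4, 6))                    # constant image: only the zero frequency is present
spectrum = np.abs(np.fft.fft2(image))
shifted = np.fft.fftshift(spectrum)

print(np.argwhere(spectrum > 0.5).tolist())   # [[0, 0]]: zero frequency in the corner
print(np.argwhere(shifted > 0.5).tolist())    # [[2, 3]]: zero frequency at the center

# ifftshift undoes the shift, which matters before calling the inverse transform
print(np.array_equal(np.fft.ifftshift(shifted), spectrum))
```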

Shifted or unshifted, this image represents the frequency spectrum of the image. In other words, we could take the inverse DFT of each pixel (and its symmetric partner) of this image separately, add them all together, and get back to our original image! We did just that using a different image (one of size 266 x 189, requiring a mere 25137 frequency components), to produce this video of the process:

Many thanks to James Hance for his relentlessly cheerful art (I have a reddish version of this particular masterpiece on my bedroom wall).

For the interested reader, I followed this youtube video’s recommended workflow to make the time-lapsed movie, along with some additional steps to make the videos play side by side. It took quite a while to generate and process the images, and the frames take up a lot of space. So instead of storing all the frames, the interested reader may find the script used to generate the frames on this blog’s Github page (along with all of the rest of the code used in this blog post).
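The claim behind the video, that adding up the inverse DFTs of individual frequency components (each taken together with its symmetric partner) recovers the image, can be checked on a tiny array. This is a minimal sketch with a random real-valued “image” standing in for the artwork; it is not the script used to generate the frames.

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.random((4, 5))
F = np.fft.fft2(image)
N, M = image.shape

reconstruction = np.zeros((N, M))
seen = np.zeros((N, M), dtype=bool)
for x1 in range(N):
    for x2 in range(M):
        if seen[x1, x2]:
            continue
        p1, p2 = (-x1) % N, (-x2) % M        # symmetric partner of (x1, x2)
        single = np.zeros((N, M), dtype=complex)
        single[x1, x2] = F[x1, x2]
        single[p1, p2] = F[p1, p2]
        seen[x1, x2] = seen[p1, p2] = True
        stripes = np.fft.ifft2(single)       # one "stripey" image
        assert np.allclose(stripes.imag, 0)  # real, because the pair is conjugate-symmetric
        reconstruction += stripes.real

print(np.allclose(reconstruction, image))
```

Each pass through the loop adds one striped image to the running total; this works because the inverse transform is linear.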

## Digital Watermarking

Now we turn to the main application of Fourier transforms in this post, the task of adding an invisible digital watermark to an image. Just in case the reader lives in a cave, a watermark is a security device used to protect the ownership or authenticity of a particular good. Usually they’re used on money to prevent counterfeits, but they’re often applied to high-resolution images on the web to protect copyrights. But perhaps more than just protect existing copyrights, watermarks as they’re used today are ugly, and mostly prevent people from taking the image (paid for or not) in the first place. Here’s an example from a big proponent of ugly watermarks, Shutterstock.com.

Now if you were in the business of copyright litigation, you’d make a lot of money by suing people who took your clients’ images without permission. So rather than prevent people from stealing in the first place, you could put an invisible watermark into all of your images and then crawl the web looking for stolen images with your watermark. It would be easy enough to automate (Google already did most of the work for you, if you just want to use Google’s search by image feature).

Now I’m more on the side of Fair Use For All, so I wouldn’t hope for a company to actually implement this and make using the internet that much scarier of a place. But the idea makes for an interesting thought experiment and blog post. The idea is simply to modify the spectrum of an image by adding in small, artificial frequency components. That is, the watermarked image will look identical to the original image to a human, but the Fourier spectrum will contain suspicious entries that we can extract if we know where to look.

Implementing the watermarking feature is quite easy, so let’s do that first. Let’s work again with James Hance’s fine artwork.

Let’s call our image’s pixel matrix $A$ and say we’re working with grayscale images for simplicity (for color, we just do the same thing to all three color channels). Then we can define a watermark matrix $W$ by the following procedure:

1. Pick a radius $r$, a length $L$, a watermark strength $\alpha$, and a secret key $k$.
2. Using $k$ as a seed to a random number generator, define a random binary vector $v$ of length $L$.
3. Pick a subset $S$ of the circle of coordinates centered at the image’s center of radius $r$, chosen or rejected based on the entries of $v$.
4. Let $W$ be the matrix of all zeros (of the same dimension as $A$) with 1’s in the entries of $S$.
5. Compute the watermarked image as $\mathscr{F}^{-1}(\mathscr{F}(A) + \alpha W)$. That is, compute the DFT of $A$, add $\alpha W$ to it, and then compute the inverse Fourier transform of the result.

The code for this is simple enough. To create a random vector:

import random
def randomVector(seed, length):
    random.seed(seed)   # the secret key is passed in as the seed
    return [random.choice([0,1]) for _ in range(length)]


To make the watermark (and flesh out all of the technical details of how it’s done):

import math
import numpy

def makeWatermark(imageShape, radius, secretKey, vectorLength=50):
    watermark = numpy.zeros(imageShape)
    center = (int(imageShape[0] / 2) + 1, int(imageShape[1] / 2) + 1)

    vector = randomVector(secretKey, vectorLength)

    # vectorLength evenly spaced points on the circle of the given radius
    x = lambda t: center[0] + int(radius * math.cos(t * 2 * math.pi / vectorLength))
    y = lambda t: center[1] + int(radius * math.sin(t * 2 * math.pi / vectorLength))
    indices = [(x(t), y(t)) for t in range(vectorLength)]

    # write one entry of the random vector at each point
    for i,location in enumerate(indices):
        watermark[location] = vector[i]

    return watermark


We use the usual parameterization of the circle as $t \mapsto (\cos(2 \pi t / n), \sin(2 \pi t / n))$ scaled to the appropriate radius. Here’s what the watermark looks like as a spectrum:

It’s hard to see the individual pixels, so click it to enlarge.

And then applying a given watermark to an image is super simple.

from numpy.fft import fft2, ifft2, fftshift, ifftshift

def applyWatermark(imageMatrix, watermarkMatrix, alpha):
    shiftedDFT = fftshift(fft2(imageMatrix))
    watermarkedDFT = shiftedDFT + alpha * watermarkMatrix   # add the scaled watermark to the spectrum
    watermarkedImage = ifft2(ifftshift(watermarkedDFT))

    return watermarkedImage


And that’s all there is to it! One might wonder how the choice of $\alpha$ affects the intensity of the watermark, and indeed here we show a few example values of this method applied to Hance’s piece:

Click to enlarge. The effects are most visible in the rightmost image where alpha = 1,000,000

It appears that it’s not until $\alpha$ becomes egregiously large (over 10,000) that we visibly notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background). But it’s good to keep in mind the range of acceptable values when designing a decoding mechanism.

Indeed, a decoding mechanism is conceptually much messier; it’s the art to the encoding mechanism’s science. This paper details one possible way to do it, which is essentially to scale everything up or down to 512×512 pixels and try circles of every possible radius until you find one (or don’t) which is statistically similar to your random vector. And note that since we have the secret key we can generate the exact same random vector. So what the author of that paper suggests is to extract each circle of pixels from the Fourier spectrum, treating it as a single vector with first entry at angle 0. Then you do some statistical magic (compute cross-correlation or some other similarity measure) between the extracted pixels and your secret-key-generated random vector. If they’re sufficiently similar, then you’ve found your watermark, and otherwise there’s no watermark present.

The code required to do this only requires a few extra lines that aren’t present in the code we’ve already presented in this article (numpy does cross-correlation for you), so we leave it as an exercise to the reader: write a program that determines if an image contains our watermark, and test the algorithm on various $\alpha$ and with modifications of the image like rotation, scaling, cropping, and jpeg compression. Part of the benefit of Fourier-based techniques is the resilience of the spectrum to mild applications of these transformations.
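As a starting point for that exercise, here is a heavily simplified sketch of the correlation step alone. It is not the procedure from the paper: it assumes the radius is already known, skips the rescaling to 512×512, keeps the complex-valued image that the inverse transform returns, and uses a random array in place of a real image with an exaggerated strength so the toy example is unambiguous. All of the helper names (`circle_indices`, `key_vector`, `embed`, `watermark_score`) are hypothetical and self-contained rather than taken from the code above.

```python
import math
import random
import numpy as np

def circle_indices(shape, radius, length):
    """`length` evenly spaced points on a circle around the center of the array."""
    c0, c1 = shape[0] // 2, shape[1] // 2
    return [(c0 + int(radius * math.cos(2 * math.pi * t / length)),
             c1 + int(radius * math.sin(2 * math.pi * t / length)))
            for t in range(length)]

def key_vector(secret_key, length):
    rng = random.Random(secret_key)          # private generator seeded by the key
    return [rng.choice([0, 1]) for _ in range(length)]

def embed(image, radius, secret_key, alpha, length=50):
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    points = circle_indices(image.shape, radius, length)
    for bit, index in zip(key_vector(secret_key, length), points):
        spectrum[index] += alpha * bit
    return np.fft.ifft2(np.fft.ifftshift(spectrum))   # complex-valued result

def watermark_score(image, radius, secret_key, length=50):
    """Correlation between the key vector and the spectrum magnitudes on the circle."""
    magnitudes = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    ring = [magnitudes[index] for index in circle_indices(image.shape, radius, length)]
    return float(np.corrcoef(ring, key_vector(secret_key, length))[0, 1])

rng = np.random.default_rng(3)
original = rng.random((128, 128))
marked = embed(original, radius=30, secret_key=57, alpha=5000.0)

print(watermark_score(marked, 30, 57))     # expected to be close to 1
print(watermark_score(original, 30, 57))   # expected to be much smaller in magnitude
```

A real detector would threshold this score, and it would have to cope with the watermarked image being saved as a real-valued (and probably compressed) file, which is where the interesting part of the exercise begins.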

Next time we’ll use the Fourier transform to do other cool things to images, like designing filters and combining images in interesting ways.

Until then!

# The Fast Fourier Transform

John Tukey, one of the developers of the Cooley-Tukey FFT algorithm.

It’s often said that the Age of Information began with Cooley and Tukey’s paper, “An Algorithm for the Machine Calculation of Complex Fourier Series,” which was submitted on August 17, 1964 and published in 1965. They published a landmark algorithm which has since been called the Fast Fourier Transform algorithm, and has spawned countless variations. Specifically, it improved the best known computational bound on the discrete Fourier transform from $O(n^2)$ to $O(n \log n)$, which is the difference between uselessness and panacea.

Indeed, their work was revolutionary because so much of our current daily lives depends on efficient signal processing. Digital audio and video, graphics, mobile phones, radar and sonar, satellite transmissions, weather forecasting, economics and medicine all use the Fast Fourier Transform algorithm in a crucial way. (Not to mention that electronic circuits wouldn’t exist without Fourier analysis in general.) Before the Fast Fourier Transform algorithm was public knowledge, it simply wasn’t feasible to process digital signals.

Amusingly, Cooley and Tukey’s particular algorithm was known to Gauss around 1800 in a slightly different context; he simply didn’t find it interesting enough to publish, even though it predated the earliest work on Fourier analysis by Joseph Fourier himself.

In this post we will derive and implement a Fast Fourier Transform algorithm, and explore a (perhaps naive) application to audio processing. In particular, we will remove white noise from a sound clip by filtering the frequency spectrum of a noisy signal.

As usual, all of the resources used for this post are available on this blog’s Github page.

## Derivation

It turns out that there are a number of different ways to speed up the naive Fourier Transform computation. As we saw in our primer on the discrete Fourier transform, the transform itself can be represented as a matrix. With a bit of nontrivial analysis, one can factor the Fourier transform matrix into a product of two sparse matrices (i.e., which contain mostly zeros), and from there the operations one can skip are evident. The intuitive reason why this should work is that the Fourier transform matrix is very structured, and the complex exponential has many useful properties. For more information on this method, see these lecture notes (p. 286).

We will take a different route which historically precedes the matrix factorization method. In fact, the derivation we’ll trace through below was Cooley and Tukey’s original algorithm. The method itself is a classical instance of a divide and conquer algorithm. Familiar programmers would recognize the ideas in common sorting algorithms: both mergesort and quicksort are divide and conquer algorithms.

The difficulty here is to split a list of numbers into two lists which are half in size in such a way that the Fourier transforms of the smaller pieces can be quickly combined to form the Fourier transform of the original list. Once we accomplish that, the rest of the algorithm falls out from recursion; further splitting each piece into two smaller pieces, we will eventually reach lists of length two or one (in which case the Fourier transform is completely trivial).

For now, we’ll assume that the length of the list is a power of 2. That is,

$f = (f[0], f[1], \dots, f[n-1]), n = 2^k$ for some $k$.

We will also introduce the somewhat helpful notation for complex exponentials. Specifically, $\omega[p,q] = e^{2\pi i q/p}$. Here the $p$ will represent the length of the list, and $q$ will be related to the index. In particular, the complex exponential that usually shows up in the discrete Fourier transform (again, refer to our primer) is $\omega[n, -km] = e^{-2 \pi i mk/n}$. We write the discrete Fourier transform of $f$ by specifying its $m$-th entry:

$\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{n-1} f[k]\omega[n, -km]$

Now the trick here is to split up the sum into two lists each of half size. Of course, one could split it up in many different ways, but after we split it we need to be able to rewrite the pieces as discrete Fourier transforms themselves. It turns out that if we split it into the terms with even indices and the terms with odd indices, things work out nicely.

Specifically, we can write the sum above as the two sums below. The reader should study this for a moment (or better yet, try to figure out what it should be without looking) to ensure that the indices all line up. The notation grows thicker ahead, so it’s good practice.

$\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{\frac{n}{2} - 1} f[2k] \omega[n,-2km] + \sum_{k=0}^{\frac{n}{2} - 1} f[2k+1]\omega[n,-(2k+1)m]$.

Now we need to rewrite these pieces as Fourier transforms. Indeed, we must replace the occurrences of $n$ in the complex exponentials with $n/2$. This is straightforward in the first summation, since

$\omega[n, -2km] = e^{-2 \pi i (2km)/n} = e^{-2 \pi i km/(n/2)} = \omega[\frac{n}{2}, -km]$.

For the second summation, we observe that

$\displaystyle \omega[n, -(2k+1)m] = \omega[n, -2km] \cdot \omega[n,-m]$.

Now if we factor out the $\omega[n,-m]$, we can transform the second sum in the same way as the first, but with that additional factor out front. In other words, we now have

$\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{\frac{n}{2}-1} f[2k] \omega[n/2, -km] + \omega[n, -m] \sum_{k=0}^{\frac{n}{2}-1}f[2k+1] \omega[n/2, -km]$.

If we denote the list of even-indexed entries of $f$ by $f_{\textup{even}}$, and the list of odd-indexed entries by $f_{\textup{odd}}$, we see that this is just a combination of two Fourier transforms of length $n/2$. That is,

$\displaystyle \mathscr{F}f = \mathscr{F}f_{\textup{even}} + \omega[n,-m] \mathscr{F}f_{\textup{odd}}$

But we have a big problem here. Computer scientists will recognize this as a type error. The thing on the left hand side of the equation is a list of length $n$, while the thing on the right hand side is a sum of two lists of length $n/2$, and so it has length $n/2$. Certainly it is still true for values of $m$ which are less than $n/2$; we broke no algebraic laws in the way we rewrote the sum (just in the use of the $\mathscr{F}$ notation).

Recalling our primer on the discrete Fourier transform, we naturally extended the signals involved to be periodic. So indeed, the length-$n/2$ Fourier transforms satisfy the following identity for each $m$.

$\mathscr{F}f_{\textup{even}}[m] = \mathscr{F}f_{\textup{even}}[m+n/2] \\ \mathscr{F}f_{\textup{odd}}[m] = \mathscr{F}f_{\textup{odd}}[m+n/2]$

However, this does not mean we can use the same formula above! Indeed, for a length $n$ Fourier transform, it is not true in general that $\mathscr{F}f[m] = \mathscr{F}f[m+n/2]$. But the correct formula is close to it. Plugging in $m + n/2$ to the summations above, we have

$\displaystyle \mathscr{F}f[m+n/2] = \sum_{k=0}^{\frac{n}{2} - 1}f[2k] \omega[n/2, -(m+n/2)k] + \\ \omega[n, -(m+n/2)] \sum_{k=0}^{\frac{n}{2} - 1}f[2k+1] \omega[n/2, -(m+n/2)k]$

Now we can use the easy-to-prove identity

$\displaystyle \omega[n/2, -(m+n/2)k] = \omega[n/2, -mk] \omega[n/2, -kn/2]$

And we see that the right-hand factor is simply $e^{-2 \pi i k} = 1$. Similarly, one can trivially prove the identity

$\displaystyle \omega[n, -(m+n/2)] = \omega[n, -m] \omega[n, -n/2] = -\omega[n, -m]$,

thus simplifying the massive formula above to the more familiar

$\displaystyle \mathscr{F}f[m + n/2] = \sum_k f[2k]\omega[n/2, -mk] - \omega[n, -m] \sum_k f[2k+1] \omega[n/2, -mk]$

Now finally, we have the Fast Fourier Transform algorithm expressed recursively as:

$\displaystyle \mathscr{F}f[m] = \mathscr{F}f_{\textup{even}}[m] + \omega[n,-m] \mathscr{F}f_{\textup{odd}}[m]$
$\displaystyle \mathscr{F}f[m+n/2] = \mathscr{F}f_{\textup{even}}[m] - \omega[n,-m] \mathscr{F}f_{\textup{odd}}[m]$

With the base case being $\mathscr{F}([a]) = [a]$.
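Before trusting the recursion, it is worth checking the two identities numerically. The following is a minimal sketch, assuming a direct $O(n^2)$ evaluation of the defining sum as the reference; the names `naive_dft` and `butterfly_error` are ours and are not part of the code accompanying this post.

```python
import cmath

def omega(p, q):
    # omega[p, q] = e^{2 pi i q / p}, matching the notation of the derivation
    return cmath.exp(2j * cmath.pi * q / p)

def naive_dft(f):
    # Direct O(n^2) evaluation of the defining sum (reference implementation)
    n = len(f)
    return [sum(f[k] * omega(n, -k * m) for k in range(n)) for m in range(n)]

f = [3, 1, 4, 1, 5, 9, 2, 6]   # arbitrary test signal of even length
n = len(f)
F = naive_dft(f)
F_even = naive_dft(f[0::2])    # transform of the even-indexed entries
F_odd = naive_dft(f[1::2])     # transform of the odd-indexed entries

def butterfly_error(m):
    # Largest deviation from the two recursive identities at index m
    top = F_even[m] + omega(n, -m) * F_odd[m]
    bottom = F_even[m] - omega(n, -m) * F_odd[m]
    return max(abs(F[m] - top), abs(F[m + n // 2] - bottom))

max_error = max(butterfly_error(m) for m in range(n // 2))
```

Up to floating point roundoff, `max_error` should be zero, which confirms that both halves of the output are recovered from the two half-length transforms.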

## In Python

With all of that notation out of the way, the implementation is quite short. First, we should mention a few details about complex numbers in Python. Much to this author’s chagrin, Python represents $i$ using the symbol 1j. That is, in Python, $a + bi$ is

a + b * 1j

Further, Python reserves a special library for complex numbers, the cmath library. So we implement the omega function above as follows.

import cmath

def omega(p, q):
    return cmath.exp((2.0 * cmath.pi * 1j * q) / p)

And then the Fast Fourier Transform algorithm is more or less a straightforward translation of the mathematics above:

def fft(signal):
    n = len(signal)  # assumed to be a power of two
    if n == 1:
        return signal
    else:
        # Recursively transform the even- and odd-indexed entries
        Feven = fft([signal[i] for i in range(0, n, 2)])
        Fodd = fft([signal[i] for i in range(1, n, 2)])

        combined = [0] * n
        for m in range(n // 2):
            combined[m] = Feven[m] + omega(n, -m) * Fodd[m]
            combined[m + n // 2] = Feven[m] - omega(n, -m) * Fodd[m]

        return combined

Here I use the awkward variable names “Feven” and “Fodd” to be consistent with the notation above. Note that this implementation is highly non-optimized. There are many ways to improve the code, most notably using a different language and caching the complex exponential computations. In any event, the above code is quite readable, and a capable programmer could easily speed this up by orders of magnitude.
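As one small step in that direction, here is a sketch that computes each complex exponential once per loop iteration instead of twice, and slices the input rather than building the sublists by index. The name `fft_cached` is hypothetical, and like the version above it assumes the input length is a power of two.

```python
import cmath

def fft_cached(signal):
    # Assumes len(signal) is a power of two, as in the derivation above
    n = len(signal)
    if n == 1:
        return list(signal)
    f_even = fft_cached(signal[0::2])
    f_odd = fft_cached(signal[1::2])
    half = n // 2
    combined = [0] * n
    for m in range(half):
        # Evaluate omega[n, -m] * Fodd[m] once and reuse it for both outputs
        twiddle = cmath.exp(-2j * cmath.pi * m / n) * f_odd[m]
        combined[m] = f_even[m] + twiddle
        combined[m + half] = f_even[m] - twiddle
    return combined

delta_transform = fft_cached([1, 0, 0, 0])
shifted_transform = fft_cached([0, 1, 0, 0])
```

This is still far from an optimized implementation; a serious one would precompute the whole table of exponentials and avoid recursion and list allocation altogether.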

Of course, we should test this function on at least some of the discrete signals we investigated in our primer. And indeed, it functions correctly on the unshifted delta signal

>>> fft([1,0,0,0])
[(1+0j), (1+0j), (1+0j), (1+0j)]
>>> fft([1,0,0,0,0,0,0,0])
[(1+0j), (1+0j), (1+0j), (1+0j), (1+0j), (1+0j), (1+0j), (1+0j)]

As well as the shifted version (up to floating point roundoff error)

>>> fft([0,1,0,0])
[(1+0j),
(6.123233995736766e-17-1j),
(-1+0j),
(-6.123233995736766e-17+1j)]
>>> fft([0,1,0,0,0,0,0,0])
[(1+0j),
(0.7071067811865476-0.7071067811865475j),
(6.123233995736766e-17-1j),
(-0.7071067811865475-0.7071067811865476j),
(-1+0j),
(-0.7071067811865476+0.7071067811865475j),
(-6.123233995736766e-17+1j),
(0.7071067811865475+0.7071067811865476j)]

And testing it on various other shifts gives further correct outputs. Finally, we test the Fourier transform of the discrete complex exponential, which the reader will recall is a scaled delta.

>>> w = cmath.exp(2 * cmath.pi * 1j / 8)
>>> w
(0.7071067811865476+0.7071067811865475j)
>>> d = 4
>>> fft([w**(k*d) for k in range(8)])
[                           1.7763568394002505e-15j,
(-7.357910944937894e-16  + 1.7763568394002503e-15j),
(-1.7763568394002505e-15 + 1.7763568394002503e-15j),
(-4.28850477329429e-15   + 1.7763568394002499e-15j),
(8                       - 1.2434497875801753e-14j),
(4.28850477329429e-15    + 1.7763568394002505e-15j),
(1.7763568394002505e-15  + 1.7763568394002505e-15j),
(7.357910944937894e-16   + 1.7763568394002505e-15j)]

Note that $n = 8$ occurs in the position with index $d=4$, and all of the other values are negligibly close to zero, as expected.

So that’s great! It works! Unfortunately it’s not quite there in terms of usability. In particular, we require our signal to have length a power of two, and most signals don’t happen to be that long. It turns out that this is a highly nontrivial issue, and all implementations of a discrete Fourier transform algorithm have to compensate for it in some way.

The easiest solution is to simply add zeros to the end of the signal until it is long enough, and just work with everything in powers of two. This technique (called zero-padding) is only reasonable if the signal in question is actually zero outside of the range it’s sampled from. Otherwise, subtle and important things can happen. We’ll leave further investigations to the reader, but the gist of the idea is that the Fourier transform assumes periodicity of the data one gives it, and so padding with zeros imposes a kind of periodicity that is simply nonexistent in the actual signal.
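For concreteness, zero-padding can be sketched in a few lines. The helper name `pad_to_power_of_two` is our own invention for illustration, and the caveats above about when padding is appropriate still apply.

```python
def pad_to_power_of_two(signal):
    # Find the smallest power of two that is at least len(signal)
    n = len(signal)
    target = 1
    while target < n:
        target *= 2
    # Append zeros until the signal has that length (an empty input becomes [0])
    return list(signal) + [0] * (target - n)

padded = pad_to_power_of_two([1, 2, 3, 4, 5])
```

Here a signal of length 5 is extended with three zeros to length 8, while a signal whose length is already a power of two is returned unchanged (as a copy).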

Now, of course not every Fast Fourier transform uses zero-padding. Unfortunately the techniques to evaluate a non-power-of-two Fourier transform are relatively complex, and beyond the scope of this post (though not beyond the scope of this blog). The interested reader should investigate the so-called Chirp Z-transform, as described by Rabiner, Schafer, and Rader in 1969. We may cover it here in the future.

In any case, the code above is a good starting point for any technique, and as usual it is available on this blog’s Github page. Finally, the inverse transform has a simple implementation, since it can be represented in terms of the forward transform (hint: remember duality!). We leave the code as an exercise to the reader.

## Experiments with Sound — I am no Tree!

We’ll manipulate a clip of audio from Treebeard, a character from Lord of the Rings.

For the remainder of this post we’ll use a more established Fast Fourier Transform algorithm from the Python numpy library. The reasons for this are essentially convenience. Being implemented in C and brandishing the full might of the literature on Fourier transform algorithms, the numpy implementation is lightning fast. Now, note that the algorithm we implemented above is still correct (if one uses zero padding)! The skeptical reader may run the code to verify this. So we are justified in abandoning our implementation until we decide to seriously focus on improving its speed.

So let’s play with sound.

The sound clip we’ll be using is the following piece of dialog from the movie The Lord of the Rings: The Two Towers. We include the original wav file with the code on this blog’s Github page.

If we take the Fourier transform of the sound sample, we get the following plot of the frequency spectrum. Recall, the frequency spectrum is the graph of the norm of the frequency values.

The frequency spectrum of an Ent speaking. The x-axis represents the index in the list (larger values correspond to larger frequencies), and the y-axis corresponds to intensity (larger values correspond to that frequency having a greater presence in the original signal). There is symmetry about the ~80,000 index, and we may consider the right half ‘negative’ frequencies because the Fourier Transform is periodic modulo its length.

Here we note that there is a symmetry to the graph. This is not a coincidence: if the input signal is real-valued, the magnitude of its Fourier transform is always symmetric about the center index. The reason for this goes back to our first primer on the Fourier series, in that the negative coefficients were complex conjugates of the positive ones. In any event, we only need concern ourselves with the first half of the values.
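This symmetry is easy to observe on a small example. The sketch below uses a direct evaluation of the transform (the helper `naive_dft` is ours) on an arbitrary real-valued list, and measures how far the entry at index $n - m$ is from the complex conjugate of the entry at index $m$.

```python
import cmath

def naive_dft(f):
    # Direct O(n^2) discrete Fourier transform, for illustration only
    n = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

real_signal = [0.5, -1.0, 2.0, 0.25, 3.0, -0.75, 1.5, 0.0]  # arbitrary real values
F = naive_dft(real_signal)
n = len(F)

# For real input, F[n - m] should be the complex conjugate of F[m]
conjugate_error = max(abs(F[n - m] - F[m].conjugate()) for m in range(1, n))
# Consequently the magnitudes mirror each other about the center index
magnitude_error = max(abs(abs(F[n - m]) - abs(F[m])) for m in range(1, n))
```

Both errors should vanish up to roundoff, which is exactly the mirror image seen in the plot of the spectrum.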

We will omit the details for doing file I/O with audio. The interested reader can inspect our source code, and they will find a very basic use of the scikits.audiolab library, and matplotlib for plotting the frequency spectrum.

Now, just for fun, let’s tinker with this audio bite. First we’ll generate some random noise in the signal. That is, let’s mutate the input signal as follows:

import random
inputSignal = [x/2.0 + random.random()*0.1 for x in inputSignal]

This gives us a sound bite with an annoying fuzziness in the background:

Next, we will use the Fourier transform to remove some of this noise. Of course, since the noise is random it inhabits all the frequencies to some degree, so we can’t eliminate it entirely. Furthermore, as the original audio clip uses some of the higher frequencies, we will necessarily lose some information in the process. But in the real world we don’t have access to the original signal, so we should clean it up as best we can without losing too much information.

To do so, we can plot the frequencies for the noisy signal:

Comparing this with our original graph (cheating, yes, but the alternative is to guess and check until we arrive at the same answer), we see that the noise begins to dominate the spectrum at around the 20,000th index. So, we’ll just zero out all of those frequencies.

def frequencyFilter(signal):
    # signal holds the Fourier coefficients; zero out the middle (high) frequencies in place
    for i in range(20000, len(signal)-20000):
        signal[i] = 0

And voila! The resulting audio clip, while admittedly damaged, has only a small trace of noise:
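To see the same idea end to end on something small enough to inspect, here is a sketch on a synthetic signal: a slow cosine plus a faster, weaker one standing in for the unwanted component. Everything here (the helper names `dft` and `low_pass`, the length 32, the cutoff 5) is an assumption made for illustration and is not taken from the code on Github. Unlike the loop above, this version treats the two ends of the spectrum symmetrically, so that the result stays real.

```python
import cmath
import math

def dft(f, sign=-1):
    # sign=-1 is the forward transform; sign=+1 is the inverse without the 1/n factor
    n = len(f)
    return [sum(f[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def low_pass(spectrum, cutoff):
    # Keep indices below cutoff and their mirror images above n - cutoff; zero the rest
    n = len(spectrum)
    return [x if (m < cutoff or m > n - cutoff) else 0
            for m, x in enumerate(spectrum)]

n = 32
slow = [math.cos(2 * math.pi * 2 * k / n) for k in range(n)]         # frequency 2
fast = [0.3 * math.cos(2 * math.pi * 11 * k / n) for k in range(n)]  # frequency 11
noisy = [s + h for s, h in zip(slow, fast)]

filtered_spectrum = low_pass(dft(noisy), cutoff=5)
recovered = [x / n for x in dft(filtered_spectrum, sign=+1)]
max_deviation = max(abs(r - s) for r, s in zip(recovered, slow))
```

Because the unwanted component here lives entirely at one frequency above the cutoff, the filter removes it completely and `max_deviation` is zero up to roundoff. Real noise is spread over all frequencies, which is why the audio clip retains a trace of it.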

Inspecting the source code, one should note that at one point we halve the amplitude of the input signal (i.e., in the time domain). The reason for this is if one arbitrarily tinkers with the frequencies of the Fourier transform, it can alter the amplitude of the original signal. As one quickly discovers, playing the resulting signal as a wav file can create an unpleasant crackling noise. The amplitudes are clipped (or wrapped around) by the audio software or hardware for a reason which is entirely mysterious to this author.

In any case, this is clearly not the optimal way (or even a good way) to reduce white noise in a signal. There are certainly better methods, but for the sake of time we will save them for future posts. The point is that we were able to implement and use the Fast Fourier Transform algorithm to analyze the discrete Fourier transform of a real-world signal, and manipulate it in logical ways. That’s quite the milestone considering where we began!

Next up in this series we’ll investigate more techniques on sound processing and other one-dimensional signals, and we’ll also derive a multi-dimensional Fourier transform so that we can start working with images and other two-dimensional signals.

Until then!

# The Discrete Fourier Transform — A Primer

So here we are. We have finally made it to a place where we can transition with confidence from the classical continuous Fourier transform to the discrete version, which is the foundation for applications of Fourier analysis to programming. Indeed, we are quite close to unfurling the might of the Fast Fourier Transform algorithm, which efficiently computes the discrete Fourier transform. But because of its focus on algorithmic techniques, we will save it for a main content post and instead focus here on the intuitive connections between the discrete and continuous realms.

The goal has roughly three parts:

1. Find a reasonable discrete approximation to a continuous function.
2. Find a reasonable discrete approximation to the Fourier transform of a continuous function.
3. Find a way to transition between the two discrete representations.

We should also note that there will be some notational clashes in the sequel. Rigorously, item 3 will result in an operation which we will denote by $\mathscr{F}$, but will be distinct from the classical Fourier transform on continuous functions. Indeed, they will have algebraic similarities, but one operates on generalized functions, and the other on finite sequences. We will keep the distinction clear from context to avoid adding superfluous notation. Moreover, we will avoid the rigor from the last post on tempered distributions. Instead we simply assume all functions are understood to be distributions and use the classical notation. Again, this is safe because our dabbling with the classical transform will be solely for intuition.

Of course, the point of this entire post is that all of the facts we proved about the continuous Fourier transform have direct translations into the discrete case. Up to a constant factor (sometimes) and up to notational conventions, the two theories will be identical. This is part of the beauty of the subject; sit back and enjoy it.

## Sampling

There is a very nice theorem about classical Fourier transforms that has to do with reconstructing a function from a discrete sample of its points. Since we do not actually need this theorem for anything other than motivation, we will not prove it here. Instead, we’ll introduce the definitions needed to state it, and see why it motivates a good definition for the discrete “approximation” to a function. For a much more thorough treatment of the sampling theorem and the other issues we glaze over in this post, see these lecture notes.

Definition: A function $f$ is time limited if it has bounded support. A function $f$ is bandlimited if its Fourier transform has bounded support. That is, if there is some $B$ so that the Fourier transform of $f$ is only nonzero when $|s| < B$. We call $B$ the bandwidth of $f$.

To be honest, before seeing a mathematical treatment of signal processing, this author had no idea what bandwidth actually referred to. It’s nice to finally solve those life mysteries.

In any case, it often occurs that one has a signal $f$ for which one can only measure values, but one doesn’t have access to an exact description of $f$. The sampling theorem allows us to reconstruct $f$ exactly by choosing certain sample points. In a simplified form, the theorem is as follows:

Theorem: Suppose $f$ is a function of bandwidth $B$. Then one can reconstruct $f$ exactly from the set of points $(k/2B, f(k/2B))$ as $k$ ranges over $\mathbb{Z}$ (that is, the samples are spaced $1/2B$ apart, for a sampling rate of $2B$).

Unsurprisingly, the proof uses the Fourier transform in a nontrivial way. Moreover, there is a similar theorem for the Fourier transform $\mathscr{F}f$, which can be reconstructed exactly from its samples if they are spaced at most $1/L$ apart, where $L/2$ bounds the support of $f$. Note that the sample spacing in one domain is determined by the limitations on the other domain.

What’s more, if we are daring enough to claim $f$ is both time limited and bandlimited (and we sample as little as possible in each domain), then the number of sample points is the same for both domains. In particular, suppose $f(t)$ is only nonzero on $0 \leq t \leq L$ and its Fourier transform on $0 \leq s \leq 2B$. Note that since $f$ is both time limited and bandlimited, there is no fault in shifting both so their smallest value is at the origin. If $n, m$ are the respective numbers of sampled points in the time and frequency domains, then $m/L = 2B, n/2B = L$, and so $m = n = 2BL$.

Now this gives us a good reason to define the discrete approximation to a signal as

$\displaystyle f_{\textup{sampled}} = (f(t_0), f(t_1), \dots, f(t_{n-1})),$

where $t_k = k/2B$. This accomplishes our first task. Now, in order to determine what the discrete Fourier transform should look like, we can represent this discrete approximation as a distribution using shifted deltas:

$\displaystyle f_{\textup{sampled}}(t) = \sum_{k=0}^{n-1}f(t_k)\delta(t-t_k)$

Here the deltas ensure the function is zero at points not among the $t_k$, capturing the essence of the finite sequence above. Taking the Fourier transform of this (recalling that the Fourier transform of the shifted delta is a complex exponential and using linearity), we get

$\displaystyle \mathscr{F}f_{\textup{sampled}}(s) = \sum_{k=0}^{n-1}f(t_k)e^{-2 \pi i s t_k}$

Denote the above function by $F$. Now $F$ is unfortunately still continuous, so we take the same approach we did for $f$ and sample it via the samples in the frequency domain at $s_j = j/L$. This gives the following list of values for the discrete approximation to the transformed version of $f$:

$\displaystyle F(s_0) = \sum_{k=0}^{n-1} f(t_k)e^{-2 \pi i s_0t_k}$

$\displaystyle F(s_1) = \sum_{k=0}^{n-1} f(t_k)e^{-2 \pi i s_1t_k}$

$\vdots$

$\displaystyle F(s_{n-1}) = \sum_{k=0}^{n-1} f(t_k)e^{-2 \pi i s_{n-1}t_k}$

And now the rest falls together. We can denote by $(\mathscr{F}f)_{\textup{sampled}}$ the tuple of values $F(s_j)$, and the list of formulas above gives a way to transition from one domain to the other.

## A Discrete Life

Now we move completely away from the continuous realm. The above discussion was nice in giving us an intuition for why the following definitions are reasonable. However, in order to actually use these ideas on discrete sequences, we can’t be restricted to sampling continuous functions. In fact, most of our planned work on this blog will not use discrete approximations to continuous functions, but just plain old sequences. So we just need to redefine the above ideas in a purely discrete realm.

The first step is to get rid of any idea of the sampling rate, and instead write everything in terms of the indices. This boils down to a convenient coincidence. If $t_k = k/2B$ and $s_j = j/L$, then by our earlier remark that $2BL = n$ (the number of sample points taken), we have $t_ks_j = kj/n$. This allows us to rewrite the complex exponentials as $e^{-2 \pi i kj/n}$, which is entirely in terms of the number of points used and the indices of the summation.

In other words, we can finally boil all of this discussion down to the following definition. Before we get to it, we should note that we use the square bracket notation for lists. That is, if $L$ is a list, then $L[i]$ is the $i$-th element of that list. Usually one would use subscripts to denote this, but we’d like to stay close to our notation for the continuous case, at least to make the parallels between the two theories more obvious.

Definition: Let $f = (f[0], \dots, f[n-1])$ be a vector in $\mathbb{C}^n$. Then the discrete Fourier transform of $f$ is defined by the vector $(\mathscr{F}f[0], \dots, \mathscr{F}f[n-1])$, where

$\displaystyle \mathscr{F}f[j] = \sum_{k=0}^{n-1} f[k]e^{-2 \pi i jk/n}$

The phrase “discrete Fourier transform” is often abbreviated to DFT.
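Since the definition is a finite sum, it translates directly into code. The following is a minimal sketch of the definition itself (quadratic time, no cleverness); the function name `dft` is our choice for illustration.

```python
import cmath

def dft(f):
    # Entry `freq` of the output is the sum over k of f[k] * e^{-2 pi i * freq * k / n}
    n = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * freq * k / n) for k in range(n))
            for freq in range(n)]

constant_transform = dft([1, 1, 1, 1])
```

For the constant signal above, the entry at index 0 is the sum of the entries, namely 4, and the remaining entries are zero up to floating point roundoff.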

Now although we want to showcase the connections between the discrete and continuous Fourier transforms, we should note that they are completely disjoint. If one so desired (and many books follow this route) one could start with the discrete Fourier transform and later introduce the continuous version. We have the advantage of brevity here, in that we know what the theorems should say before we actually prove them. The point is that while the two transforms have connections and similarities, their theory is largely independent, and moreover the definition of the Fourier transform can be given without any preface. Much as we claimed with the continuous Fourier transform, the discrete Fourier transform is simply an operation on lists of data (i.e., on vectors).

Pushing the discrete to the extreme, we can represent the list of complex exponentials as a discrete signal too.

Definition: The discrete complex exponential of period $n$ is the list

$\displaystyle \omega_n = (1, e^{2 \pi i/n}, \dots, e^{2 \pi i (n-1)/n}).$

We will omit the subscript $n$ when it is understood (at least, for the rest of this primer). Hence $\omega[k] = e^{2 \pi i k/n}$. If we denote by $\omega^m$ the vector of entry-wise $m$-th powers of $\omega$, then moreover $\omega^m[k] = \omega^k[m] = e^{2\pi imk/n}$, and we can write the discrete Fourier transform in its most succinct form as:

$\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{n-1} f[k]\omega^{-m}[k]$

or, dropping the index $m$ and using $\omega^{-m}[k] = \omega^{-k}[m]$ to write everything in “operator notation,”

$\displaystyle \mathscr{F}f = \sum_{k=0}^{n-1}f[k]\omega^{-k}.$

There are other ways to represent the discrete Fourier transform as an operation. In fact, since we know the classical Fourier transform is a linear operator, we should be able to come up with a matrix representation of the discrete transform. We will do just this, and as a result we will find the inverse discrete Fourier transform, but first we should look at some examples.

## The Transforms of Discrete Signals

Perhaps the simplest possible signal one could think of is the delta function, whose discretized form is

$\displaystyle \delta = (1,0, \dots, 0)$

with the corresponding shifted form

$\displaystyle \delta_k = (0, 0, \dots 0, 1, 0, \dots 0)$

where the 1 appears in the $k$-th spot. The reader should be convinced that this is indeed the right definition because its continuous cousin’s defining properties translate nicely. Specifically, $\sum_{k}\delta_m[k] f[k] = f[m]$ for all $m$, and $\sum_{k}\delta[k] = 1$.

Now to find its Fourier transform, we simply use the definition:

$\displaystyle \mathscr{F}\delta[m] = \sum_{k=0}^{n-1}\delta[k] \omega^{-m}[k]$

The only time that $\delta[k]$ is nonzero is for $k=0$, so this is simply the vector $\delta[0] \omega^{0}$, which is the vector consisting entirely of 1’s. Hence, as in the continuous case (or at least, for the continuous definition of the 1 distribution),

$\displaystyle \mathscr{F}\delta = 1$

In an identical manner, one can prove that $\mathscr{F}\delta_k = \omega^{-k}$, just as it was for the continuous transform.

Now let’s do an example which deviates slightly from the continuous case. Recall that the continuous Fourier transform of the complex exponential was a delta function (to convince the reader, simply see that a single complex exponential can only have a single angular variable, and hence a single frequency). In the discrete case, we get something similar.

$\displaystyle \mathscr{F}\omega^{d} = \sum_{k=0}^{n-1} \omega^d[k]\omega^{-k}$,

so looking at the $m$-th entry of this vector, we get

$\displaystyle \mathscr{F}\omega^d[m] = \sum_{k=0}^{n-1} \omega^d[k] \omega^{-m}[k]$

but because $\omega^{-m}[k] = e^{-2 \pi i km/n} = \overline{\omega^m[k]}$, we see that the sum is just the usual complex inner product $\left \langle \omega^d, \omega^m \right \rangle$. Further, as the differing powers of $\omega$ are orthogonal (hint: their complex inner product forms a geometric series), we see that this inner product is only nonzero when $d = m$. In this case, it is easy to show that the inner product is precisely $n$. Summarizing,

$\displaystyle \mathscr{F}\omega^d[m] = \begin{cases} n & \textup{if } m = d \\ 0 & \textup{otherwise} \end{cases}$

This is naturally $n \delta_d$, so our final statement is just $\mathscr{F}\omega^d = n\delta_d$. Note that here we have the extra factor of $n$ floating around. The reason for this boils down to the fact that the norm of the complex exponential $\omega$ is $\sqrt{n}$, and not 1. That is, the powers of $\omega$ do not form an orthonormal basis of $\mathbb{C}^n$. There are various alternative definitions that try to compensate for this, but to the best of this author’s knowledge a factor of $n$ always shows up in some of the resulting formulas. In other words, we should just accept this deviation from the continuous theory as collateral damage.
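Both example transforms can be checked numerically. This is a minimal sketch with $n = 8$, in which the helper names (`dft`, `omega_power`, `shifted_delta`, `max_diff`) and the particular shift and exponent are arbitrary choices of ours.

```python
import cmath

n = 8

def dft(f):
    # Direct evaluation of the definition of the discrete Fourier transform
    size = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * m * k / size) for k in range(size))
            for m in range(size)]

def omega_power(p):
    # The vector omega^p, whose k-th entry is e^{2 pi i p k / n}
    return [cmath.exp(2j * cmath.pi * p * k / n) for k in range(n)]

def shifted_delta(k):
    # The vector delta_k: a single 1 in position k, zeros elsewhere
    return [1 if i == k else 0 for i in range(n)]

def max_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# F(delta_3) should equal omega^{-3}
delta_error = max_diff(dft(shifted_delta(3)), omega_power(-3))

# F(omega^5) should equal n * delta_5, extra factor of n included
d = 5
exponential_error = max_diff(dft(omega_power(d)), [n * x for x in shifted_delta(d)])
```

Both errors should be zero up to roundoff, and the second check makes the stray factor of $n$ visible: the single nonzero entry of the transform is 8, not 1.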

The mischievous presence of $n$ also shows up in an interesting way in the inverse discrete Fourier transform.

## The DFT and its Inverse, as a Matrix

It’s a trivial exercise to check by hand that the discrete Fourier transform is a linear operation on vectors. That is, $\mathscr{F}(v_1 + v_2)[m] = \mathscr{F}v_1[m] + \mathscr{F}v_2[m]$ for all vectors $v_1, v_2$ and all $m$ (and similarly for scalar multiples). As we know from our primer on linear algebra, all such mappings are expressible as matrix multiplication by a fixed matrix.

The “compact” form in which we wrote the discrete Fourier transform above sheds light onto this question. Specifically, the formula

$\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{n-1}f[k]\omega^{-m}[k]$

is basically just the definition of a matrix multiplication. Viewing $f$ as the column vector

$\displaystyle \begin{pmatrix} f[0]\\ f[1]\\ \vdots\\ f[n-1] \end{pmatrix}$

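It is easy to see that the correct matrix to act on this vector is

$\displaystyle \mathscr{F} = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & e^{-2\pi i/n} & e^{-2\pi i \cdot 2/n} & \dots & e^{-2\pi i (n-1)/n} \\ 1 & e^{-2\pi i \cdot 2/n} & e^{-2\pi i \cdot 4/n} & \dots & e^{-2\pi i \cdot 2(n-1)/n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & e^{-2\pi i (n-1)/n} & e^{-2\pi i \cdot 2(n-1)/n} & \dots & e^{-2\pi i (n-1)^2/n} \end{pmatrix}$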

A perhaps easier way to digest this matrix is by viewing each row of the matrix as the vector $\omega^{-m}$.

$\displaystyle \mathscr{F} = \begin{pmatrix} \textbf{---} & \omega^0 & \textbf{---} \\ \textbf{---} & \omega^{-1} & \textbf{---} \\ & \vdots & \\ \textbf{---} & \omega^{-(n-1)} & \textbf{---} \end{pmatrix}$

Now let’s just admire this matrix for a moment. What started many primers ago as a calculus computation requiring infinite integrals and sometimes divergent answers is now trivial to compute. This is our first peek into how to compute discrete Fourier transforms in a program, but unfortunately we are well aware of the fact that computing this naively requires $O(n^2)$ computations. Our future post on the Fast Fourier Transform algorithm will take advantage of the structure of this matrix to improve this by an order of magnitude to $O(n \log n)$.

But for now, let’s find the inverse Fourier transform as a matrix. From the first of the two matrices above, it is easy to see that the matrix for $\mathscr{F}$ is symmetric. Indeed, the roles of $k,m$ are identical in $e^{-2\pi i km/n}$. Furthermore, looking at the second matrix, we see that $\mathscr{F}$ is almost unitary. Recall that a matrix $A$ is unitary if $AA^* = I$, where $I$ is the identity matrix and $A^*$ is the conjugate transpose of $A$. For the case of $\mathscr{F}$, we have that $\mathscr{FF^*} = nI$. We encourage the reader to work this out by hand, noting that each entry in the matrix resulting from $\mathscr{FF^*}$ is an inner product of powers of $\omega$.

In other words, we have $\mathscr{F}\cdot \frac{1}{n}\mathscr{F^*} = I$, so that $\mathscr{F}^{-1} = \frac{1}{n}\mathscr{F^*}$. Since $\mathscr{F}$ is symmetric, this simplifies to $\frac{1}{n}\overline{\mathscr{F}}$.

Expanding this out to a summation, we get what we might have guessed was the inverse transform:

$\displaystyle \mathscr{F}^{-1}f[m] = \frac{1}{n} \sum_{k=0}^{n-1} f[k]\omega^m[k]$.

The only difference between this formula and our intuition is the factor of $1/n$.
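The matrix picture is also easy to test on a small case. The sketch below builds the matrix with entries $e^{-2\pi i mk/n}$ as nested lists, checks that $\mathscr{FF^*} = nI$, and confirms that $\frac{1}{n}\mathscr{F^*}$ undoes the transform. All helper names are ours, and plain lists are used only to keep the example free of dependencies.

```python
import cmath

def dft_matrix(n):
    # Entry (m, k) is e^{-2 pi i m k / n}; row m is the vector omega^{-m}
    return [[cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n)]
            for m in range(n)]

def conjugate_transpose(A):
    return [[A[r][c].conjugate() for r in range(len(A))] for c in range(len(A[0]))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def apply(A, v):
    # Matrix-vector product
    return [sum(a * x for a, x in zip(row, v)) for row in A]

n = 4
F = dft_matrix(n)
F_star = conjugate_transpose(F)

# F F* should be n times the identity matrix
product = matmul(F, F_star)
unitary_error = max(abs(product[i][j] - (n if i == j else 0))
                    for i in range(n) for j in range(n))

# The inverse transform is (1/n) F*; applying it after F should return the input
F_inverse = [[x / n for x in row] for row in F_star]
f = [1, 2, 3, 4]
round_trip_error = max(abs(a - b) for a, b in zip(apply(F_inverse, apply(F, f)), f))
```

Both errors should be zero up to roundoff. Note that dropping the division by `n` in `F_inverse` would return the input scaled by 4, which is the factor of $n$ discussed above.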

## Duality

The last thing we’d like to mention in this primer is that the wonderful results on duality for the continuous Fourier transform translate to the discrete (again, with a factor of $n$). We leave the details as an exercise to the reader, but state the results here.

In order to do this, we need some additional notation. We can think of $\omega$ as an infinite sequence which would repeat itself every $n$ entries (by Euler’s identity). Then we can “index” $\omega$ higher than $n$, and get the identity $\omega^m[k] = \omega[mk]$. Similarly, we could “periodize” a discrete signal $f$ so that $f[m]$ is defined by $f[m \mod n]$ and we allow $m \in \mathbb{Z}$.

This periodization allows us to define $f^-$ naturally as $f^-[m] = f[-m \mod n]$. Then it becomes straightforward to check that $\mathscr{F}f^- = (\mathscr{F}f)^-$, and as a corollary $\mathscr{FF}f = nf^-$. This recovers some of our most powerful tools from the classical case for computing Fourier transforms of specific functions.
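These two duality statements can be checked numerically as well. The following is a minimal sketch on an arbitrary list of length 8, where `reverse` is our name for the operation $f \mapsto f^-$.

```python
import cmath

def dft(f):
    # Direct evaluation of the definition of the discrete Fourier transform
    n = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def reverse(f):
    # The signal f^-, defined by f^-[m] = f[-m mod n]
    n = len(f)
    return [f[-m % n] for m in range(n)]

def max_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

f = [1, 2, 3, 4, 5, 6, 7, 8]
n = len(f)

# Transforming the reversed signal equals reversing the transformed signal
commute_error = max_diff(dft(reverse(f)), reverse(dft(f)))

# Transforming twice gives n times the reversed signal
double_error = max_diff(dft(dft(f)), [n * x for x in reverse(f)])
```

Note that `reverse` fixes index 0 and flips the rest, so `reverse([1, 2, 3, 4])` is `[1, 4, 3, 2]` rather than the list read backwards.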

Next time (and this has been a long time coming) we’ll finally get to the computational meat of Fourier analysis. We’ll derive and implement the Fast Fourier Transform algorithm, which computes the discrete Fourier transform efficiently. Then we’ll take our first foray into playing with real signals, and move on to higher-dimensional transforms for image analysis.

Until then!

# Generalized Functions — A Primer

Last time we investigated the naive (which I’ll henceforth call “classical”) notion of the Fourier transform and its inverse. While the development wasn’t quite rigorous, we nevertheless discovered elegant formulas and interesting properties that proved useful in at least solving differential equations. Of course, we wouldn’t be following this trail of mathematics if it didn’t result in some worthwhile applications to programming. While we’ll get there eventually, this primer will take us deeper down the rabbit hole of abstraction. We will develop the necessary framework required to reason about Fourier transforms in a mathematically rigorous manner. Most importantly, we will avoid the divergent integrals which, when we try to use them in an otherwise rigorous proof, make our stomachs heave.

It turns out the rigorous theory is wholly neat and tidy. The whole idea hinges on a part of linear algebra which is slightly more advanced than what we’ve seen so far on this blog, but it is by all means accessible to the reader who has mastered our relevant primers. And even though we still will overlook some of the more minute details of the theory, we will cover a nontrivial portion of it, and leave further exploration to the reader’s whim and discretion.

## A Bit of Motivation

Every tidy mathematical theory deserves some kind of motivation, and the theory of Fourier transforms is no different. The primary motivating question, however, does not often change. That is, we want to ask, “Which functions make the classical Fourier transform as nice as possible?” and build our theory from that foundation. The tricky part is rigorously defining what it means to be “good” for the Fourier transform. One obvious condition we should require is that such a function have a classically-defined Fourier transform (that is, the integral defining its transform converges), but is that enough?

It turns out that this is not enough. Taking for granted the many years of mathematical development and genius that resulted in the correct conditions, we state the following criteria:

Criteria: A class of functions $C$ is good for Fourier transforms if it satisfies the following two criteria:

1. If $f \in C$ then $\mathscr{F}f \in C$ and $\mathscr{F}^{-1}f \in C$.
2. $\mathscr{FF}^{-1}f = f = \mathscr{F}^{-1}\mathscr{F}f$ for all $f \in C$.

Now if we can find a class of functions which satisfies these two criteria, then it should be a good candidate to base our formal theory of Fourier transforms on. But this raises the obvious question: what does one mean by “basing a theory” on a class of functions? It would be a waste of time to try to put it in less general non-mathematical terms, so let’s just slide right into it.

## Generalized Functions

We’ll begin with the complete unabridged definition:

Definition: Given a space $A$ of functions $f: \mathbb{R} \to \mathbb{C}$, a generalized function on $A$ is a continuous linear functional on $A$.

Breaking the definition down, a linear functional is a function $A \to \mathbb{C}$ which is linear. Explicitly, if $T$ is a linear functional, then $T$ operates on functions, and outputs complex numbers in a way that the following identity holds:

$\displaystyle T(af + bg) = aTf + bTg$

for all $a, b \in \mathbb{C}, f,g \in A$.

The requirements on the space $A$ are a bit tricky to pin down, and the details begin to overwhelm the reader quite quickly. We will be content to say that $A$ has some notion of distance which is complete. That is, sequences of functions which converge have their limit inside the space. Such a notion is required to talk about continuity.

With all that securely bolted down, we can finally say that a generalized function is just a continuous linear functional, i.e., a linear mapping $T: A \to \mathbb{C}$ which satisfies

$T(f_n) \to Tf$ whenever $f_n \to f$ in $A$.

Often generalized functions on $A$ are simply called distributions on $A$, but this author thinks that term clashes with probability distributions in an unnatural way. In fact, all probability distributions can be realized as generalized functions on some suitably chosen function space, so the name isn’t there without a reason. But for a student with no formal measure theory or probability theory under his belt, the coincidence of names can be confusing. We will stick to the term generalized function.

There is another interesting bit of notation that accompanies generalized functions, and that is to write a generalized function as if it were an inner product. That is, instead of writing $Tf$ for a linear functional $T$ operating on a function $f$, one writes $\left \langle T, f \right \rangle$. We personally think this is silly, but we will bring it up later when we discuss the Fourier transform, because this choice of notation hints at a deep mathematical theorem. The interpretation usually given to physics students is that this notation represents a “pairing” between $T$ and $f$, and that the way you can tell a generalized function from a regular function is by the fact that it’s defined by how it “operates” with other functions.

The most prominent example of this is the Dirac delta function $\delta$. When trying to define it, one might say ridiculous things like “it’s zero everywhere except at the origin, where it’s infinite,” and, “the integral from $-\infty$ to $\infty$ of the delta function is 1.” Both of these claims ludicrously defy classical analysis; the delta simply cannot be defined as a function. On the other hand, it has a very natural definition in terms of how it operates when “multiplied” by another function and then integrated. In our pairing notation, this definition is that $\left \langle \delta, f \right \rangle = f(0)$.

To prove this works rigorously, let $A$ be any space of functions for which convergence in $A$ implies pointwise convergence. Consider the mapping $\delta(f) = f(0)$. The linearity condition manifestly holds, and given a convergent sequence $f_n \to f$, we have $\delta(f_n) = f_n(0) \to f(0) = \delta(f)$, since we required pointwise convergence.
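To make the “defined by how it operates” idea concrete, here is a minimal sketch in which a generalized function is modeled as a plain Python callable that eats a function and returns a number. The names `delta` and `combine`, and the particular test functions, are our own illustrative choices and not part of any library.

```python
import math

def delta(f):
    """The Dirac delta as a functional: evaluate f at the origin."""
    return f(0.0)

def combine(a, f, b, g):
    """Return the function a*f + b*g."""
    return lambda x: a * f(x) + b * g(x)

f = math.cos
g = lambda x: x ** 2 + 3.0
a, b = 2.0, -1.5

# Linearity: delta(a f + b g) should equal a delta(f) + b delta(g).
lhs = delta(combine(a, f, b, g))
rhs = a * delta(f) + b * delta(g)
print(lhs, rhs)

# Continuity on one simple sequence: f_n = cos + 1/n converges pointwise
# to cos, and delta(f_n) = 1 + 1/n converges to delta(cos) = 1.
values = [delta(lambda x, n=n: math.cos(x) + 1.0 / n) for n in (1, 10, 100, 1000)]
print(values)
```

Of course a finite check like this proves nothing; it only shows that the functional point of view is easy to compute with.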

In fact, it will work in situations with weaker hypotheses (and for Fourier transforms the hypotheses certainly are weaker), but these details are beyond the scope of this primer (for instance, see this necessary and sufficient condition).

For the purpose of working with Fourier transforms, our generalized functions will be defined almost always by integration. That is, even though we have yet to figure out what $A$ is, we want a generalized function $T$ to act on a function using the usual kinds of inner products on function spaces like $L^2$.

One large class of examples of generalized functions are those which are induced by regular functions. That is, we will be able to take any function $g$ (discontinuous or wild as you like), and define the generalized function $T_g$ by

$\displaystyle T_g(f) = \int_{-\infty}^{\infty} g(x)f(x)dx$

Now, of course we require that $f$ satisfy whatever conditions we impose on $A$, and these will ensure that the integral always converges, no matter what $g$ is.
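As a rough numerical sketch of an induced generalized function, we can approximate the integral by a midpoint Riemann sum on a bounded interval. The helper name `induced`, the interval $[-10, 10]$, and the number of sample points are assumptions made for illustration; the truncation is harmless here only because the test function we feed in decays so quickly.

```python
import math

def induced(g, lo=-10.0, hi=10.0, n=20000):
    """Return T_g, which sends f to a midpoint-rule approximation
    of the integral of g(x) f(x) over [lo, hi]."""
    dx = (hi - lo) / n
    def T(f):
        total = 0.0
        for k in range(n):
            x = lo + (k + 0.5) * dx
            total += g(x) * f(x)
        return total * dx
    return T

# g is discontinuous (a step function); f is a rapidly decaying test function.
step = lambda x: 1.0 if x >= 0 else 0.0
gaussian = lambda x: math.exp(-math.pi * x * x)

T_step = induced(step)
half = T_step(gaussian)   # integral of exp(-pi x^2) over [0, inf), about 0.5
print(half)

# Linearity of T_step in its argument.
both = T_step(lambda x: 2.0 * gaussian(x) + 3.0 * gaussian(x - 1.0))
print(both, 2.0 * half + 3.0 * T_step(lambda x: gaussian(x - 1.0)))
```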

Now suppose our criteria for being “good for Fourier transforms” hold for $A$, and that $A$ is a class where we can define linear functionals by integration. Then we can define quite easily the Fourier transform of a generalized function on $A$.

There is a nice derivation here, and it goes like this: if $T$ is a generalized function and it happens to be induced by a Fourier-transformable function $g(x)$, then the Fourier transform of $T$ is defined by

$\displaystyle (\mathscr{F}T)(\varphi) = \int_{-\infty}^{\infty} \mathscr{F}g(x)\varphi(x)dx$

And since $g$ is transformable, we can expand the classical definition as an integral, giving

$\displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} e^{-2\pi ixy} g(y) dy \right ) \varphi(x) dx$

Swapping the order of integration we get

$\displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} e^{-2 \pi ixy} \varphi(x) dx \right ) g(y)dy$

And the inner integral is simply $\mathscr{F}\varphi$, giving

$\displaystyle \int_{-\infty}^{\infty} \mathscr{F}\varphi(y) g(y) dy = T(\mathscr{F}\varphi)$

since $T$ acts on a function by integrating it against $g$.
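As a sanity check on this derivation, the following sketch compares the two ends of the chain numerically for a pair of Gaussians. Everything here (the grid, the helper names `integrate` and `fourier`, and the choice of $g$ and $\varphi$) is an assumption made for illustration. Note that on a finite grid the two sides are literally the same double sum in a different order, so the agreement illustrates the swap of integrals rather than proving anything.

```python
import cmath
import math

LO, HI, N = -6.0, 6.0, 300
DX = (HI - LO) / N
GRID = [LO + (k + 0.5) * DX for k in range(N)]

def integrate(h):
    """Midpoint-rule approximation of the integral of h over [LO, HI]."""
    return sum(h(x) for x in GRID) * DX

def fourier(h):
    """Numerical Fourier transform: s -> integral of exp(-2 pi i s t) h(t) dt."""
    return lambda s: integrate(lambda t: cmath.exp(-2j * math.pi * s * t) * h(t))

g = lambda x: math.exp(-math.pi * x * x)             # induces T
phi = lambda x: math.exp(-math.pi * (x - 0.5) ** 2)  # the test function

Fg, Fphi = fourier(g), fourier(phi)
lhs = integrate(lambda x: Fg(x) * phi(x))   # integral of (F g) phi
rhs = integrate(lambda y: g(y) * Fphi(y))   # integral of g (F phi) = T(F phi)

# This g is its own Fourier transform, so both sides should also be close to
# the integral of g * phi, which works out to exp(-pi/8) / sqrt(2).
exact = math.exp(-math.pi / 8.0) / math.sqrt(2.0)
print(lhs, rhs, exact)
```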

But notice that the final form of our expression for $(\mathscr{F}T)(\varphi)$ does not at all require that $T$ is induced by $g$. All that we require is that $\varphi$ is transformable, and that is always true by our assumption that $A$ satisfies our original criterion. This motivates the definition:

Definition: The Fourier transform of a generalized function $T$ is defined by $(\mathscr{F}T)(f) = T(\mathscr{F}f)$. Similarly, the inverse Fourier transform is defined by $(\mathscr{F}^{-1}T)(f) = T(\mathscr{F}^{-1}f)$.

This is much neater, and in fact more general than a definition based on integration. The point of deriving it the way we did is so that we can have a theory which reduces to our classical notions given the right assumptions, but can be framed in other, perhaps unexpected contexts. Such is the beauty of mathematics.

Moreover, we note that the “pairing” notation hints at an interesting fact about Fourier transforms. In particular, the above definition says that $\left \langle \mathscr{F}T,f \right \rangle = \left \langle T, \mathscr{F}f \right \rangle$. One familiar with the basics of finite-dimensional functional analysis will recognize this immediately as the condition for $\mathscr{F}$ to be a self-adjoint operator. While we won’t discuss self-adjoint operators in this primer (we’ll save them for a future primer on functional analysis; we foresee this topic surfacing again when we cover support vector machines), we will note for the knowledgeable reader that with a few additional conditions, this is precisely what is going on with the Fourier transform. However, since we’re dealing with an infinite-dimensional vector space, we can’t quite say it’s a diagonalizable matrix, but it is “multiplication by a constant” in a sense. The relationship is most evident when again $T$ is induced by a regular function, and then the pairing is given by integration after multiplication by the “constant” $g$. Additionally, the fact that we represent a generalized function by the inner product notation hints at Riesz-type representation theorems, but we will press the issue no further here.

Sticking in this abstract land of an unknown $A$ a little longer, we can reprove some of the basic facts about Fourier transforms in this general setting. For instance,

$\displaystyle (\mathscr{F}\mathscr{F}^{-1}T)(f) = (\mathscr{F}^{-1}T)(\mathscr{F}f) = T(\mathscr{F}^{-1}\mathscr{F}f) = T(f)$

since the function $f$ is required to satisfy the criteria. So in other words, $\mathscr{F}\mathscr{F}^{-1}T = T$, and it is trivial to see the reverse composition yields the same.

A similarly easy proof recovers the identity that confused us last time, namely that $(\mathscr{FF}T)(f) = T(f^-)$, where $f^-(x) = f(-x)$ denotes the reversal of $f$. We leave it as an exercise to the reader.

## Schwartz Functions, and Tempered Distributions

It’s time to forgo the stalling and declare what the set of functions $A$ should be to make this all work. This bit is generally considered the hard work and inspired genius of a man named Laurent Schwartz, and so they are appropriately called Schwartz functions.

Definition: The class $S$ of rapidly decreasing functions on $\mathbb{R}$, or Schwartz functions, is the set of the smooth functions $f : \mathbb{R} \to \mathbb{C}$ which satisfy the following condition for all $m, n \in \mathbb{N}$:

$\displaystyle |x|^n \left | \frac{d^m}{dx^m} f(x) \right | \to 0$ as $x \to \pm \infty$

Recall, if forgotten, that smooth simply means infinitely differentiable. While we won’t get into the nitty gritty of proving the following facts (they’re quite analytic, and this author is an algebraist), we will state the important properties of $S$.

First and foremost, $S$ is a vector space which satisfies our criteria for being “good for Fourier transforms.” Another way of saying this is that the Fourier transform is a linear isomorphism on $S$. Second, $S$ includes all smooth functions with compact support; that is, it includes all smooth functions which are zero outside a closed and bounded set. Moreover, $S$ includes all Gaussian curves. Third, all functions in $S$ are uniformly continuous. And finally, $S \subset L^p$ for all $p \geq 1$.

In other words, things are as nice as we could ever hope for them to be, with respect to taking Fourier transforms.  Things converge, transforms are defined, and things just work.
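To get a feel for the decay condition in the definition of $S$, here is a small sketch that evaluates $|x|^n |f(x)|$ at a few increasingly large points. It only probes the case $m = 0$ (no derivatives), and the helper name `weighted` and the comparison function $1/(1+x^2)$ are our own choices for illustration. The Gaussian passes, while $1/(1+x^2)$, though smooth, decays too slowly to be a Schwartz function.

```python
import math

def weighted(f, n, x):
    """Return |x|^n * |f(x)|, the quantity in the Schwartz condition with m = 0."""
    return abs(x) ** n * abs(f(x))

gaussian = lambda x: math.exp(-x * x)
lorentzian = lambda x: 1.0 / (1.0 + x * x)   # smooth, but decays only like 1/x^2

for x in (5.0, 10.0, 20.0):
    # With n = 4 the Gaussian column shrinks rapidly; the other grows like x^2.
    print(x, weighted(gaussian, 4, x), weighted(lorentzian, 4, x))
```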

But of course, the real meat of the discussion comes when we analyze generalized functions on $S$. When the class is specifically $S$, these generalized functions are called tempered distributions. As we have already seen, the Dirac delta “function” is a tempered distribution. And with this new framework, we can start to compute the Fourier transforms of functions we couldn’t previously. For instance, the Fourier transform of the Dirac delta is the constant 1 function:

$\displaystyle \mathscr{F}\delta(\varphi) = \delta(\mathscr{F}\varphi) = \mathscr{F}\varphi(0)$

But $\mathscr{F}\varphi(0)$ is classically computable, and it’s just $\int_{-\infty}^{\infty}\varphi(x)dx$, which is another way to say $1(\varphi)$, where 1 is understood to be the tempered distribution induced by the constant 1 function. We just showed that $\mathscr{F}\delta(\varphi) = 1(\varphi)$ for all $\varphi \in S$. In other words, $\mathscr{F}\delta = 1$.
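The same computation can be sketched in code, with a tempered distribution modeled as a Python callable acting on functions and all integrals replaced by midpoint sums on a bounded grid. The helper names (`integrate`, `fourier`, `fourier_of`, `one`) and the test function are assumptions for illustration only.

```python
import cmath
import math

LO, HI, N = -8.0, 8.0, 4000
DX = (HI - LO) / N
GRID = [LO + (k + 0.5) * DX for k in range(N)]

def integrate(h):
    """Midpoint-rule approximation of the integral of h over [LO, HI]."""
    return sum(h(x) for x in GRID) * DX

def fourier(h):
    """Numerical Fourier transform: s -> integral of exp(-2 pi i s t) h(t) dt."""
    return lambda s: integrate(lambda t: cmath.exp(-2j * math.pi * s * t) * h(t))

def delta(f):
    """The Dirac delta: evaluation at the origin."""
    return f(0.0)

def one(f):
    """The distribution induced by the constant function 1."""
    return integrate(f)

def fourier_of(T):
    """Fourier transform of a distribution: (F T)(f) = T(F f)."""
    return lambda f: T(fourier(f))

phi = lambda x: (1.0 + x) * math.exp(-x * x)   # a test function in S

lhs = fourier_of(delta)(phi)   # (F delta)(phi) = (F phi)(0)
rhs = one(phi)                 # 1(phi), the integral of phi, about sqrt(pi)
print(lhs, rhs, math.sqrt(math.pi))
```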

This has the nice interpretation that being infinitely concentrated in the time domain (as is the delta) corresponds to being infinitely spread out in the frequency domain. Similarly, being spread out infinitely in the time domain is equivalent to being concentrated at a single point in the frequency domain, as the reader will have no trouble proving that $\mathscr{F}1 = \delta$. The eager reader will go ahead and find the Fourier transforms of the complex exponential $e^{2\pi iax}$ and of $\cos(2 \pi ax), \sin(2 \pi ax)$.

## Operations on Tempered Distributions

Now that we have tempered distributions, it makes sense to start investigating the various operations we can perform on them. As we just saw, the Fourier transform is one of them, but it is useful to have a couple of others.

We should note that even some basic operations aren’t defined for generalized functions. For instance, with regular functions $f,g$, we could compute the product $fg$. This is not defined for all generalized functions. In fact, since the space of generalized functions is a vector space, the only kinds of operations we can apply are linear ones. The Fourier transform counts as one, and so do addition and multiplication by a constant. Multiplication of two generalized functions is not a linear operation.

On the other hand, if one restricts to tempered distributions, one can compute the product of a tempered distribution with a function in $S$. The derivation of the definition follows the same philosophy that it did for the definition of the Fourier transform, and the computation is quite trivial. In fact, we get something slightly more general:

Definition: Given a function $f$ such that $f\varphi \in S$ for all $\varphi \in S$ and a tempered distribution $T$, we define $fT$ by $fT(\varphi) = T(f\varphi)$.

Continuing with the same derivation philosophy, we can define the derivative of a tempered distribution $T$:

Definition: The derivative of a tempered distribution $T$, denoted $T'$, is defined by $T'(\varphi) = -T(\varphi')$.

A special case of the product definition occurs when $T$ is the delta function; we have $f\delta = f(0)\delta$ as tempered distributions. Moreover, from the definition of the derivative it is easy to reprove the classical identities $\mathscr{F}T' = (2\pi is)\mathscr{F}T$, and $(\mathscr{F}T)' = \mathscr{F}(-2\pi it T)$.
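Both definitions are easy to play with numerically. The sketch below models distributions as Python callables, approximates the derivative of the test function by a central difference, and checks two facts: that $f\delta = f(0)\delta$, and that the derivative of the distribution induced by the unit step function acts like $\delta$ (a standard example, not derived in this post). The helper names, the grid, and the step size `h` are all assumptions for illustration.

```python
import math

def induced(g, lo=-10.0, hi=10.0, n=20000):
    """T_g: f -> midpoint-rule approximation of the integral of g(x) f(x)."""
    dx = (hi - lo) / n
    def T(f):
        return sum(g(lo + (k + 0.5) * dx) * f(lo + (k + 0.5) * dx)
                   for k in range(n)) * dx
    return T

def delta(f):
    """The Dirac delta: evaluation at the origin."""
    return f(0.0)

def multiply(f, T):
    """Product of a function and a distribution: (f T)(phi) = T(f phi)."""
    return lambda phi: T(lambda x: f(x) * phi(x))

def derivative_of(T, h=1e-5):
    """Derivative of a distribution: T'(phi) = -T(phi'),
    with phi' approximated by a central difference of width 2h."""
    def dT(phi):
        dphi = lambda x: (phi(x + h) - phi(x - h)) / (2.0 * h)
        return -T(dphi)
    return dT

heaviside = lambda x: 1.0 if x >= 0 else 0.0
phi = lambda x: math.exp(-(x - 1.0) ** 2)   # a test function in S
f = lambda x: x * x + 2.0                   # f * phi stays in S

product_check = (multiply(f, delta)(phi), f(0.0) * delta(phi))
dH_phi = derivative_of(induced(heaviside))(phi)
print(product_check)
print(dH_phi, delta(phi))   # the two numbers should nearly agree
```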

Finally, while the convolution of two tempered distributions is also undefined in general, some additional hypotheses allow it to work (see here), and then we can regain the theorem that $\mathscr{F}(f * T) = (\mathscr{F}f)(\mathscr{F}T)$. Again, a special case of this is the delta function: $f * \delta =f$ as tempered distributions.

This menagerie of properties all works toward reclaiming the theorems we proved about the classical Fourier transform. For the purpose of this blog, we will henceforth blur the distinction between the classical theory and this more complicated (and correct) theory of Fourier transforms. It’s a lame cop-out, we admit, but it allows us to focus on the less pedantic aspects of the theory and applications. This stuff took decades to iron out, and for good reason!

So next time we will continue with discrete Fourier transforms and multidimensional Fourier transforms. Until then!