Linear Programming and Healthy Diets — Part 1

Optimization is by far one of the richest ways to apply computer science and mathematics to the real world. Everybody is looking to optimize something: companies want to maximize profits, factories want to maximize efficiency, investors want to minimize risk, the list just goes on and on. The mathematical tools for optimization are also some of the richest mathematical techniques. They form the cornerstone of an entire industry known as operations research, and advances in this field literally change the world.

The mathematical field is called combinatorial optimization, and the name comes from the goal of finding optimal solutions more efficiently than an exhaustive search through every possibility. This post will introduce the most central problem in all of combinatorial optimization, known as the linear program. Even better, we know how to efficiently solve linear programs, so in future posts we’ll write a program that computes the most affordable diet while meeting the recommended health standard.

Generalizing a Specific Linear Program

Most optimization problems have two parts: an objective function, the thing we want to maximize or minimize, and constraints, rules we must abide by to ensure we get a valid solution. As a simple example you may want to minimize the amount of time you spend doing your taxes (objective function), but you certainly can’t spend a negative amount of time on them (a constraint).

The following more complicated example is the centerpiece of this post. Most people want to minimize the amount of money spent on food. At the same time, one needs to maintain a certain level of nutrition. For males ages 19-30, the United States National Institutes of Health recommends 3.7 liters of water per day, 1,000 milligrams of calcium per day, 90 milligrams of vitamin C per day, etc.

We can set up this nutrition problem mathematically, just using a few toy variables. Say we had the option to buy some combination of oranges, milk, and broccoli. Some rough estimates [1] give the following content/costs of these foods. For 0.272 USD you can get 100 grams of orange, containing a total of 40mg of calcium, 53.2mg of vitamin C, and 87g of water. For 0.100 USD you can get 100 grams of whole milk, containing 276mg of calcium, 0mg of vitamin C, and 87g of water. Finally, for 0.381 USD you can get 100 grams of broccoli containing 47mg of calcium, 89.2mg of vitamin C, and 91g of water. Here’s a table summarizing this information:

Nutritional content and prices for 100g of three foods

Food         calcium(mg)     vitamin C(mg)      water(g)   price(USD/100g)
Broccoli     47              89.2               91         0.381
Whole milk   276             0                  87         0.100
Oranges      40              53.2               87         0.272

Some observations: broccoli is the most expensive but has the most vitamin C and water, whole milk doesn’t have any vitamin C but gets a ton of calcium for really cheap, and oranges are somewhere in between. So you could probably tinker with the quantities and figure out what the cheapest healthy diet is. The problem is what happens when we incorporate hundreds or thousands of food items and tens of nutrient recommendations. This simple example is just to help us build up a nice formalism.

So let’s continue doing that. If we denote by $ b$ the number of 100g units of broccoli we decide to buy, and by $ m$ and $ r$ the numbers of 100g units of milk and oranges, respectively, then we can write the daily cost of food as

$ \displaystyle \text{cost}(b,m,r) = 0.381 b + 0.1 m + 0.272 r$

In the interest of being compact (and again, building toward the general linear programming formulation) we can extract the price information into a single cost vector $ c = (0.381, 0.1, 0.272)$, and likewise write our variables as a vector $ x = (b,m,r)$. We’re implicitly fixing an ordering on the variables that is maintained throughout the problem, but the choice of ordering doesn’t matter. Now the cost function is just the inner product (dot product) of the cost vector and the variable vector $ \left \langle c,x \right \rangle$. For some reason lots of people like to write this as $ c^Tx$, where $ c^T$ denotes the transpose of a matrix, and we imagine that $ c$ and $ x$ are matrices of size $ 3 \times 1$. I’ll stick to using the inner product bracket notation.

Now for each type of food we get a specific amount of each nutrient, and the sum of those nutrients needs to be bigger than the minimum recommendation. For example, we want at least 1,000 mg of calcium per day, so we require that $ 1000 \leq 47b + 276m + 40r$. Likewise, we can write out a table of the constraints by looking at the columns of our table above.

$ \displaystyle \begin{matrix} 91b & + & 87m & + & 87r & \geq & 3700 & \text{(water)}\\ 47b & + & 276m & + & 40r & \geq & 1000 & \text{(calcium)} \\ 89.2b & + & 0m & + & 53.2r & \geq & 90 & \text{(vitamin C)} \end{matrix}$

In the same way that we extracted the cost data into a vector to separate it from the variables, we can extract all of the nutrient data into a matrix $ A$, and the recommended minimums into a vector $ v$. Traditionally the letter $ b$ is used for the minimums vector, but for now we’re using $ b$ for broccoli.

$ A = \begin{pmatrix} 91 & 87 & 87 \\ 47 & 276 & 40 \\ 89.2 & 0 & 53.2 \end{pmatrix}$

$ v = \begin{pmatrix} 3700 \\ 1000 \\ 90 \end{pmatrix}$

And now the constraint is that $ Ax \geq v$, where the $ \geq$ means “greater than or equal to in every coordinate.” So now we can write down the more general form of the problem for our specific matrices and vectors. That is, our problem is to minimize $ \left \langle c,x \right \rangle$ subject to the constraint that $ Ax \geq v$. This is often written in offset form to contrast it with variations we’ll see in a bit:

$ \displaystyle \text{minimize } \left \langle c,x \right \rangle \\ \text{subject to the constraint } Ax \geq v$

In general there’s no reason a variable can’t be negative, but in this problem you can’t buy negative broccoli, so we’ll add constraints to ensure the variables are nonnegative. So our final form is

$ \displaystyle \text{minimize } \left \langle c,x \right \rangle \\ \text{subject to } Ax \geq v \\ \text{and } x \geq 0$

In general, if you have an $ m \times n$ matrix $ A$, a “minimums” vector $ v \in \mathbb{R}^m$, and a cost vector $ c \in \mathbb{R}^n$, the problem of finding the vector $ x$ that minimizes the cost function while meeting the constraints is called a linear programming problem or simply a linear program.

To satiate the reader’s burning curiosity, the solution for our toy diet problem is roughly $ x = (1.01, 41.47, 0)$. That is, you should have about 100g of broccoli and about 4.1kg of milk (roughly 4 liters), and skip the oranges entirely. The daily cost is about 4.53 USD. If this seems awkwardly large, it’s because there are cheaper ways to get water than milk.
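If you’d like to verify those numbers yourself before we build anything, here is a minimal sketch using scipy’s linprog as a stand-in solver (an assumption on my part; the post hasn’t committed to a tool yet). Since linprog minimizes subject to “$\leq$” constraints, we negate $ A$ and $ v$ to encode $ Ax \geq v$; nonnegativity is its default bound.

```python
# A quick sanity check of the toy diet LP using scipy as a stand-in solver.
# linprog minimizes <c, x> subject to A_ub x <= b_ub and the given bounds,
# so we encode Ax >= v as (-A)x <= -v.
import numpy as np
from scipy.optimize import linprog

c = np.array([0.381, 0.100, 0.272])    # price per 100g: broccoli, milk, oranges
A = np.array([[91.0, 87.0, 87.0],      # water (g)
              [47.0, 276.0, 40.0],     # calcium (mg)
              [89.2, 0.0, 53.2]])      # vitamin C (mg)
v = np.array([3700.0, 1000.0, 90.0])   # daily minimums

result = linprog(c, A_ub=-A, b_ub=-v, bounds=[(0, None)] * 3)
print(result.x)    # roughly [1.01, 41.47, 0.]
print(result.fun)  # roughly 4.53
```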

100g of broccoli (image source: 100-grams.blogspot.com)

[1] Water content of fruits and veggies; food costs in March 2014 in the midwest; and basic known facts about the water density/nutritional content of various foods.

Duality

Now that we’ve seen the general form of a linear program and a cute example, we can ask the real meaty question: is there an efficient algorithm that solves arbitrary linear programs? Despite how widely applicable these problems seem, the answer is yes!

But before we can describe the algorithm we need to know more about linear programs. For example, say you have some vector $ x$ which satisfies your constraints. How can you tell if it’s optimal? Without such a test we’d have no way to know when to terminate our algorithm. Another problem is that we’ve phrased the problem in terms of minimization, but what about problems where we want to maximize things? Can we use the same algorithm that finds minima to find maxima as well?

Both of these problems are neatly answered by the theory of duality. In mathematics in general, the best way to understand what people mean by “duality” is that one mathematical object uniquely determines two different perspectives, each useful in its own way. And typically a duality theorem provides one with an efficient way to transform one perspective into the other, and relate the information you get from both perspectives. A theory of duality is considered beautiful because it gives you truly deep insight into the mathematical object you care about.

In linear programming duality is between maximization and minimization. In particular, every maximization problem has a unique “dual” minimization problem, and vice versa. The really interesting thing is that the variables you’re trying to optimize in one form correspond to the constraints in the other form! Here’s how one might discover such a beautiful correspondence. We’ll use a made-up example with small numbers to make things easy.

So you have this optimization problem

$ \displaystyle \begin{matrix}
\text{minimize} & 4x_1+3x_2+9x_3 & \\
\text{subject to} & x_1+x_2+x_3 & \geq 6 \\
& 2x_1+x_3 & \geq 2 \\
& x_2+x_3 & \geq 1 & \\ & x_1,x_2,x_3 & \geq 0 \end{matrix}$

Just for giggles let’s write out what $ A$, $ c$, and $ v$ are.

$ \displaystyle A = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, c = (4,3,9), v = (6,2,1)$

Say you want to come up with a lower bound on the optimal solution to your problem. That is, you want to know that you can’t make $ 4x_1 + 3x_2 + 9x_3$ smaller than some number $ m$. The constraints can help us derive such lower bounds. In particular, every variable has to be nonnegative, so we know that $ 4x_1 + 3x_2 + 9x_3 \geq x_1 + x_2 + x_3 \geq 6$, and so 6 is a lower bound on our optimum. Likewise,

$ \displaystyle \begin{aligned}4x_1+3x_2+9x_3 & \geq 4x_1+2x_3+3x_2+3x_3 \\ &=2(2x_1 + x_3)+3(x_2+x_3) \\ & \geq 2 \cdot 2 + 3 \cdot 1 \\ &=7\end{aligned}$

and that’s an even better lower bound than 6. We could try to write this approach down in general: find some numbers $ y_1, y_2, y_3$ that we’ll use for each constraint to form

$ \displaystyle y_1(\text{constraint 1}) + y_2(\text{constraint 2}) + y_3(\text{constraint 3})$

To make it a valid lower bound we need to ensure that the coefficients of each of the $ x_i$ are smaller than the coefficients in the objective function (i.e. that the coefficient of $ x_1$ ends up less than 4). And to make it the best lower bound possible we want to maximize what the right-hand side of the inequality would be: $ 6y_1 + 2y_2 + y_3$. If you write out these equations and the constraints you get our “lower bound” problem written as

$ \displaystyle \begin{matrix} \text{maximize} & 6y_1 + 2y_2 + y_3 & \\ \text{subject to} & y_1 + 2y_2 & \leq 4 \\ & y_1 + y_3 & \leq 3 \\ & y_1+y_2 + y_3 & \leq 9 \\ & y_1,y_2,y_3 & \geq 0 \end{matrix}$

And wouldn’t you know, the matrix providing the constraints is $ A^T$, and the vectors $ c$ and $ v$ switched places.

$ \displaystyle A^T = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

This is no coincidence. All linear programs can be transformed in this way, and it would be a useful exercise for the reader to turn the above maximization problem back into a minimization problem by the same technique (computing linear combinations of the constraints to make upper bounds). You’ll be surprised to find that you get back to the original minimization problem! This is part of what makes it “duality,” because the dual of the dual is the original thing again. Often, when we fix the “original” problem, we call it the primal form to distinguish it from the dual form. Usually the primal problem is the one that is easy to interpret.

(Note: because we’re done with broccoli for now, we’re going to use $ b$ to denote the constraint vector that used to be $ v$.)

Now say you’re given the data of a linear program for minimization, that is the vectors $ c, b$ and matrix $ A$ for the problem, “minimize $ \left \langle c, x \right \rangle$ subject to $ Ax \geq b; x \geq 0$.” We can make a general definition: the dual linear program is the maximization problem “maximize $ \left \langle b, y \right \rangle$ subject to $ A^T y \leq c, y \geq 0$.” Here $ y$ is the new set of variables and the superscript T denotes the transpose of the matrix. The constraint for the dual is often written $ y^T A \leq c^T$, again identifying vectors with single-column matrices, but I find the swamp of transposes pointless and annoying (why do things need to be columns?).
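In code, the primal-to-dual transformation is almost nothing. Here is a minimal sketch of the definition above (the function name dual_of is made up for illustration):

```python
# Given the primal data (c, A, b) for "minimize <c, x> s.t. Ax >= b, x >= 0",
# return the data of the dual "maximize <b, y> s.t. A^T y <= c, y >= 0".
import numpy as np

def dual_of(c, A, b):
    return b, A.T, c   # dual objective, dual constraint matrix, dual bounds

A = np.array([[1, 1, 1], [2, 0, 1], [0, 1, 1]])
c = np.array([4, 3, 9])
b = np.array([6, 2, 1])

dual_objective, dual_matrix, dual_rhs = dual_of(c, A, b)
# i.e. maximize 6y1 + 2y2 + y3 subject to A^T y <= (4, 3, 9), y >= 0
```

Applying dual_of twice hands back the original data $ (c, A, b)$, which is the “dual of the dual” fact in miniature.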

Now we can actually prove that the objective function for the dual provides a bound on the objective function for the original problem. It’s obvious from the work we’ve done, which is why it’s called the weak duality theorem.

Weak Duality Theorem: Let $ c, A, b$ be the data of a linear program in the primal form (the minimization problem) whose objective function is $ \left \langle c, x \right \rangle$. Recall that the objective function of the dual (maximization) problem is $ \left \langle b, y \right \rangle$. If $ x,y$ are feasible solutions (satisfy the constraints of their respective problems), then

$ \left \langle b, y \right \rangle \leq \left \langle c, x \right \rangle$

In other words, the maximum of the dual is a lower bound on the minimum of the primal problem and vice versa. Moreover, any feasible solution for one provides a bound on the other.

Proof. The proof is pleasingly simple. Just inspect the quantity $ \left \langle A^T y, x \right \rangle = \left \langle y, Ax \right \rangle$. The constraints from the definitions of the primal and dual give us that

$ \left \langle y, b \right \rangle \leq \left \langle y, Ax \right \rangle = \left \langle A^Ty, x \right \rangle \leq \left \langle c,x \right \rangle$

The inequalities follow from the linear algebra fact that if $ u \geq 0$ coordinate-wise and $ v \leq w$ coordinate-wise, then $ \left \langle u,v \right \rangle \leq \left \langle u,w \right \rangle$. This is why we need the nonnegativity constraints.

In fact, the world is much more pleasing. There is a theorem that says the two optimums are equal!

Strong Duality Theorem: If there are any feasible solutions $ x,y$ to the primal (minimization) problem and the dual (maximization) problem, respectively, then the two problems also have optimal solutions $ x^*, y^*$, and two candidate solutions $ x^*, y^*$ are optimal if and only if they produce equal objective values $ \left \langle c, x^* \right \rangle = \left \langle y^*, b \right \rangle$.

The proof of this theorem is a bit more convoluted than the weak duality theorem, and the key technique is a lemma of Farkas and its variations. See the second half of these notes for a full proof. The nice thing is that this theorem gives us a way to tell if an algorithm to solve linear programs is done: maintain a pair of feasible solutions to the primal and dual problems, improve them by some rule, and stop when the two solutions give equal objective values. The hard part, then, is finding a principled and guaranteed way to improve a given pair of solutions.
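Here is a hedged numerical check of strong duality on the small made-up example from earlier, again leaning on scipy’s linprog as a stand-in solver. Since linprog only minimizes, we maximize the dual objective by minimizing its negation.

```python
# Solve the toy primal and dual from earlier and confirm the optima agree.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1, 1], [2, 0, 1], [0, 1, 1]])
c = np.array([4.0, 3, 9])
b = np.array([6.0, 2, 1])

primal = linprog(c, A_ub=-A, b_ub=-b)   # minimize <c,x> s.t. Ax >= b, x >= 0
dual = linprog(-b, A_ub=A.T, b_ub=c)    # maximize <b,y> s.t. A^T y <= c, y >= 0

print(primal.fun)   # 19.0
print(-dual.fun)    # 19.0, matching the primal optimum as strong duality promises
```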

On the other hand, you can also prove the strong duality theorem by inventing an algorithm that provably terminates. We’ll see such an algorithm, known as the simplex algorithm, in the next post. Sneak peek: it’s a lot like Gaussian elimination. Then we’ll use the algorithm (or an equivalent industry-strength version) to solve a much bigger nutrition problem.

In fact, you can do a bit better than the strong duality theorem, in terms of coming up with a stopping condition for a linear programming algorithm. You can observe that an optimal solution implies further constraints on the relationship between the primal and the dual problems. In particular, these are called the complementary slackness conditions, and they essentially say that if an optimal solution to the primal has a positive variable, then the corresponding constraint in the dual problem must be tight (an equality) in any optimal solution to the dual. The contrapositive says that if some constraint is slack (a strict inequality), then either the corresponding variable is zero or else the solution is not optimal. More formally,

Theorem (Complementary Slackness Conditions): Let $ A, c, b$ be the data of the primal form of a linear program, “minimize $ \left \langle c, x \right \rangle$ subject to $ Ax \geq b, x \geq 0$.” Then $ x^*, y^*$ are optimal solutions to the primal and dual problems if and only if all of the following conditions hold.

  • $ x^*, y^*$ are both feasible for their respective problems.
  • Whenever $ x^*_i > 0$ the corresponding constraint $ A^T_i y^* = c_i$ is an equality.
  • Whenever $ y^*_j > 0$ the corresponding constraint $ A_j x^* = b_j$ is an equality.

Here we denote by $ M_i$ the $ i$-th row of the matrix $ M$ and by $ v_i$ the $ i$-th entry of a vector $ v$. Another way to write the conditions using vectors instead of English is

$ \left \langle x^*, A^T y^* - c \right \rangle = 0$
$ \left \langle y^*, Ax^* - b \right \rangle = 0$

The proof follows from the duality theorems, and just involves pushing around some vector algebra. See section 6.2 of these notes.

One can interpret complementary slackness in linear programs in a lot of different ways. For us, it will simply be a termination condition for an algorithm: one can efficiently check all of these conditions for the nonzero variables and stop if they’re all satisfied or if we find a variable that violates a slackness condition. Indeed, in more mature optimization analyses, the slackness condition that is most egregiously violated can provide evidence for where a candidate solution can best be improved. For a more intricate and detailed story about how to interpret the complementary slackness conditions, see Section 4 of these notes by Joel Sobel.
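To make the termination check concrete, here is a small sketch for the made-up example above; the candidate solutions and the helper name is_optimal_pair are invented for illustration.

```python
# Check feasibility and complementary slackness for candidate primal/dual
# solutions of "minimize <c,x> s.t. Ax >= b, x >= 0" and its dual.
import numpy as np

A = np.array([[1.0, 1, 1], [2, 0, 1], [0, 1, 1]])
c = np.array([4.0, 3, 9])
b = np.array([6.0, 2, 1])

x_star = np.array([1.0, 5.0, 0.0])   # candidate primal solution
y_star = np.array([3.0, 0.5, 0.0])   # candidate dual solution

def is_optimal_pair(A, c, b, x, y, tol=1e-9):
    feasible = (np.all(A @ x >= b - tol) and np.all(x >= -tol)
                and np.all(A.T @ y <= c + tol) and np.all(y >= -tol))
    slack = (abs(np.dot(x, A.T @ y - c)) < tol
             and abs(np.dot(y, A @ x - b)) < tol)
    return feasible and slack

print(is_optimal_pair(A, c, b, x_star, y_star))  # True for this pair
```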

Finally, before we close we should note there are geometric ways to think about linear programming. I have my preferred visualization in my head, but I have yet to find a suitable animation on the web that replicates it. Here’s one example in two dimensions. The set of constraints defines a convex geometric region in the plane.

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

Now the objective function $ f(x) = \left \langle c,x \right \rangle$ is also linear, and if you fix some output value $ y = f(x)$, this defines a line in the plane. As $ y$ changes, the line moves along its normal vector (that is, all these fixed lines are parallel). Now to geometrically optimize the target function, we can imagine starting with the line $ f(x) = 0$, and sliding it along its normal vector in the direction that keeps it in the feasible region. We can keep sliding it in this direction, and the optimum of the function is just the last instant that this line intersects the feasible region. If none of the constraints are parallel to the family of lines defined by $ f$, then this is guaranteed to occur at a vertex of the feasible region. Otherwise, there will be a family of optima lying anywhere on the line segment of last intersection.

In higher dimensions, the only change is that the lines become affine subspaces of dimension $ n-1$. That means in three dimensions you’re sliding planes, in four dimensions you’re sliding 3-dimensional hyperplanes, etc. The facts about the last intersection being a vertex or a “line segment” still hold. So as we’ll see next time, successful algorithms for linear programming in practice take advantage of this observation by efficiently traversing the vertices of this convex region. We’ll see this in much more detail in the next post.

Until then!

The Fourier Transform — A Primer

In our last primer we saw the Fourier series, which fleshed out the notion that a periodic function can be represented as an infinite series of sines and cosines. While this is fine and dandy, and quite a powerful tool, it does not suffice for the real world. In the real world, very little is truly periodic, especially since human measurements can only record a finite period of time. Even things we wish to explore on this blog are hardly periodic (for instance, image analysis). To compensate, we’ll develop the Fourier transform, which can be thought of as a limiting case of the Fourier series.

We will approach the Fourier transform from two different angles. First, we will take the “limiting case of the Fourier series” notion as far as it will take us (this will be quite far) to motivate the appropriate definitions. In this naive world, we will perform a few interesting computations, and establish the most useful properties of the Fourier transform as an operation. On the other hand, this naive world will be fraught with hand-waving and mathematical laxity. We will make statements that analysts (and people who know about the issues of convergence) would find uncouth.

And so we will redevelop the bulk of the theory from the ground up, and define the Fourier transform on a larger class of things called distributions. In doing so, we will circumvent (or rather, properly handle) all of the issues of convergence.

Fourier and Naivete

The Fourier transform is best thought of as an operation on functions which has some nice properties. One such property is linearity, and a more complex one is its effect on the convolution operation. But if one wants to know where the transform comes from, and this is an important thing to know, then we could interpret it as a generalization of the Fourier series coefficient. Similarly, its inverse can be thought of as a generalization of the Fourier series representation (or, the map taking a function to its Fourier series representation). The generalization we speak of, and the “limiting case” we mentioned earlier, is in the size of the period. In rough terms,

The Fourier transform is the limit of the Fourier coefficient as the period of the function tends to infinity.

This is how we will develop the definition of the Fourier transform, and the reader should understand why this is a sensible place to start: a function which has no period is simply a function which has an infinitely large period. Colloquially, one can “spot” periodicity much easier if the period is shorter, and as the period increases, functions start to “look” less and less periodic. So in the limiting case, there is no period.

In order to do this correctly, we should alter some of the definitions we made in our post on Fourier series. Specifically, we want to have an arbitrary period $ T$. So instead of making the 1-periodic complex exponential $ e^{2\pi i k t}$ our basic building block, we choose $ e^{2 \pi i t k/T}$. The reader will check that as $ k$ varies over all integers, these new complex exponentials still form an orthonormal basis of $ L^2$ (well not quite, we have to modify the inner product slightly; see below). Then, using the notation we used last time for the Fourier coefficients $ c_k$, the series is calculated as

$ \displaystyle \sum_{k = -\infty}^{\infty} c_k e^{\frac{2 \pi i k t}{T}}$

where the $ c_k$ are computed via the new inner product:

$ \displaystyle c_k = \frac{1}{T}\int_{0}^{T}e^{\frac{-2 \pi i k t}{T}} f(t)dt$

We make another slight alteration in the limits of integration:

$ \displaystyle c_k = \frac{1}{T} \int_{-T/2}^{T/2} e^{\frac{-2 \pi i k t}{T}} f(t)dt$

This doesn’t change the integral, since a  $ T$-periodic function has the same integral on any interval of length $ T$.

Before we continue, we should show an intuitive aspect of what happens when $ T \to \infty$. We can think of the usual Fourier series representation of a 1-periodic function $ f$ as a function on the integers whose values are the Fourier coefficients $ k \mapsto c_k$, since the coefficients completely determine the representation of $ f$.  The interval between the inputs to this mapping is 1, because the function is 1-periodic. If we generalize this to an arbitrary period $ T$, then we have functions whose inputs are multiples of $ 1/T$. So the intervals between adjacent inputs shrink to $ 1/T$. As $ T$ grows, the inputs are moving closer and closer together. In other words, the Fourier series representation is becoming a continuous mapping of the frequency! This viewpoint will motivate our seemingly magical choices to follow, and it partially justifies the common notion that the Fourier transform takes a function “in the time domain” and represents it “in the frequency domain.” It’s a stupid notion for mathematicians, but everyone says it anyway.

So if we try to take the limit immediately, that is, if we use the exact formula above for $ c_k$ and try to evaluate $ \lim_{T \to \infty} c_k$, we have issues. The problem rears its ugly head when we let $ f$ be a function with bounded support (that is, $ f(x)$ is zero everywhere except possibly on a finite interval). If $ f$ is zero outside of $ [a,b]$ and $ \int_a^b |f(x)|dx \leq M$ is finite, then once $ T$ is large enough that $ [a,b] \subset [-T/2, T/2]$, all of the Fourier coefficients go to zero as $ T \to \infty$. The details:

$ \displaystyle |c_k| = \left | \frac{1}{T}\int_{-T/2}^{T/2} e^{\frac{-2 \pi i kt}{T}} f(t)dt \right | \leq \frac{1}{T} \int_{-T/2}^{T/2} \left | e^{\frac{-2 \pi i k t}{T}} \right | |f(t)| dt$

But as the absolute value of the complex exponential is 1, we can bound this by

$ \displaystyle \frac{1}{T} \int_{-T/2}^{T/2} |f(t)|dt \leq \frac{M}{T}$,

and as $ T \to \infty$, we see that the whole thing goes to 0.

The solution is (magic!) to scale linearly by $ T$, and pull the balancing factor of $ 1/T$ outside of the coefficients $ c_k$, and into the Fourier series itself. In other words, our new Fourier series is (written with the terms rearranged for good reason)

$ \displaystyle \sum_{k = -\infty}^{\infty} e^{\frac{2 \pi i k t}{T}} c_k (1/T) \frac{1}{T}$

where the coefficients $ c_k(1/T)$ are

$ \displaystyle c_k(1/T) = \int_{-T/2}^{T/2}e^{\frac{-2 \pi i k t}{T}} f(t)dt$

Now we suddenly (magically!) realize that the first equation is just the usual Riemann sum from calculus for the estimate of an integral. If we think of $ x$ as our variable, then we’d be integrating $ e^{2 \pi i t x} c_k(x)$. And in particular, when the interval between the $ x$ values goes to 0, the discrete sum converges to an integral by definition. Let us denote by $ s$ the continuous variable that the grid points $ k/T$ fill in as $ T \to \infty$. Then we redefine the two above equations with glorious new names:

Definition: The Fourier transform of a function $ f: \mathbb{R} \to \mathbb{C}$ is the integral

$ \displaystyle \mathscr{F}f(s) = \int_{-\infty}^{\infty}e^{-2 \pi i s t} f(t)dt$

whenever such an integral converges.

The inverse Fourier transform of $ g$ is the integral

$ \displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^{\infty} e^{2\pi i s t}g(s)ds$

whenever such an integral converges.

And so, the Fourier transform above generalizes the Fourier coefficient $ c_k$ (the limits of integration go to infinity), while the inverse transform generalizes the Fourier series reconstruction, by our conversion from a discrete sum to an integral.

We should note a few things about this definition, because it is quite a mouthful. First, $ \mathscr{F}$ and $ \mathscr{F}^{-1}$ operate on functions. In other words, they accept a function as input, and their values are functions. Still in other words, the parentheses are like $ (\mathscr{F}f)(s)$, and not like $ \mathscr{F}(f(s))$. We will often omit the parentheses with this implicitly understood precedence. This is also part of why we choose different variables. $ f(t)$ will often use a different variable than its transform $ \mathscr{F}f(s)$.

Second, returning to our remark about stupid notions, the function $ f(t)$ can be thought of as being in the “time domain,” where the inputs are instances of time, while the transformed function $ \mathscr{F}f$ is in the “frequency domain.” That is, for a given input $ s$, the Fourier transform $ \mathscr{F}f(s)$ describes how the complex exponential with frequency $ s$ contributes to the overall function $ f$. The set of values of the Fourier transform of $ f$ is called the spectrum of $ f$. One can visualize the spectrum, but only indirectly. A complex-valued function is always hard to visualize, so instead one graphs $ |\mathscr{F}f(s)|$ as a function on the real numbers. We then get pretty pictures like this one giving the spectrum of some human-spoken words:

This also explains the humorous comic at the beginning of this post: the thing saying “meow” is the spectrum of a cat, complete with whiskers. The comic also reinforces the idea that $ \mathscr{F}$ is simply an operation on functions. One does not need to restrict $ \mathscr{F}$ to operate on functions whose domain is time (indeed, a cat is not a function of time). It’s just an instance in which one can concretely interpret the transform of a function. For example, if one wanted to (and we will shortly), one could wonder about $ \mathscr{FF}f$, or even apply $ \mathscr{F}$ arbitrarily many times and see what happens under the limit. The same thing goes for the inverse Fourier transform.

The last big issue we have with the above definition is that it only makes sense when the integral actually converges. We will run into a few examples where this becomes a big problem, but we will sweep these issues under the rug for now (remember, this is still the land of naivete).

Nevertheless, we can do some wonderful computations with this mentality and this new definition. It will benefit us in the long run, because we’ll discover the useful properties of the Fourier transform now, and use those properties to steer the more rigorous formalities later.

Elementary Transforms and Elementary Properties

Armed with the mighty definition of the Fourier transform, we can take two paths. We can compute the transforms of various elementary functions, or we can develop tools to construct transforms of combinations of functions by computing the transforms of their constituent parts. This is largely the same approach one takes in studying derivatives and integrals in classical calculus: one learns to compute the derivatives of polynomials, logarithms, and exponentials; and then one learns to compute derivatives of products, sums, and compositions of functions. We will operate with the same philosophy for now.

Example: Let $ f = \chi_{[-1/2, 1/2]}$ be the characteristic function of the interval from -1/2 to 1/2. That is, it is the function which is 1 on that interval and zero everywhere else. We will show $ \mathscr{F}f = \frac{\sin(\pi s)}{\pi s}$ by appealing directly to the definition of the Fourier transform.

$ \displaystyle \mathscr{F}f(s) = \int_{-\infty}^{\infty} e^{-2 \pi i s t} \chi_{[-1/2, 1/2]}(t) dt$

Since $ f$ is zero outside of the chosen interval, and one inside, we can simplify the integral by adjusting the limits to -1/2 and 1/2, and inside simply using $ f(t) = 1$:

$ \displaystyle \mathscr{F}f(s) = \int_{-1/2}^{1/2}e^{-2 \pi ist} dt$

And this is quite tractable. Integrating the complex exponential as usual, we have:

$ \displaystyle \mathscr{F}f(s) = \left ( \frac{1}{-2 \pi is} e^{-2 \pi ist} \right ) \Big |_{-1/2}^{1/2} = \frac{e^{- \pi is} - e^{\pi is}}{-2 \pi is} = \frac{\sin(\pi s)}{\pi s}$

Where the last equality follows from the classic identity $ e^{ix} = \cos(x) + i\sin(x)$. The result of this Fourier transform is so pervasive that it has its own name: $ \textup{sinc}(s) = \frac{\sin(\pi s)}{\pi s}$.
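If you’d rather not trust the symbol-pushing, here is a rough numerical check. The assumption is that a fine Riemann sum over $ [-1/2, 1/2]$ stands in for the integral; by symmetry the imaginary parts cancel, so we compare real parts only.

```python
# Numerically approximate the Fourier transform of the characteristic function
# of [-1/2, 1/2] by a Riemann sum and compare it to sinc(s) = sin(pi s)/(pi s).
import numpy as np

def ft_of_box(s, n=200000):
    t, dt = np.linspace(-0.5, 0.5, n, retstep=True)
    return np.sum(np.exp(-2j * np.pi * s * t)) * dt

for s in [0.25, 0.5, 1.5, 2.25]:
    approx = ft_of_box(s).real
    exact = np.sin(np.pi * s) / (np.pi * s)
    print(s, approx, exact)   # the two values agree to several decimal places
```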

Exercise: Let $ \Lambda(t)$ be the piecewise function defined as $ 1 – |t|$ if $ |t| < 1$, and zero otherwise. Prove that $ \mathscr{F}\Lambda(s) = \textup{sinc}^2(s) = \frac{\sin^2(\pi s)}{(\pi s)^2}$.

Again, this one follows straight from the definition, which must be computed piecewise to handle the absolute value.

Example: Let $ f$ be the Gaussian $ f(t) = e^{- \pi t^2}$. Then $ \mathscr{F}f(s) = f(s)$. That is, the Gaussian is fixed by the Fourier transform.

This is a very special property of the Gaussian, hinting at the special relationship between Fourier transforms and smoothness of curves. In order to prove it we need to borrow a fact from complex analysis, that $ \int_{-\infty}^{\infty} e^{- \pi t^2} dt = 1$. Note that here the indefinite integral of $ f$ cannot be expressed in elementary terms, so basic calculus tools are insufficient to prove this fact. A proof is most easily accessible using complex integration and residue theory, and Wikipedia provides a proof that does the same thing using a real parameterization to make it seem more elementary.

To find the Fourier transform, we again appeal to the definition, except this time we use some tricks. First, we differentiate the definition with respect to $ s$, and then integrate the result by parts to arrive at an ordinary differential equation, which we know how to solve. Set $ F(s) = \mathscr{F}f(s)$ for ease of notation.

$ \displaystyle F(s) = \int_{-\infty}^{\infty} e^{-2 \pi ist} e^{-\pi t^2}dt$

Differentiating with respect to $ s$, we have

$ \displaystyle F'(s) = \frac{d}{ds}\int_{-\infty}^{\infty} e^{-2 \pi ist}e^{-\pi t^2}dt = \int_{-\infty}^{\infty} \frac{d}{ds}(e^{-2 \pi ist}) e^{-\pi t^2} dt$

Performing the differentiation and regrouping terms we have

$ \displaystyle F'(s) = i \int_{-\infty}^{\infty}e^{-2\pi ist}(-2\pi t e^{-\pi t^2}) dt$

Now integrating by parts with respect to $ t$, and recognizing that the term $ e^{-2 \pi ist} e^{-\pi t^2}$ tends to zero both as $ t \to \pm \infty$, we get

$ \displaystyle F'(s) = -i \int_{-\infty}^{\infty} e^{-\pi t^2}(-2\pi is)e^{-2 \pi ist} dt = -2\pi s F(s)$

As we claimed earlier, this is a simple ordinary differential equation, which has solution

$ \displaystyle F(s) = F(0)e^{-\pi s^2} = F(0)f(s)$

And here $ F(0) = \int_{-\infty}^{\infty} e^{-2 \pi i t 0}e^{-\pi t^2} dt = 1$, as we claimed from the beginning. This completes the proof, as $ F(s) = f(s)$.
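The same kind of numerical sanity check works for the Gaussian. One assumption: the integration range is truncated to $ [-10, 10]$, which is harmless because $ e^{-\pi t^2}$ decays extremely fast.

```python
# Approximate the Fourier transform of the Gaussian e^{-pi t^2} by a Riemann
# sum on [-10, 10] and compare with the Gaussian itself evaluated at s.
import numpy as np

def ft_of_gaussian(s, n=200001):
    t, dt = np.linspace(-10.0, 10.0, n, retstep=True)
    return np.sum(np.exp(-2j * np.pi * s * t) * np.exp(-np.pi * t ** 2)) * dt

for s in [0.0, 0.5, 1.0, 2.0]:
    print(s, ft_of_gaussian(s).real, np.exp(-np.pi * s ** 2))  # columns agree closely
```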

Next, we will focus on the rules for taking Fourier transforms of functions combined in various ways. First, we note that the Fourier transform is linear, in the very same sense as linear maps between vector spaces in elementary linear algebra. In particular, the linearity of the integral gives

$ \displaystyle \mathscr{F}(af + bg)(s) = a \mathscr{F}f(s) + b\mathscr{F}g(s)$

Other easy properties arise from modifying the input of $ f$, and using multiplicative properties of the complex exponential. For instance, if we let $ f_a(t) = f(t - a)$, we see that $ \mathscr{F}f_a(s) = e^{-2 \pi ias}\mathscr{F}f(s)$. This follows by a simple change of variables in the integral. Letting $ u = t-a$,

$ \displaystyle \mathscr{F}f_a(s) = \int_{-\infty}^{\infty} e^{-2\pi is(u+a)}f(u)du$

And we can trivially factor out the needed complex exponential coefficient, and work with $ u$ as usual. One convenient interpretation of this formula is that a shift in time corresponds to a phase shift in frequency. Once again, we caution that these interpretations are a tool; they can massively confuse the issue when, say, the domain of $ f$ is frequency to begin with.

Similar considerations give us a formula for the scaled function $ f(at)$ whose transform is $ \frac{1}{|a|} \mathscr{F}f(\frac{s}{a})$. We leave this as an exercise to the reader (hint: break it into two cases for when $ a<0, a>0$).

Next, we note that the Fourier transform turns derivatives of some functions into a very manageable product. Rigorously, if $ \lim_{t \to \pm \infty} f(t) = 0$ then

$ \displaystyle \mathscr{F}\frac{d^nf}{dt^n}(s) = (2\pi i s)^n \mathscr{F}f(s)$

We can prove this by induction. We’ll just prove the base case:

$ \displaystyle \mathscr{F}f'(s) = \int_{-\infty}^{\infty} e^{-2 \pi ist}f'(t)dt$

Integrating by parts, we get

$ \displaystyle \mathscr{F}f'(s) = e^{-2\pi ist}f(t) \Big |_{-\infty}^{\infty} - \int_{-\infty}^{\infty} (-2\pi is)e^{-2\pi ist}f(t)dt$

And since $ f$ tends to zero at $ \pm \infty$ and the complex exponential has a constant norm, the first term (evaluated from $ -\infty$ to $ \infty$) vanishes, leaving our desired product. The inductive step follows with the ease of iterated integration by parts. Note that although this example only holds for functions which tend to zero at $ \pm \infty$, next time we will rectify the situation by restricting our theory to functions which are “the best candidates” for the theory of Fourier analysis and eliminate the need for such hypotheses.

The More Interesting Properties

The final two properties of the Fourier transform that we will inspect are in a sense deeper and more substantial than those above. In particular, we will establish the duality of the Fourier transform, and the effect of Fourier transforms on convolution.

First, the Fourier transform has a few notions of duality. Let $ f^-(t)$ denote $ f(-t)$. One such duality notion is the following, which is a trivial consequence of the definitions of the transform and its inverse:

$ \displaystyle (\mathscr{F}f)^- = \mathscr{F}^{-1}f$

Similarly, a minor change of variables shows that $ \mathscr{F}(f^-) = \mathscr{F}^{-1}f$. Chaining these together, we have the nice identity

$ \displaystyle \mathscr{F}(f^-) = (\mathscr{F}f)^-$

A simple corollary is that $ \mathscr{FF}f = f^-$. This allows us to compute the Fourier transforms of some potentially unmanageable functions. For instance, let us return to our friend the sinc function.

$ \displaystyle \mathscr{F}\textup{sinc}(s) = \mathscr{FF}(\chi_{[-1/2,1/2]}) = \chi^-_{[-1/2, 1/2]} = \chi_{[-1/2, 1/2]}$

by the symmetry of the characteristic function. On the other hand, it’s ridiculously counterintuitive that the following integral is actually the characteristic function of a finite interval:

$ \displaystyle \int_{-\infty}^{\infty} e^{-2 \pi ist} \frac{\sin(\pi t)}{\pi t} dt$

In fact, even though we just “proved” that the sinc function has a nice transform, it is hardly clear how to integrate it. In fact, the sinc function is not even (Lebesgue) integrable! Without further qualifications, the above expression is complete nonsense.

Historically, this is the point at which the physicists contact the mathematicians and say, “We dun broked it!” Because the physicists went ahead and used these naively impossible transforms to do amazing things and discover elegant identities, the mathematicians are left to design a sensible theory to support their findings. The field is rife with such inconsistencies, and this is not the last one we will see before consolidating the theory. Perhaps this is in part because successful applications in engineering outpace mathematical rigor. Glibly, they’re racing for profit while the mathematicians want to architect a flawless foundation in which deep theorems are manifestly obvious.

Getting back to the properties of Fourier transforms, we have saved perhaps the most useful one for last. In short, Fourier transforms turn convolution into multiplication. Convolutions, both continuous and discrete, make cameo appearances all over applied mathematics, from signal processing and image analysis to quantum mechanics and mathematical finance. In our applications, we will use the following properties of convolution to modify the spectrum of a signal, for such purposes as removing noise, or filtering out low/high/mid frequency regions. Without further ado, the definition:

Definition: The convolution of $ f$ and $ g$, denoted $ f \ast g$, is the integral

$ \displaystyle (f \ast g)(x) = \int_{-\infty}^{\infty} g(x-y)f(y)dy,$

should such an integral converge. Otherwise the convolution is undefined.

Often convolution is interpreted as some sort of stretch+translation of a function by another, but we find such meager interpretations mathematically flaccid. Convolution is simply an operation that combines functions in an interesting way (perhaps its definition is motivated by the question below). Nevertheless, Wikipedia provides a number of relevant animations showing convolution in action.

So the leading question here is, what happens when one takes the product of $ (\mathscr{F}f)(\mathscr{F}g)$? From the definition, this is

$ \displaystyle \left ( \int_{-\infty}^{\infty}e^{-2\pi isx}f(x)dx \right ) \left ( \int_{-\infty}^{\infty} e^{-2 \pi ist} g(t) dt \right )$

We may combine the integrals into a double integral, and further combine the complex exponentials, getting

$ \displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} e^{-2 \pi is (t+x)}g(t) dt \right ) f(x) dx$

Substituting $ u = t+x$, we have

$ \displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} e^{-2\pi isu} g(u-x)du \right ) f(x) dx$

And swapping the order of integration,

$ \displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} g(u-x)f(x)dx \right ) e^{-2\pi isu} du = \mathscr{F}(g \ast f)$

(The parenthetical quantity drove our definition of the convolution to begin with.) And so we have the beautiful identity:

$ \mathscr{F}(f \ast g)(s) = \mathscr{F}f(s) \mathscr{F}g(s)$

We will use this as follows: multiply the Fourier transform of a signal by an appropriate characteristic function (the characteristic function of the set of “good” frequencies of, say, a sound clip) and then take the inverse transform of the product, getting as a result a modified signal with certain frequencies removed.
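As a preview of that workflow, here is a sketch of exactly this kind of filtering. The caveat is that it uses numpy’s FFT as a stand-in for the discrete theory we haven’t developed yet, and the signal, noise level, and 10 Hz cutoff are all arbitrary choices for illustration.

```python
# Remove high-frequency noise from a signal by multiplying its (discrete)
# Fourier transform by the characteristic function of the "good" frequencies,
# then taking the inverse transform of the product.
import numpy as np

sample_rate = 1000                         # samples per second
t = np.arange(0, 1, 1 / sample_rate)       # one second of samples
signal = np.sin(2 * np.pi * 5 * t)         # a clean 5 Hz tone
noisy = signal + 0.5 * np.random.randn(len(t))

spectrum = np.fft.rfft(noisy)
freqs = np.fft.rfftfreq(len(t), d=1 / sample_rate)
good = freqs <= 10                         # characteristic function of [0, 10] Hz
filtered = np.fft.irfft(spectrum * good, n=len(t))

print(np.max(np.abs(noisy - signal)))      # the noise we added
print(np.max(np.abs(filtered - signal)))   # much smaller after filtering
```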

There are a few hurdles between here and there (at least, as far as this blog goes). First, we must compensate for our convergence naivete with mathematical rigor. Next time, we will define the class of Schwartz functions, from which we will derive a class of “generalized functions,” intuitively constituting the class of “transformable” functions. After that, we must needs find a suitable discrete approximation of the Fourier transform. In real life, all signals are sampled sequences of numbers. As such, we cannot take their integrals, and must convert these continuous notions to operations on sequences. Finally, we need to investigate an algorithm to efficiently compute such a discrete Fourier transform. Then, and only then, may we proceed with writing programs to do great things.

So look forward to all of the above in the coming weeks. Until next time!