Negacyclic Polynomial Multiplication

In this article I’ll cover three techniques to compute special types of polynomial products that show up in lattice cryptography and fully homomorphic encryption. Namely, the negacyclic polynomial product, which is the product of two polynomials in the quotient ring $\mathbb{Z}[x] / (x^N + 1)$. As a precursor to the negacyclic product, we’ll cover the simpler cyclic product.

All of the Python code written for this article is on GitHub.

The DFT and Cyclic Polynomial Multiplication

A recent program gallery piece showed how single-variable polynomial multiplication could be implemented using the Discrete Fourier Transform (DFT). This boils down to two observations:

  1. The product of two polynomials $f, g$ can be computed via the convolution of the coefficients of $f$ and $g$.
  2. The Convolution Theorem, which says that the Fourier transform of a convolution of two signals $f, g$ is the point-wise product of the Fourier transforms of the two signals. (The same holds for the DFT)

This provides a much faster polynomial product operation than one could implement using the naïve polynomial multiplication algorithm (though see the last section for an implementation anyway). The DFT can be used to speed up large integer multiplication as well.

A caveat with normal polynomial multiplication is that one needs to pad the input coefficient lists with enough zeros so that the convolution doesn’t “wrap around.” That padding results in the output having length at least as large as the sum of the degrees of $f$ and $g$ (see the program gallery piece for more details).

If you don’t pad the polynomials, instead you get what’s called a cyclic polynomial product. More concretely, if the two input polynomials $f, g$ are represented by coefficient lists $(f_0, f_1, \dots, f_{N-1}), (g_0, g_1, \dots, g_{N-1})$ of length $N$ (implying the inputs are degree at most $N-1$, i.e., the lists may end in a tail of zeros), then the Fourier Transform technique computes

\[ f(x) \cdot g(x) \mod (x^N – 1) \]

This modulus is in the sense of a quotient ring $\mathbb{Z}[x] / (x^N – 1)$, where $(x^N – 1)$ denotes the ring ideal generated by $x^N-1$, i.e., all polynomials that are evenly divisible by $x^N – 1$. A particularly important interpretation of this quotient ring is achieved by interpreting the ideal generator $x^N – 1$ as an equation $x^N – 1 = 0$, also known as $x^N = 1$. To get the canonical ring element corresponding to any polynomial $h(x) \in \mathbb{Z}[x]$, you “set” $x^N = 1$ and reduce the polynomial until there are no more terms with degree bigger than $N-1$. For example, if $N=5$ then $x^{10} + x^6 – x^4 + x + 2 = -x^4 + 2x + 3$ (the $x^{10}$ becomes 1, and $x^6 = x$).

To prove the DFT product computes a product in this particular ring, note how the convolution theorem produces the following formula, where $\textup{fprod}(f, g)$ denotes the process of taking the Fourier transform of the two coefficient lists, multiplying them entrywise, and taking a (properly normalized) inverse FFT, and $\textup{fprod}(f, g)(j)$ is the $j$-th coefficient of the output polynomial:

\[ \textup{fprod}(f, g)(j) = \sum_{k=0}^{N-1} f_k g_{j-k \textup{ mod } N} \]

In words, the output polynomial coefficient $j$ equals the sum of all products of pairs of coefficients whose indices sum to $j$ when considered “wrapping around” $N$. Fixing $j=1$ as an example, $\textup{fprod}(f, g)(1) = f_0 g_1 + f_1g_0 + f_2 g_{N-1} + f_3 g_{N-2} + \dots$. This demonstrates the “set $x^N = 1$” interpretation above: the term $f_2 g_{N-1}$ corresponds to the product $f_2x^2 \cdot g_{N-1}x^{N-1}$, which contributes to the $x^1$ term of the polynomial product if and only if $x^{2 + N-1} = x$, if and only if $x^N = 1$.

To achieve this in code, we simply use the version of the code from the program gallery piece, but fix the size of the arrays given to numpy.fft.fft in advance. We will also, for simplicity, assume the $N$ one wishes to use is a power of 2. The resulting code is significantly simpler than the original program gallery code (we omit zero-padding to length $N$ for brevity).

import numpy
from numpy.fft import fft, ifft

def cyclic_polymul(p1, p2, N):
    """Multiply two integer polynomials modulo (x^N - 1).

    p1 and p2 are arrays of coefficients in degree-increasing order.
    """
    assert len(p1) == len(p2) == N
    product = fft(p1) * fft(p2)
    inverted = ifft(product)
    return numpy.round(numpy.real(inverted)).astype(numpy.int32)

As a side note, there’s nothing that stops this from working with polynomials that have real or complex coefficients, but so long as we use small magnitude integer coefficients and round at the end, I don’t have to worry about precision issues (hat tip to Brad Lucier for suggesting an excellent paper by Colin Percival, “Rapid multiplication modulo the sum and difference of highly composite numbers“, which covers these precision issues in detail).

Negacyclic polynomials, DFT with duplication

Now the kind of polynomial quotient ring that shows up in cryptography is critically not $\mathbb{Z}[x]/(x^N-1)$, because that ring has enough easy-to-reason-about structure that it can’t hide secrets. Instead, cryptographers use the ring $\mathbb{Z}[x]/(x^N+1)$ (the minus becomes a plus), which is believed to be more secure for cryptography—although I don’t have a great intuitive grasp on why.

The interpretation is similar here as before, except we “set” $x^N = -1$ instead of $x^N = 1$ in our reductions. Repeating the above example, if $N=5$ then $x^{10} + x^6 – x^4 + x + 2 = -x^4 + 3$ (the $x^{10}$ becomes $(-1)^2 = 1$, and $x^6 = -x$). It’s called negacyclic because as a term $x^k$ passes $k \geq N$, it cycles back to $x^0 = 1$, but with a sign flip.

The negacyclic polynomial multiplication can’t use the DFT without some special hacks. The first and simplest hack is to double the input lists with a negation. That is, starting from $f(x) \in \mathbb{Z}[x]/(x^N+1)$, we can define $f^*(x) = f(x) – x^Nf(x)$ in a different ring $\mathbb{Z}[x]/(x^{2N} – 1)$ (and similarly for $g^*$ and $g$).

Before seeing how this causes the DFT to (almost) compute a negacyclic polynomial product, some math wizardry. The ring $\mathbb{Z}[x]/(x^{2N} – 1)$ is special because it contains our negacyclic ring as a subring. Indeed, because the polynomial $x^{2N} – 1$ factors as $(x^N-1)(x^N+1)$, and because these two factors are coprime in $\mathbb{Z}[x]/(x^{2N} – 1)$, the Chinese remainder theorem (aka Sun-tzu’s theorem) generalizes to polynomial rings and says that any polynomial in $\mathbb{Z}[x]/(x^{2N} – 1)$ is uniquely determined by its remainders when divided by $(x^N-1)$ and $(x^N+1)$. Another way to say it is that the ring $\mathbb{Z}[x]/(x^{2N} – 1)$ factors as a direct product of the two rings $\mathbb{Z}[x]/(x^{N} – 1)$ and $\mathbb{Z}[x]/(x^{N} + 1)$.

Now mapping a polynomial $f(x)$ from the bigger ring $(x^{2N} – 1)$ to the smaller ring $(x^{N}+1)$ involves taking a remainder of $f(x)$ when dividing by $x^{N}+1$ (“setting” $x^N = -1$ and reducing). There are many possible preimage mappings, depending on what your goal is. In this case, we actually intentionally choose a non preimage mapping, because in general to compute a preimage requires solving a system of congruences in the larger polynomial ring. So instead we choose $f(x) \mapsto f^*(x) = f(x) – x^Nf(x) = -f(x)(x^N – 1)$, which maps back down to $2f(x)$ in $\mathbb{Z}[x]/(x^{N} + 1)$. This preimage mapping has a particularly nice structure, in that you build it by repeating the polynomial’s coefficients twice and flipping the sign of the second half. It’s easy to see that the product $f^*(x) g^*(x)$ maps down to $4f(x)g(x)$.

So if we properly account for these extra constant factors floating around, our strategy to perform negacyclic polynomial multiplication is to map $f$ and $g$ up to the larger ring as described, compute their cyclic product (modulo $x^{2N} – 1$) using the FFT, and then the result should be a degree $2N-1$ polynomial which can be reduced with one more modular reduction step to the right degree $N-1$ negacyclic product, i.e., setting $x^N = -1$, which materializes as taking the second half of the coefficients, flipping their signs, and adding them to the corresponding coefficients in the first half.

The code for this is:

def negacyclic_polymul_preimage_and_map_back(p1, p2):
    p1_preprocessed = numpy.concatenate([p1, -p1])
    p2_preprocessed = numpy.concatenate([p2, -p2])
    product = fft(p1_preprocessed) * fft(p2_preprocessed)
    inverted = ifft(product)
    rounded = numpy.round(numpy.real(inverted)).astype(p1.dtype)
    return (rounded[: p1.shape[0]] - rounded[p1.shape[0] :]) // 4

However, this chosen mapping hides another clever trick. The product of the two preimages has enough structure that we can “read” the result off without doing the full “set $x^N = -1$” reduction step. Mapping $f$ and $g$ up to $f^*, g^*$ and taking their product modulo $(x^{2N} – 1)$ gives

\[ \begin{aligned} f^*g^* &= -f(x^N-1) \cdot -g(x^N – 1) \\ &= fg (x^N-1)^2 \\ &= fg(x^{2N} – 2x^N + 1) \\ &= fg(2 – 2x^N) \\ &= 2(fg – x^Nfg) \end{aligned} \]

This has the same syntactical format as the original mapping $f \mapsto f – x^Nf$, with an extra factor of 2, and so its coefficients also have the form “repeat the coefficients and flip the sign of the second half” (times two). We can then do the “inverse mapping” by reading only the first half of the coefficients and dividing by 2.

def negacyclic_polymul_use_special_preimage(p1, p2):
    p1_preprocessed = numpy.concatenate([p1, -p1])
    p2_preprocessed = numpy.concatenate([p2, -p2])
    product = fft(p1_preprocessed) * fft(p2_preprocessed)
    inverted = ifft(product)
    rounded = numpy.round(0.5 * numpy.real(inverted)).astype(p1.dtype)
    return rounded[: p1.shape[0]]

Our chosen mapping $f \mapsto f-x^Nf$ is not particularly special, except that it uses a small number of pre and post-processing operations. For example, if you instead used the mapping $f \mapsto 2f + x^Nf$ (which would map back to $f$ exactly), then the FFT product would result in $5fg + 4x^Nfg$ in the larger ring. You can still read off the coefficients as before, but you’d have to divide by 5 instead of 2 (which, the superstitious would say, is harder). It seems that “double and negate” followed by “halve and take first half” is the least amount of pre/post processing possible.

Negacyclic polynomials with a “twist”

The previous section identified a nice mapping (or embedding) of the input polynomials into a larger ring. But studying that shows some symmetric structure in the FFT output. I.e., the coefficients of $f$ and $g$ are repeated twice, with some scaling factors. It also involves taking an FFT of two $2N$-dimensional vectors when we start from two $N$-dimensional vectors.

This sort of situation should make you think that we can do this more efficiently, either by using a smaller size FFT or by packing some data into the complex part of the input, and indeed we can do both.

[Aside: it’s well known that if all the entries of an FFT input are real, then the result also has symmetry that can be exploted for efficiency by reframing the problem as a size-N/2 FFT in some cases, and just removing half the FFT algorithm’s steps in other cases, see Wikipedia for more]

This technique was explained in Fast multiplication and its applications (pdf link) by Daniel Bernstein, a prominent cryptographer who specializes in cryptography performance, and whose work appears in widely-used standards like TLS, OpenSSH, and he designed a commonly used elliptic curve for cryptography.

[Aside: Bernstein cites this technique as using something called the “Tangent FFT (pdf link).” This is a drop-in FFT replacement he invented that is faster than previous best (split-radix FFT), and Bernstein uses it mainly to give a precise expression for the number of operations required to do the multiplication end to end. We will continue to use the numpy FFT implementation, since in this article I’m just focusing on how to express negacyclic multiplication in terms of the FFT. Also worth noting both the Tangent FFT and “Fast multiplication” papers frame their techniques—including FFT algorithm implementations!—in terms of polynomial ring factorizations and mappings. Be still, my beating cardioid.]

In terms of polynomial mappings, we start from the ring $\mathbb{R}[x] / (x^N + 1)$, where $N$ is a power of 2. We then pick a reversible mapping from $\mathbb{R}[x]/(x^N + 1) \to \mathbb{C}[x]/(x^{N/2} – 1)$ (note the field change from real to complex), apply the FFT to the image of the mapping, and reverse appropriately it at the end.

One such mapping takes two steps, first mapping $\mathbb{R}[x]/(x^N + 1) \to \mathbb{C}[x]/(x^{N/2} – i)$ and then from $\mathbb{C}[x]/(x^{N/2} – i) \to \mathbb{C}[x]/(x^{N/2} – 1)$. The first mapping is as easy as the last section, because $(x^N + 1) = (x^{N/2} + i) (x^{N/2} – i)$, and so we can just set $x^{N/2} = i$ and reduce the polynomial. This as the effect of making the second half of the polynomial’s coefficients become the complex part of the first half of the coefficients.

The second mapping is more nuanced, because we’re not just reducing via factorization. And we can’t just map $i \mapsto 1$ generically, because that would reduce complex numbers down to real values. Instead, we observe that (momentarily using an arbitrary degree $k$ instead of $N/2$), for any polynomial $f \in \mathbb{C}[x]$, the remainder of $f \mod x^k-i$ uniquely determines the remainder of $f \mod x^k – 1$ via the change of variables $x \mapsto \omega_{4k} x$, where $\omega_{4k}$ is a $4k$-th primitive root of unity $\omega_{4k} = e^{\frac{2 \pi i}{4k}}$. Spelling this out in more detail: if $f(x) \in \mathbb{C}[x]$ has remainder $f(x) = g(x) + h(x)(x^k – i)$ for some polynomial $h(x)$, then

\[ \begin{aligned} f(\omega_{4k}x) &= g(\omega_{4k}x) + h(\omega_{4k}x)((\omega_{4k}x)^{k} – i) \\ &= g(\omega_{4k}x) + h(\omega_{4k}x)(e^{\frac{\pi i}{2}} x^k – i) \\ &= g(\omega_{4k}x) + i h(\omega_{4k}x)(x^k – 1) \\ &= g(\omega_{4k}x) \mod (x^k – 1) \end{aligned} \]

Translating this back to $k=N/2$, the mapping from $\mathbb{C}[x]/(x^{N/2} – i) \to \mathbb{C}[x]/(x^{N/2} – 1)$ is $f(x) \mapsto f(\omega_{2N}x)$. And if $f = f_0 + f_1x + \dots + f_{N/2 – 1}x^{N/2 – 1}$, then the mapping involves multiplying each coefficient $f_k$ by $\omega_{2N}^k$.

When you view polynomials as if they were a simple vector of their coefficients, then this operation $f(x) \mapsto f(\omega_{k}x)$ looks like $(a_0, a_1, \dots, a_n) \mapsto (a_0, \omega_{k} a_1, \dots, \omega_k^n a_n)$. Bernstein calls the operation a twist of $\mathbb{C}^n$, which I mused about in this Mathstodon thread.

What’s most important here is that each of these transformations are invertible. The first because the top half coefficients end up in the complex parts of the polynomial, and the second because the mapping $f(x) \mapsto f(\omega_{2N}^{-1}x)$ is an inverse. Together, this makes the preprocessing and postprocessing exact inverses of each other. The code is then

def negacyclic_polymul_complex_twist(p1, p2):
    n = p2.shape[0]
    primitive_root = primitive_nth_root(2 * n)
    root_powers = primitive_root ** numpy.arange(n // 2)

    p1_preprocessed = (p1[: n // 2] + 1j * p1[n // 2 :]) * root_powers
    p2_preprocessed = (p2[: n // 2] + 1j * p2[n // 2 :]) * root_powers

    p1_ft = fft(p1_preprocessed)
    p2_ft = fft(p2_preprocessed)
    prod = p1_ft * p2_ft
    ifft_prod = ifft(prod)
    ifft_rotated = ifft_prod * primitive_root ** numpy.arange(0, -n // 2, -1)

    return numpy.round(
        numpy.concatenate([numpy.real(ifft_rotated), numpy.imag(ifft_rotated)])
    ).astype(p1.dtype)

And so, at the cost of a bit more pre- and postprocessing, we can negacyclically multiply two degree $N-1$ polynomials using an FFT of length $N/2$. In theory, no information is wasted and this is optimal.

And finally, a simple matrix multiplication

The last technique I wanted to share is not based on the FFT, but it’s another method for doing negacyclic polynomial multiplication that has come in handy in situations where I am unable to use FFTs. I call it the Toeplitz method, because one of the polynomials is converted to a Toeplitz matrix. Sometimes I hear it referred to as a circulant matrix technique, but due to the negacyclic sign flip, I don’t think it’s a fully accurate term.

The idea is to put the coefficients of one polynomial $f(x) = f_0 + f_1x + \dots + f_{N-1}x^{N-1}$ into a matrix as follows:

\[ \begin{pmatrix} f_0 & -f_{N-1} & \dots & -f_1 \\ f_1 & f_0 & \dots & -f_2 \\ \vdots & \vdots & \ddots & \vdots \\ f_{N-1} & f_{N-2} & \dots & f_0 \end{pmatrix} \]

The polynomial coefficients are written down in the first column unchanged, then in each subsequent column, the coefficients are cyclically shifted down one, and the term that wraps around the top has its sign flipped. When the second polynomial is treated as a vector of its coefficients, say, $g(x) = g_0 + g_1x + \dots + g_{N-1}x^{N-1}$, then the matrix-vector product computes their negacyclic product (as a vector of coefficients):

\[ \begin{pmatrix} f_0 & -f_{N-1} & \dots & -f_1 \\ f_1 & f_0 & \dots & -f_2 \\ \vdots & \vdots & \ddots & \vdots \\ f_{N-1} & f_{N-2} & \dots & f_0 \end{pmatrix} \begin{pmatrix} g_0 \\ g_1 \\ \vdots \\ g_{N-1} \end{pmatrix} \]

This works because each row $j$ corresponds to one output term $x^j$, and the cyclic shift for that row accounts for the degree-wrapping, with the sign flip accounting for the negacyclic part. (If there were no sign attached, this method could be used to compute a cyclic polynomial product).

The Python code for this is

def cylic_matrix(c: numpy.array) -> numpy.ndarray:
    """Generates a cyclic matrix with each row of the input shifted.

    For input: [1, 2, 3], generates the following matrix:

        [[1 2 3]
         [2 3 1]
         [3 1 2]]
    """
    c = numpy.asarray(c).ravel()
    a, b = numpy.ogrid[0 : len(c), 0 : -len(c) : -1]
    indx = a + b
    return c[indx]


def negacyclic_polymul_toeplitz(p1, p2):
    n = len(p1)

    # Generates a sign matrix with 1s below the diagonal and -1 above.
    up_tri = numpy.tril(numpy.ones((n, n), dtype=int), 0)
    low_tri = numpy.triu(numpy.ones((n, n), dtype=int), 1) * -1
    sign_matrix = up_tri + low_tri

    cyclic_matrix = cylic_matrix(p1)
    toeplitz_p1 = sign_matrix * cyclic_matrix
    return numpy.matmul(toeplitz_p1, p2)

Obviously on most hardware this would be less efficient than an FFT-based method (and there is some relationship between circulant matrices and Fourier Transforms, see Wikipedia). But in some cases—when the polynomials are small, or one of the two polynomials is static, or a particular hardware choice doesn’t handle FFTs with high-precision floats very well, or you want to take advantage of natural parallelism in the matrix-vector product—this method can be useful. It’s also simpler to reason about.

Until next time!

Polynomial Multiplication Using the FFT

Problem: Compute the product of two polynomials efficiently.

Solution:

import numpy
from numpy.fft import fft, ifft


def poly_mul(p1, p2):
    """Multiply two polynomials.

    p1 and p2 are arrays of coefficients in degree-increasing order.
    """
    deg1 = p1.shape[0] - 1
    deg2 = p1.shape[0] - 1
    # Would be 2*(deg1 + deg2) + 1, but the next-power-of-2 handles the +1
    total_num_pts = 2 * (deg1 + deg2)
    next_power_of_2 = 1 << (total_num_pts - 1).bit_length()

    ff_p1 = fft(numpy.pad(p1, (0, next_power_of_2 - p1.shape[0])))
    ff_p2 = fft(numpy.pad(p2, (0, next_power_of_2 - p2.shape[0])))
    product = ff_p1 * ff_p2
    inverted = ifft(product)
    rounded = numpy.round(numpy.real(inverted)).astype(numpy.int32)
    return numpy.trim_zeros(rounded, trim='b')

Discussion: The Fourier Transform has a lot of applications to science, and I’ve covered it on this blog before, see the Signal Processing section of Main Content. But it also has applications to fast computational mathematics.

The naive algorithm for multiplying two polynomials is the “grade-school” algorithm most readers will already be familiar with (see e.g., this page), but for large polynomials that algorithm is slow. It requires $O(n^2)$ arithmetic operations to multiply two polynomials of degree $n$.

This short tip shows a different approach, which is based on the idea of polynomial interpolation. As a side note, I show the basic theory of polynomial interpolation in chapter 2 of my book, A Programmer’s Introduction to Mathematics, along with an application to cryptography called “Secret Sharing.”

The core idea is that given $n+1$ distinct evaluations of a polynomial $p(x)$ (i.e., points $(x, p(x))$ with different $x$ inputs), you can reconstruct the coefficients of $p(x)$ exactly. And if you have two such point sets for two different polynomials $p(x), q(x)$, a valid point set of the product $(pq)(x)$ is the product of the points that have the same $x$ inputs.

\[ \begin{aligned} p(x) &= \{ (x_0, p(x_0)), (x_1, p(x_1)), \dots, (x_n, p(x_n)) \} \\ q(x) &= \{ (x_0, q(x_0)), (x_1, q(x_1)), \dots, (x_n, q(x_n)) \} \\ (pq)(x) &= \{ (x_0, p(x_0)q(x_0)), (x_1, p(x_1)q(x_1)), \dots, (x_n, p(x_n)q(x_n)) \} \end{aligned} \]

The above uses $=$ loosely to represent that the polynomial $p$ can be represented by the point set on the right hand side.

So given two polynomials $p(x), q(x)$ in their coefficient forms, one can first convert them to their point forms, multiply the points, and then reconstruct the resulting product.

The problem is that the two conversions, both to and from the coefficient form, are inefficient for arbitrary choices of points $x_0, \dots, x_n$. The trick comes from choosing special points, in such a way that the intermediate values computed in the conversion steps can be reused. This is where the Fourier Transform comes in: choose $x_0 = \omega_{N}$, the complex-N-th root of unity, and $x_k = \omega_N^k$ as its exponents. $N$ is required to be large enough so that $\omega_N$’s exponents have at least $2n+1$ distinct values required for interpolating a degree-at-most-$2n$ polynomial, and because we’re doing the Fourier Transform, it will naturally be “the next largest power of 2” bigger than the degree of the product polynomial.

Then one has to observe that, by its very formula, the Fourier Transform is exactly the evaluation of a polynomial at the powers of the $N$-th root of unity! In formulas: if $a = (a_0, \dots, a_{n-1})$ is a list of real numbers define $p_a(x) = a_0 + a_1x + \dots + a_{n-1}x^{n-1}$. Then $\mathscr{F}(a)(k)$, the Fourier Transform of $a$ at index $k$, is equal to $p_a(\omega_n^k)$. These notes by Denis Pankratov have more details showing that the Fourier Transform formula is a polynomial evaluation (see Section 3), and this YouTube video by Reducible also has a nice exposition. This interpretation of the FT as polynomial evaluation seems to inspire quite a few additional techniques for computing the Fourier Transform that I plan to write about in the future.

The last step is to reconstruct the product polynomial from the product of the two point sets, but because the Fourier Transform is an invertible function (and linear, too), the inverse Fourier Transform does exactly that: given a list of the $n$ evaluations of a polynomial at $\omega_n^k, k=0, \dots, n-1$, return the coefficients of the polynomial.

This all fits together into the code above:

  1. Pad the input coefficient lists with zeros, so that the lists are a power of 2 and at least 1 more than the degree of the output product polynomial.
  2. Compute the FFT of the two padded polynomials.
  3. Multiply the result pointwise.
  4. Compute the IFFT of the product.
  5. Round the resulting (complex) values back to integers.

Hey, wait a minute! What about precision issues?

They are a problem when you have large numbers or large polynomials, because the intermediate values in steps 2-4 can lose precision due to the floating point math involved. This short note of Richard Fateman discusses some of those issues, and two paths forward include: deal with it somehow, or use an integer-exact analogue called the Number Theoretic Transform (which itself has issues I’ll discuss in a future, longer article).

Postscript: I’m not sure how widely this technique is used. It appears the NTL library uses the polynomial version of Karatsuba multiplication instead (though it implements FFT elsewhere). However, I know for sure that much software involved in doing fully homomorphic encryption rely on the FFT for performance reasons, and the ones that don’t instead use the NTT.

The Two-Dimensional Fourier Transform and Digital Watermarking

We’ve studied the Fourier transform quite a bit on this blog: with four primers and the Fast Fourier Transform algorithm under our belt, it’s about time we opened up our eyes to higher dimensions.

Indeed, in the decades since Cooley & Tukey’s landmark paper, the most interesting applications of the discrete Fourier transform have occurred in dimensions greater than 1. But for all our work we haven’t yet discussed what it means to take an “n-dimensional” Fourier transform. Our past toiling and troubling will pay off, though, because the higher Fourier transform and its 1-dimensional cousin are quite similar. Indeed, the shortest way to describe the $ n$-dimensional transform is as the 1-dimensional transform with inner products of vector variables replacing regular products of variables.

In this post we’ll flush out these details. We’ll define the multivariable Fourier transform and it’s discrete partner, implement an algorithm to compute it (FFT-style), and then apply the transform to the problem of digitally watermarking images.

As usual, all the code, images, and examples used in this post are available on this blog’s Github page.

Sweeping Some Details Under the Rug

We spent our first and second primers on Fourier analysis describing the Fourier series in one variable, and taking a limit of the period to get the Fourier transform in one variable. By all accounts, it was a downright mess of notation and symbol manipulation that culminated in the realization that the Fourier series looks a lot like a Riemann sum. So it was in one dimension, it is in arbitrary dimension, but to save our stamina for the applications we’re going to treat the $ n$-dimensional transform differently. We’ll use the 1-dimensional transform as a model, and magically generalize it to operate on a vector-valued variable. Then the reader will take it on faith that we could achieve the same end as a limit of some kind of multidimensional Fourier series (and all that nonsense with Schwarz functions and tempered distributions is left to the analysts), or if not we’ll provide external notes with the full details.

So we start with a real-valued (or complex-valued) function $ f : \mathbb{R}^n \to \mathbb{R}$, and we write the variable as $ x = (x_1, \dots, x_n)$, so that we can stick to using the notation $ f(x)$. Rather than think of the components of $ x$ as “time variables” as we did in the one-dimensional case, we’ll usually think of $ x$ as representing physical space. And so the periodic behavior of the function $ f$ represents periodicity in space. On the other hand our transformed variables will be “frequency” in space, and this will correspond to a vector variable $ \xi = (\xi_1, \dots, \xi_n)$. We’ll come back to what the heck “periodicity in space” means momentarily.

Remember that in one dimension the Fourier transform was defined by

$ \displaystyle \mathscr{F}f(s) = \int_{-\infty}^\infty e^{-2\pi ist}f(t) dt$.

And it’s inverse transform was

$ \displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^\infty e^{2\pi ist}f(s) ds$.

Indeed, with the vector $ x$ replacing $ t$ and $ \xi$ replacing $ s$, we have to figure out how to make an analogous definition. The obvious thing to do is to take the place where $ st$ is multiplied and replace it with the inner product of $ x$ and $ \xi$, which for this post I’ll write $ x \cdot \xi$ (usually I write $ \left \langle x, \xi \right \rangle$). This gives us the $ n$-dimensional transform

$ \displaystyle \mathscr{F}f(\xi) = \int_{\mathbb{R}^n} e^{-2\pi i x \cdot \xi}f(x) dx$,

and its inverse

$ \displaystyle \mathscr{F}^{-1}g(t) = \int_{\mathbb{R}^n} e^{2\pi i x \cdot \xi}f( \xi ) d \xi$

Note that the integral is over all of $ \mathbb{R}^n$. To give a clarifying example, if we are in two dimensions we can write everything out in coordinates: $ x = (x_1, x_2), \xi = (\xi_1, \xi_2)$, and the formula for the transform becomes

$ \displaystyle \mathscr{F}f(\xi_1, \xi_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-2 \pi i (x_1 \xi_1 + x_2 \xi_2)} f(\xi_1, \xi_2) dx_1 dx_2$.

Now that’s a nasty integral if I’ve ever seen one. But for our purposes in this post, this will be as nasty as it gets, for we’re primarily concerned with image analysis. So representing things as vectors of arbitrary dimension is more compact, and we don’t lose anything for it.

Periodicity in Space? It’s All Mostly the Same

Because arithmetic with vectors and arithmetic with numbers is so similar, it turns out that most of the properties of the 1-dimensional Fourier transform hold in arbitrary dimension. For example, the duality of the Fourier transform and its inverse holds, because for vectors $ e^{-2 \pi i x \cdot (-\xi)} = e^{2 \pi i x \cdot \xi}$. So just like in on dimension, we have

$ \mathscr{F}f(-\xi) = \mathscr{F}^{-1}f(\xi)$

And again we have correspondences between algebraic operations: convolution in the spatial domain corresponds to convolution in the frequency domain, the spectrum is symmetric about the origin, etc.

At a more geometric level, though, the Fourier transform does the same sort of thing as it did in the one-dimensional case. Again the complex exponentials form the building blocks of any function we want, and performing a Fourier transform on an $ n$-dimensional function decomposes that function into its frequency components. So a function that is perfectly periodic corresponds to a Fourier spectrum that’s perfectly concentrated at a point.

But what the hell, the reader might ask, is ‘periodicity in space’? Since we’re talking about images anyway, the variables we care about (the coordinates of a pixel) are spatial variables. You could, if you were so inclined, have a function of multiple time variables, and to mathematicians a physical interpretation of dimension is just that, an interpretation. But as confusing as it might sound, it’s actually not so hard to understand the Fourier transform when it’s specialized to image analysis. The idea is that complex exponentials $ e^{\pm 2 \pi i s \cdot \xi}$ oscillate in the $ x$ variable for a fixed $ \xi$ (and since $ \mathscr{F}$ has $ \xi$ as its input, we do want to fix $ \xi$). The brief mathematical analysis goes like this: if we fix $ \xi$ then the complex exponential is periodic with magnitudinal peaks along parallel lines spaced out at a distance of $ 1/ \left \| \xi \right \|$ apart. In particular any image is a sum of a bunch of these “complex exponential with a fixed $ \xi$” images that look like stripes with varying widths and orientations (what you see here is just the real part of a particular complex exponential).

Any image can be made from a sum of a whole lot of images like this one. This corresponds to a single point in the Fourier spectrum.

Any image can be made from a sum of a whole lot of images like the ones on top. They correspond to single points in the Fourier spectrum (and their symmetries), as on bottom.

What you see on top is an image, and on bottom its Fourier spectrum. That is, each brightly colored pixel corresponds to a point $ [x_1, x_2]$ with a large magnitude for that frequency component $ |\mathscr{F}f[x_1, x_2]|$.

It might be a bit surprising that every image can be constructed as a sum of stripey things, but so was it that any sound can be constructed as a sum of sines and cosines. It’s really just a statement about a basis of some vector space of functions. The long version of this story is laid out beautifully in pages 4 – 7 of these notes. The whole set of notes is wonderful, but this section is mathematically tidy and needs no background; the remainder of the notes outline the details about multidimensional Fourier series mentioned earlier, as well as a lot of other things. In higher dimensions the “parallel lines” idea is much the same, but with lines replaced by hyperplanes normal to the given vector.

Discretizing the Transform

Recall that for a continuous function $ f$ of one variable, we spent a bit of time figuring out how to find a good discrete approximation of $ f$, how to find a good discrete approximation of the Fourier transform $ \mathscr{F}f$, and how to find a quick way to transition between the two. In brief: $ f$ was approximated by a vector of samples $ (f[0], f[1], \dots, f[N])$, reconstructed the original function (which was only correct at the sampled points) and computed the Fourier transform of that, calling it the discrete Fourier transform, or DFT. We got to this definition, using square brackets to denote list indexing (or vector indexing, whatever):

Definition: Let $ f = (f[1], \dots f[N])$ be a vector in $ \mathbb{R}^N$. Then the discrete Fourier transform of $ f$ is defined by the vector $ (\mathscr{F}f[1], \dots, \mathscr{F}f[N])$, where

$ \displaystyle \mathscr{F}f[j] = \sum_{k=0}^{N-1} f[k]e^{-2 \pi i jk/N}$

Just as with the one-dimensional case, we can do the same analysis and arrive at a discrete approximation of an $ n$-dimensional function. Instead of a vector it would be an $ N \times N \times \dots \times N$ matrix, where there are $ n$ terms in the matrix, one for each variable. In two dimensions, this means the discrete approximation of a function is a matrix of samples taken at evenly-spaced intervals in both directions.

Sticking with two dimensions, the Fourier transform is then a linear operator taking matrices to matrices (which is called a tensor if you want to scare people). It has its own representation like the one above, where each term is a double sum. In terms of image analysis, we can imagine that each term in the sum requires us to look at every pixel of the original image

Definition: Let $ f = (f[s,t])$ be a vector in $ \mathbb{R}^N \times \mathbb{R}^M$, where $ s$ ranges from $ 0, \dots, N-1$ and $ t$ from $ 0, \dots, M-1$. Then the discrete Fourier transform of $ f$ is defined by the vector $ (\mathscr{F}f[s,t])$, where each entry is given by

$ \displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s, t] e^{-2 \pi i (s x_1 / N + t x_2 / M)}$

In the one-dimensional case the inverse transform had a sign change in the exponent and an extra $ 1/N$ normalization factor. Similarly, in two dimensions the inverse transform has a normalization factor of $ 1/NM$ (1 over the total number of samples). Again we use a capital $ F$ to denote the transformed version of $ f$. The higher dimensional transforms are analogous: you get $ n$ sums, one for each component, and the normalization factor is the inverse of the total number of samples.

$ \displaystyle \mathscr{F}^{-1}F[x_1, x_2] = \frac{1}{NM} \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s,t] e^{2 \pi i (sx_1 / N + tx_2 / M)}$

Unfortunately, the world of the DFT disagrees a lot on the choice of normalization factor. It turns out that all that really matters is that the exponent is negated in the inverse, and that the product of the constant terms on both the transform and its inverse is $ 1/NM$. So some people will normalize both the Fourier transform and its inverse by $ 1/ \sqrt{NM}$. The reason for this is that it makes the transform and its inverse more similar-looking (it’s just that, cosmetic). The choice of normalization isn’t particularly important for us, but beware: non-canonical choices are out there, and they do affect formulas by adding multiplicative constants.

The Fast Fourier Transform, Revisited

Now one might expect that there is another clever algorithm to drastically reduce the runtime of the 2-dimensional DFT, akin to the fast Fourier transform algorithm (FFT). But actually there is almost no additional insight required to understand the “fast” higher dimensional Fourier transform algorithm, because all the work was done for us in the one dimensional case.

All that we do is realize that each of the inner summations is a 1-dimensional DFT. That is, if we write the inner-most sum as a function of two parameters

$ \displaystyle g(s, x_2) = \sum_{t=0}^{M-1} f(s,t) e^{-2 \pi i (tx_2 / M)}$

then the 2-dimensional FFT is simply

$ \displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} g(s, x_2) e^{-2 \pi i (sx_1/N)}$

But now notice, that we can forget that $ g(s,x_2)$ was ever a separate, two-dimensional function. Indeed, since it only depends on the $ x_2$ parameter from out of the sum this is precisely the formula for a 1-dimensional DFT! And so if we want to compute the 2-dimensional DFT using the 1-dimensional FFT algorithm, we can compute the matrix of 1-dimensional DFT entries for all choices of $ s, x_2$ by fixing each value of $ s$ in turn and running FFT on the resulting “column” of values. If you followed the program from our last FFT post, then the only difficulty is in understanding how the data is shuffled around and which variables are fixed during the computation of the sub-DFT’s.

To remedy the confusion, we give an example. Say we have the following 3×3 matrix whose DFT we want to compute. Remember, these values are the sampled values of a 2-variable function.

$ \displaystyle \begin{pmatrix} f[0,0] & f[0,1] & f[0,2] \\ f[1,0] & f[1,1] & f[1,2] \\ f[2,0] & f[2,1] & f[2,2] \end{pmatrix}$

The first step in the algorithm is to fix a choice of row, $ s$, and compute the DFT of the resulting row. So let’s fix $ s = 0$, and then we have the resulting row

$ \displaystyle f_0 = (f[0,0], f[0,1], f[0,2])$

It’s DFT is computed (intentionally using the same notation as the inner summation above), as

$ \displaystyle g[0,x_2] = (\mathscr{F}f_0)[x_2] = \sum_{t=0}^{M-1} f_0[t] e^{- 2 \pi i (t x_2 / M)}$

Note that $ f_0[t] = f[s,t]$ for our fixed choice of $ s=0$. And so if we do this for all $ N$ rows (all 3 rows, in this example), we’ll have performed $ N$ FFT’s of size $ M$ to get a matrix of values

$ \displaystyle \begin{pmatrix} g[0,0] & g[0,1] & g[0,2] \\ g[1,0] & g[1,1] & g[1,2] \\ g[2,0] & g[2,1] & g[2,2] \end{pmatrix}$

Now we want to compute the rest of the 2-dimensional DFT to the end, and it’s easy: now each column consists of the terms in the outermost sum above (since $ s$ is the iterating variable). So if we fix a value of $ x_2$, say $ x_2 = 1$, we get the resulting column

$ \displaystyle g_1 = (g[0, 1], g[1,1], g[2,1])$

and computing a DFT on this row gives

$ \displaystyle \mathscr{F}f[x_1, 1] = \sum_{s=0}^{N-1} g_1[s] e^{-2 \pi i sx_1 / N}$.

Expanding the definition of $ g$ as a DFT gets us back to the original formula for the 2-dimensional DFT, so we know we did it right. In the end we get a matrix of the computed DFT values for all $ x_1, x_2$.

Let’s analyze the runtime of this algorithm: in the first round of DFT’s we computed $ N$ DFT’s of size $ M$, requiring a total of $ O(N M \log M)$, since we know FFT takes time $ O(M \log M)$ for a list of length $ M$. In the second round we did it the other way around, computing $ M$ DFT’s of size $ N$ each, giving a total of

$ O(NM \log M + NM \log N) = O(NM (\log N + \log M)) = O(NM \log (NM))$

In other words, if the size of the image is $ n = NM$, then we are achieving an $ O(n \log n)$-time algorithm, which was precisely the speedup that the FFT algorithm gave us for one-dimension. We also know a lower bound on this problem: we can’t do better than $ NM$ since we have to look at every pixel at least once. So we know that we’re only a logarithmic factor away from a trivial lower bound. And indeed, all other known DFT algorithms have the same runtime. Without any assumptions on the input data (or any parallelization), nobody knows of a faster algorithm.

Now let’s turn to the code. If we use our FFT algorithm from last time, the pure Python one (read: very slow), then we can implement the 2D Fourier transform in just two lines of Python code. Full disclosure: we left out some numpy stuff in this code for readability. You can view the entire source file on this blog’s Github page.

def fft2d(matrix):
   fftRows = [fft(row) for row in matrix]
   return transpose([fft(row) for row in transpose(fftRows)])

And we can test it on a simple matrix with one nonzero value in it:

A = [[0,0,0,0], [0,1,0,0], [0,0,0,0], [0,0,0,0]]
for row in fft2d(A):
   print(', '.join(['%.3f + %.3fi' % (x.real, x.imag) for x in row]))

The output is (reformatted in LaTeX, obviously):

$ \displaystyle \begin{pmatrix} 1 & -i & -1 & i \\ -i & -1 & i & 1 \\ -1 & i & 1 & -i \\ i & 1 & -i & -1 \end{pmatrix}$

The reader can verify by hand that this is correct (there’s only one nonzero term in the double sum, so it just boils down to figuring out the complex exponential $ e^{2 \pi i (x_1 + x_2 / 4)}$). We leave it as an additional exercise to the reader to implement the inverse transform, as well as to generalize this algorithm to higher dimensional DFTs.

Some Experiments and Animations

As we did with the 1-dimensional FFT, we’re now going to switch to using an industry-strength FFT algorithm for the applications. We’ll be using the numpy library and its “fft2” function, along with scipy’s ndimage module for image manipulation. Getting all of this set up was a nightmare (thank goodness for people who guide users like me through this stuff, but even then the headache seemed unending!). As usual, all of the code and images used in the making of this post is available on this blog’s Github page.

And so we can start playing with a sample image, a still from one of my favorite television shows:

sherlock

The Fourier transform of this image (after we convert it to grayscale) can be computed in python:

def fourierSpectrumExample(filename):
   A = ndimage.imread(filename, flatten=True)
   unshiftedfft = numpy.fft.fft2(A)
   spectrum = numpy.log10(numpy.absolute(unshiftedfft) + numpy.ones(A.shape))
   misc.imsave(&quot;%s-spectrum-unshifted.png&quot; % (filename.split('.')[0]), spectrum)

With the result:

sherlock-spectrum-unshifted

The Fourier spectrum of Sherlock and Watson (and London).

A few notes: we use the ndimage library to load the image and flatten the colors to grayscale. Then, after we compute the spectrum, we shift and take a logarithm. This is because the raw spectrum values are too massive; plotting them without modification makes the image contrast too high.

Something is odd, though, because the brightest regions are on the edges of the image, where we might expect the highest-frequency elements to be. Actually, it turns out that a raw DFT (as computed by numpy, anyhow) is “shifted.” That is, the indices are much like they were in our original FFT post, so that the “center” of the spectrum (the lowest frequency component) is actually in the corner of the image array.

The numpy folks have a special function designed to alleviate this called fftshift. Applying it before we plot the image gives the following spectrum:

sherlock-spectrum

Now that’s more like it. For more details on what’s going on with shifting and how to use the shifting functions, see this matlab thread. (As a side note, the “smudges” in this image are interesting. We wonder what property of the original image contributes to the smudges)

Shifted or unshifted, this image represents the frequency spectrum of the image. In other words, we could take the inverse DFT of each pixel (and its symmetric partner) of this image separately, add them all together, and get back to our original image! We did just that using a different image (one of size 266 x 189, requiring a mere 25137 frequency components), to produce this video of the process:

Many thanks to James Hance for his relentlessly cheerful art (I have a reddish version of this particular masterpiece on my bedroom wall).

For the interested reader, I followed this youtube video’s recommended workflow to make the time-lapsed movie, along with some additional steps to make the videos play side by side. It took quite a while to generate and process the images, and the frames take up a lot of space. So instead of storing all the frames, the interested reader may find the script used to generate the frames on this blog’s Github page (along with all of the rest of the code used in this blog post).

Digital Watermarking

Now we turn to the main application of Fourier transforms to this post, the task of adding an invisible digital watermark to an image. Just in case the reader lives in a cave, a watermark is a security device used to protect the ownership or authenticity of a particular good. Usually they’re used on money to prevent counterfeits, but they’re often applied to high-resolution images on the web to protect copyrights. But perhaps more than just protect existing copyrights, watermarks as they’re used today are ugly, and mostly prevent people from taking the image (paid for or not) in the first place. Here’s an example from a big proponent of ugly watermarks, Shutterstock.com.

stock-photo

Now if you were the business of copyright litigation, you’d make a lot of money by suing people who took your clients’ images without permission. So rather than prevent people from stealing in the first place, you could put in an invisible watermark into all of your images and then crawl the web looking for stolen images with your watermark. It would be easy enough to automate (Google already did most of the work for you, if you just want to use Google’s search by image feature).

Now I’m more on the side of Fair Use For All, so I wouldn’t hope for a company to actually implement this and make using the internet that much scarier of a place. But the idea makes for an interesting thought experiment and blog post. The idea is simply to modify the spectrum of an image by adding in small, artificial frequency components. That is, the watermarked image will look identical to the original image to a human, but the Fourier spectrum will contain suspicious entries that we can extract if we know where to look.

Implementing the watermarking feature is quite easy, so let’s do that first. Let’s work again with James Hance’s fine artwork.

hance-up-sw-bw

Let’s call our image’s pixel matrix $ A$ and say we’re working with grayscale images for simplicity (for color, we just do the same thing to all three color channels). Then we can define a watermark matrix $ W$ by the following procedure:

  1. Pick a radius $ r$, a length $ L$, a watermark strength $ \alpha$, and a secret key $ k$.
  2. Using $ k$ as a seed to a random number generator, define a random binary vector $ v$ of length $ L$.
  3. Pick a subset $ S$ of the circle of coordinates centered at the image’s center of radius $ r$, chosen or rejected based on the entries of $ v$.
  4. Let $ W$ be the matrix of all zeros (of the same dimension as $ A$ with 1’s in the entries of $ S$.
  5. Compute the watermarked image as $ \mathscr{F}^{-1}(\mathscr{F}(A) + \alpha W)$. That is, compute the DFT of $ A$, add $ \alpha W$ to it, and then compute the inverse Fourier transform of the result.

The code for this is simple enough. To create a random vector:

import random
def randomVector(seed, length):
   random.seed(secretKey)
   return [random.choice([0,1]) for _ in range(length)]

To make the watermark (and flush out all of the technical details of how it’s done:

def makeWatermark(imageShape, radius, secretKey, vectorLength=50):
   watermark = numpy.zeros(imageShape)
   center = (int(imageShape[0] / 2) + 1, int(imageShape[1] / 2) + 1)

   vector = randomVector(secretKey, vectorLength)

   x = lambda t: center[0] + int(radius * math.cos(t * 2 * math.pi / vectorLength))
   y = lambda t: center[1] + int(radius * math.sin(t * 2 * math.pi / vectorLength))
   indices = [(x(t), y(t)) for t in range(vectorLength)]

   for i,location in enumerate(indices):
      watermark[location] = vector[i]

   return watermark

We use the usual parameterization of the circle as $ t \mapsto (\cos(2 \pi t / n), \sin(2 \pi t / n)$ scaled to the appropriate radius. Here’s what the watermark looks like as a spectrum:

watermark-spectrum

It’s hard to see the individual pixels, so click it to enlarge.

And then applying a given watermark to an image is super simple.

def applyWatermark(imageMatrix, watermarkMatrix, alpha):
   shiftedDFT = fftshift(fft2(imageMatrix))
   watermarkedDFT = shiftedDFT + alpha * watermarkMatrix
   watermarkedImage = ifft2(ifftshift(watermarkedDFT))

   return watermarkedImage

And that’s all there is to it! One might wonder how the choice of $ \alpha$ affects the intensity of the watermark, and indeed here we show a few example values of this method applied to Hance’s piece:

Click to enlarge. It appears that it's not until alpha becomes egregiously large that we notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background).

Click to enlarge. The effects are most visible in the rightmost image where alpha = 1,000,000

It appears that it’s not until $ \alpha$ becomes egregiously large (over 10,000) that we visibly notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background). But it’s good to keep in mind the range of acceptable values when designing a decoding mechanism.

Indeed, a decoding mechanism is conceptually much messier; it’s the art to the encoding mechanism’s science. This paper details one possible way to do it, which is essentially to scale everything up or down to 512×512 pixels and try circles of every possible radius until you find one (or don’t) which is statistically similar to the your random vector. And note that since we have the secret key we can generate the exact same random vector. So what the author of that paper suggests is to extract each circle of pixels from the Fourier spectrum, treating it as a single vector with first entry at angle 0. Then you do some statistical magic (compute cross-correlation or some other similarity measure) between the extracted pixels and your secret-key-generated random vector. If they’re sufficiently similar, then you’ve found your watermark, and otherwise there’s no watermark present.

The code required to do this only requires a few extra lines that aren’t present in the code we’re already presented in this article (numpy does cross-correlation for you), so we leave it as an exercise to the reader: write a program that determines if an image contains our watermark, and test the algorithm on various $ \alpha$ and with modifications of the image like rotation, scaling, cropping, and jpeg compression. Part of the benefit of Fourier-based techniques is the resilience of the spectrum to mild applications of these transformations.

Next time we’ll use the Fourier transform to do other cool things to images, like designing filters and combining images in interesting ways.

Until then!

The Fast Fourier Transform

John Tukey, one of the developers of the Cooley-Tukey FFT algorithm.

It’s often said that the Age of Information began on August 17, 1964 with the publication of Cooley and Tukey’s paper, “An Algorithm for the Machine Calculation of Complex Fourier Series.” They published a landmark algorithm which has since been called the Fast Fourier Transform algorithm, and has spawned countless variations. Specifically, it improved the best known computational bound on the discrete Fourier transform from $ O(n^2)$ to $ O(n \log n)$, which is the difference between uselessness and panacea.

Indeed, their work was revolutionary because so much of our current daily lives depends on efficient signal processing. Digital audio and video, graphics, mobile phones, radar and sonar, satellite transmissions, weather forecasting, economics and medicine all use the Fast Fourier Transform algorithm in a crucial way. (Not to mention that electronic circuits wouldn’t exist without Fourier analysis in general.) Before the Fast Fourier Transform algorithm was public knowledge, it simply wasn’t feasible to process digital signals.

Amusingly, Cooley and Tukey’s particular algorithm was known to Gauss around 1800 in a slightly different context; he simply didn’t find it interesting enough to publish, even though it predated the earliest work on Fourier analysis by Joseph Fourier himself.

In this post we will derive and implement a Fast Fourier Transform algorithm, and explore a (perhaps naive) application to audio processing. In particular, we will remove white noise from a sound clip by filtering the frequency spectrum of a noisy signal.

As usual, all of the resources used for this post are available on this blog’s Github page.

Derivation

It turns out that there are a number of different ways to speed up the naive Fourier Transform computation. As we saw in our primer on the discrete Fourier transform, the transform itself can be represented as a matrix. With a bit of nontrivial analysis, one can factor the Fourier transform matrix into a product of two sparse matrices (i.e., which contain mostly zeros), and from there the operations one can skip are evident. The intuitive reason why this should work is that the Fourier transform matrix is very structured, and the complex exponential has many useful properties. For more information on this method, see these lecture notes (p. 286).

We will take a different route which historically precedes the matrix factorization method. In fact, the derivation we’ll trace through below was Cooley and Tukey’s original algorithm. The method itself is a classical instance of a divide and conquer algorithm. Familiar programmers would recognize the ideas in common sorting algorithms: both mergesort and quicksort are divide and conquer algorithms.

The difficulty here is to split a list of numbers into two lists which are half in size in such a way that the Fourier transforms of the smaller pieces can be quickly combined to form the Fourier transform of the original list. Once we accomplish that, the rest of the algorithm falls out from recursion; further splitting each piece into two smaller pieces, we will eventually reach lists of length two or one (in which case the Fourier transform is completely trivial).

For now, we’ll assume that the length of the list is a power of 2. That is,

$ f = (f[1], f[2], \dots, f[n]), n = 2^k$ for some $ k$.

We will also introduce the somewhat helpful notation for complex exponentials. Specifically, $ \omega[p,q] = e^{2\pi i q/p}$. Here the $ p$ will represent the length of the list, and $ q$ will be related to the index. In particular, the complex exponential that usually shows up in the discrete Fourier transform (again, refer to our primer) is $ \omega[n, -km] = e^{-2 \pi i mk/n}$. We write the discrete Fourier transform of $ f$ by specifying its $ m$-th entry:

$ \displaystyle \mathscr{F}f[m] = \sum_{k=0}^{n-1} f[k]\omega[n, -km]$

Now the trick here is to split up the sum into two lists each of half size. Of course, one could split it up in many different ways, but after we split it we need to be able to rewrite the pieces as discrete Fourier transforms themselves. It turns out that if we split it into the terms with even indices and the terms with odd indices, things work out nicely.

Specifically, we can write the sum above as the two sums below. The reader should study this for a moment (or better yet, try to figure out what it should be without looking) to ensure that the indices all line up. The notation grows thicker ahead, so it’s good practice.

$ \displaystyle \mathscr{F}f[m] = \sum_{k=0}^{\frac{n}{2} – 1} f[2k] \omega[n,-2km] + \sum_{k=0}^{\frac{n}{2} – 1} f[2k+1]\omega[n,-(2k+1)m]$.

Now we need to rewrite these pieces as Fourier transforms. Indeed, we must replace the occurrences of $ n$ in the complex exponentials with $ n/2$. This is straightforward in the first summation, since

$ \omega[n, -2km] = e^{-2 \pi i km/n} = \omega[\frac{n}{2}, -km]$.

For the second summation, we observe that

$ \displaystyle \omega[n, -(2k+1)m] = \omega[n, -2km] \cdot \omega[n,-m]$.

Now if we factor out the $ \omega[n,-m]$, we can transform the second sum in the same way as the first, but with that additional factor out front. In other words, we now have

$ \displaystyle \mathscr{F}f[m] = \sum_{k=0}^{\frac{n}{2}-1} f[2k] \omega[n/2, -km] + \omega[n, -m] \sum_{k=0}^{\frac{n}{2}-1}f[2k+1] \omega[n/2, -km]$.

If we denote the list of even-indexed entries of $ f$ by $ f_{\textup{even}}$, and vice versa for $ f_{\textup{odd}}$, we see that this is just a combination of two Fourier transforms of length $ n/2$. That is,

$ \displaystyle \mathscr{F}f = \mathscr{F}f_{\textup{even}} + \omega[n,-m] \mathscr{F}f_{\textup{odd}}$

But we have a big problem here. Computer scientists will recognize this as a type error. The thing on the left hand side of the equation is a list of length $ n$, while the thing on the right hand side is a sum of two lists of length $ n/2$, and so it has length $ n/2$. Certainly it is still true for values of $ m$ which are less than $ n/2$; we broke no algebraic laws in the way we rewrote the sum (just in the use of the $ \mathscr{F}$ notation).

Recalling our primer on the discrete Fourier transform, we naturally extended the signals involved to be periodic. So indeed, the length-$ n/2$ Fourier transforms satisfy the following identity for each $ m$.

$ \mathscr{F}f_{\textup{even}}[m] = \mathscr{F}f_{\textup{even}}[m+n/2] \\ \mathscr{F}f_{\textup{odd}}[m] = \mathscr{F}f_{\textup{odd}}[m+n/2]$

However, this does not mean we can use the same formula above! Indeed, for a length $ n$ Fourier transform, it is not true in general that $ \mathscr{F}f[m] = \mathscr{F}f[m+n/2]$. But the correct formula is close to it. Plugging in $ m + n/2$ to the summations above, we have

$ \displaystyle \mathscr{F}f[m+n/2] = \sum_{k=0}^{\frac{n}{2} – 1}f[2k] \omega[n/2, -(m+n/2)k] + \\ \omega[n, -(m+n/2)] \sum_{k=0}^{\frac{n}{2} – 1}f[2k+1] \omega[n/2, -(m+n/2)k]$

Now we can use the easy-to-prove identity

$ \displaystyle \omega[n/2, -(m+n/2)k] = \omega[n/2, -mk] \omega[n/2, -kn/2]$

And see that the right-hand-term is simply $ e^{2 \pi i k} = 1$. Similarly, one can trivially prove the identity

$ \displaystyle \omega[n, -(m+n/2)] = \omega[n, -m] \omega[n, -n/2] = -\omega[n, -m]$,

this simplifying the massive formula above to the more familiar

$ \displaystyle \mathscr{F}f[m + n/2] = \sum_k f[2k]\omega[n/2, -mk] – \omega[n, -m] \sum_k f[2k+1] \omega[n/2, -mk]$

Now finally, we have the Fast Fourier Transform algorithm expressed recursively as:

$ \displaystyle \mathscr{F}f[m] = \mathscr{F}f_{\textup{even}}[m] + \omega[n,-m] \mathscr{F}f_{\textup{odd}}[m]$
$ \displaystyle \mathscr{F}f[m+n/2] = \mathscr{F}f_{\textup{even}}[m] – \omega[n,-m] \mathscr{F}f_{\textup{odd}}[m]$

With the base case being $ \mathscr{F}([a]) = [a]$.

In Python

With all of that notation out of the way, the implementation is quite short. First, we should mention a few details about complex numbers in Python. Much to this author’s chagrin, Python represents $ i$ using the symbol 1j. That is, in python, $ a + bi$ is

a + b * 1j

Further, Python reserves a special library for complex numbers, the cmath library. So we implement the omega function above as follows.

import cmath

def omega(p, q):
   return cmath.exp((2.0 * cmath.pi * 1j * q) / p)

And then the Fast Fourier Transform algorithm is more or less a straightforward translation of the mathematics above:

def fft(signal):
   n = len(signal)
   if n == 1:
      return signal
   else:
      Feven = fft([signal[i] for i in xrange(0, n, 2)])
      Fodd = fft([signal[i] for i in xrange(1, n, 2)])

      combined = [0] * n
      for m in xrange(n/2):
         combined[m] = Feven[m] + omega(n, -m) * Fodd[m]
         combined[m + n/2] = Feven[m] - omega(n, -m) * Fodd[m]

      return combined

Here I use the awkward variable names “Feven” and “Fodd” to be consistent with the notation above. Note that this implementation is highly non-optimized. There are many ways to improve the code, most notably using a different language and cacheing the complex exponential computations. In any event, the above code is quite readable, and a capable programmer could easily speed this up by orders of magnitude.

Of course, we should test this function on at least some of the discrete signals we investigated in our primer. And indeed, it functions correctly on the unshifted delta signal

>>> fft([1,0,0,0])
[(1+0j), (1+0j), (1+0j), (1+0j)]
>>> fft([1,0,0,0,0,0,0,0])
[(1+0j), (1+0j), (1+0j), (1+0j), (1+0j), (1+0j), (1+0j), (1+0j)]

As well as the shifted verion (up to floating point roundoff error)

>>> fft([0,1,0,0])
[(1+0j), 
 (6.123233995736766e-17-1j), 
 (-1+0j), 
 (-6.123233995736766e-17+1j)]
>>> fft([0,1,0,0,0,0,0,0])
[(1+0j), 
 (0.7071067811865476-0.7071067811865475j), 
 (6.123233995736766e-17-1j), 
 (-0.7071067811865475-0.7071067811865476j), 
 (-1+0j), 
 (-0.7071067811865476+0.7071067811865475j), 
 (-6.123233995736766e-17+1j), 
 (0.7071067811865475+0.7071067811865476j)]

And testing it on various other shifts gives further correct outputs. Finally, we test the Fourier transform of the discrete complex exponential, which the reader will recall is a scaled delta.

>>> w = cmath.exp(2 * cmath.pi * 1j / 8)
>>> w
(0.7071067811865476+0.7071067811865475j)
>>> d = 4
>>> fft([w**(k*d) for k in range(8)])
[                           1.7763568394002505e-15j, 
 (-7.357910944937894e-16  + 1.7763568394002503e-15j), 
 (-1.7763568394002505e-15 + 1.7763568394002503e-15j), 
 (-4.28850477329429e-15   + 1.7763568394002499e-15j), 
 (8                       - 1.2434497875801753e-14j), 
 (4.28850477329429e-15    + 1.7763568394002505e-15j), 
 (1.7763568394002505e-15  + 1.7763568394002505e-15j), 
 (7.357910944937894e-16   + 1.7763568394002505e-15j)]

Note that $ n = 8$ occurs in the position with index $ d=4$, and all of the other values are negligibly close to zero, as expected.

So that’s great! It works! Unfortunately it’s not quite there in terms of usability. In particular, we require our signal to have length a power of two, and most signals don’t happen to be that long. It turns out that this is a highly nontrivial issue, and all implementations of a discrete Fourier transform algorithm have to compensate for it in some way.

The easiest solution is to simply add zeros to the end of the signal until it is long enough, and just work with everything in powers of two. This technique (called zero-padding) is only reasonable if the signal in question is actually zero outside of the range it’s sampled from. Otherwise, subtle and important things can happen. We’ll leave further investigations to the reader, but the gist of the idea is that the Fourier transform assumes periodicity of the data one gives it, and so padding with zeros imposes a kind of periodicity that is simply nonexistent in the actual signal.

Now, of course not every Fast Fourier transform uses zero-padding. Unfortunately the techniques to evaluate a non-power-of-two Fourier transform are relatively complex, and beyond the scope of this post (though not beyond the scope of this blog). The interested reader should investigate the so-called Chirp Z-transform, as discovered by Rader, et al. in 1968. We may cover it here in the future.

In any case, the code above is a good starting point for any technique, and as usual it is available on this blog’s Github page. Finally, the inverse transform has a simple implementation, since it can be represented in terms of the forward transform (hint: remember duality!). We leave the code as an exercise to the reader.

Experiments with Sound — I am no Tree!

We’ll manipulate a clip of audio from Treebeard, a character from Lord of the Rings.

For the remainder of this post we’ll use a more established Fast Fourier Transform algorithm from the Python numpy library. The reasons for this are essentially convenience. Being implemented in C and brandishing the full might of the literature on Fourier transform algorithms, the numpy implementation is lightning fast. Now, note that the algorithm we implemented above is still correct (if one uses zero padding)! The skeptical reader may run the code to verify this. So we are justified in abandoning our implementation until we decide to seriously focus on improving its speed.

So let’s play with sound.

The sound clip we’ll be using is the following piece of dialog from the movie The Lord of the Rings: The Two Towers. We include the original wav file with the code on this blog’s Github page.

If we take the Fourier transform of the sound sample, we get the following plot of the frequency spectrum. Recall, the frequency spectrum is the graph of the norm of the frequency values.

The frequency spectrum of an Ent speaking. The x-axis represents the index in the list (larger values correspond to larger frequencies), and the y-axis corresponds to intensity (larger values correspond to that frequency having a greater presence in the original signal). There is symmetry about the ~80,000 index, and we may consider the right half ‘negative’ frequencies because the Fourier Transform is periodic modulo its length.

Here we note that there is a symmetry to the graph. This is not a coincidence: if the input signal is real-valued, it will always be the case that the Fourier transform is symmetric about its center value. The reason for this goes back to our first primer on the Fourier series, in that the negative coefficients were complex conjugates of the positive ones. In any event, we only need concern ourselves with the first half of the values.

We will omit the details for doing file I/O with audio. The interested reader can inspect our source code, and they will find a very basic use of the scikits.audiolab library, and matplotlib for plotting the frequency spectrum.

Now, just for fun, let’s tinker with this audio bite. First we’ll generate some random noise in the signal. That is, let’s mutate the input signal as follows:

import random
inputSignal = [x/2.0 + random.random()*0.1 for x in inputSignal]

This gives us a sound bite with an annoying fuzziness in the background:

Next, we will use the Fourier transform to remove some of this noise. Of course, since the noise is random it inhabits all the frequencies to some degree, so we can’t eliminate it entirely. Furthermore, as the original audio clip uses some of the higher frequencies, we will necessarily lose some information in the process. But in the real world we don’t have access to the original signal, so we should clean it up as best we can without losing too much information.

To do so, we can plot the frequencies for the noisy signal:

Comparing this with our original graph (cheating, yes, but the alternative is to guess and check until we arrive at the same answer), we see that the noise begins to dominate the spectrum at around the 20,000th index. So, we’ll just zero out all of those frequencies.

def frequencyFilter(signal):
   for i in range(20000, len(signal)-20000):
      signal[i] = 0

And voila! The resulting audio clip, while admittedly damaged, has only a small trace of noise:

Inspecting the source code, one should note that at one point we halve the amplitude of the input signal (i.e., in the time domain). The reason for this is if one arbitrarily tinkers with the frequencies of the Fourier transform, it can alter the amplitude of the original signal. As one quickly discovers, playing the resulting signal as a wav file can create an unpleasant crackling noise. The amplitudes are clipped (or wrapped around) by the audio software or hardware for a reason which is entirely mysterious to this author.

In any case, this is clearly not the optimal way (or even a good way) to reduce white noise in a signal. There are certainly better methods, but for the sake of time we will save them for future posts. The point is that we were able to implement and use the Fast Fourier Transform algorithm to analyze the discrete Fourier transform of a real-world signal, and manipulate it in logical ways. That’s quite the milestone considering where we began!

Next up in this series we’ll investigate more techniques on sound processing and other one-dimensional signals, and we’ll also derive a multi-dimensional Fourier transform so that we can start working with images and other two-dimensional signals.

Until then!