(Finite) Fields — A Primer

So far on this blog we’ve given some introductory notes on a few kinds of algebraic structures in mathematics (most notably groups and rings, but also monoids). Fields are the next natural step in the progression.

If the reader is comfortable with rings, then a field is extremely simple to describe: they’re just commutative rings with 0 and 1, where every nonzero element has a multiplicative inverse. We’ll give a list of all of the properties that go into this “simple” definition in a moment, but an even more simple way to describe a field is as a place where “arithmetic makes sense.” That is, you get operations for +,-, \cdot , / which satisfy the expected properties of addition, subtraction, multiplication, and division. So whatever the objects in your field are (and sometimes they are quite weird objects), they behave like usual numbers in a very concrete sense.

So here’s the official definition of a field. We call a set F a field if it is endowed with two binary operations addition (+) and multiplication (\cdot, or just symbol juxtaposition) that have the following properties:

  • There is an element we call 0 which is the identity for addition.
  • Addition is commutative and associative.
  • Every element a \in F has a corresponding additive inverse b (which may equal a) for which a + b = 0.

These three properties are just the axioms of a (commutative) group, so we continue:

  • There is an element we call 1 (distinct from 0) which is the identity for multiplication.
  • Multiplication is commutative and associative.
  • Every nonzero element a \in F has a corresponding multiplicative inverse b (which may equal a) for which ab = 1.
  • Addition and multiplication distribute across each other as we expect.

If we exclude the existence of multiplicative inverses, these properties make F a commutative ring, and so we have the following chain of inclusions that describes it all

\displaystyle \textup{Fields} \subset \textup{Commutative Rings} \subset \textup{Rings} \subset \textup{Commutative Groups} \subset \textup{Groups}

The standard examples of fields are the real numbers \mathbb{R}, the rationals \mathbb{Q}, and the complex numbers \mathbb{C}. But of course there are many many more. The first natural question to ask about fields is: what can they look like?

For example, can there be any finite fields? A field F which as a set has only finitely many elements?

As we saw in our studies of groups and rings, the answer is yes! The simplest example is the set of integers modulo some prime p. We call them \mathbb{Z} / p \mathbb{Z}, or sometimes just \mathbb{Z}/p for short, and let’s rederive what we know about them now.

As a set, \mathbb{Z}/p consists of the integers \left \{ 0, 1, \dots, p-1 \right \}. The addition and multiplication operations are easy to define, they’re just usual addition and multiplication followed by a modulus. That is, we add by a + b \mod p and multiply with ab \mod p. This thing is clearly a commutative ring (because the integers form a commutative ring), so to show this is a field we need to show that everything has a multiplicative inverse.

There is a nice fact that allows us to do this: an element a has an inverse if and only if the only way for it to divide zero is the trivial way 0a = 0. Here’s a proof. For one direction, suppose a divides zero nontrivially, that is there is some c \neq 0 with ac = 0. Then if a had an inverse b, then 0 = b(ac) = (ba)c = c, but that’s very embarrassing for c because it claimed to be nonzero. Now suppose a only divides zero in the trivial way. Then look at all possible ways to multiply a by other nonzero elements of F. No two can give you the same result because if ax = ay then (without using multiplicative inverses) a(x-y) = 0, but we know that a can only divide zero in the trivial way so x=y. In other words, the map “multiplication by a” is injective. Because the set of nonzero elements of F is finite you have to hit everything (the map is in fact a bijection), and some x will give you ax = 1.

Now let’s use this fact on \mathbb{Z}/p in the obvious way. Since p is a prime, there are no two smaller numbers a, b < p so that ab = p. But in \mathbb{Z}/p the number p is equivalent to zero (mod p)! So \mathbb{Z}/p has no nontrivial zero divisors, and so every element has an inverse, and so it’s a finite field with p elements.

The next question is obvious: can we get finite fields of other sizes? The answer turns out to be yes, but you can’t get finite fields of any size. Let’s see why.

Characteristics and Vector Spaces

Say you have a finite field k (lower-case k is the standard letter for a field, so let’s forget about F). Beacuse the field is finite, if you take 1 and keep adding it to itself you’ll eventually run out of field elements. That is, n = 1 + 1 + \dots + 1 = 0 at some point. How do I know it’s zero and doesn’t keep cycling never hitting zero? Well if at two points n = m \neq 0, then n-m = 0 is a time where you hit zero, contradicting the claim.

Now we define \textup{char}(k), the characteristic of k, to be the smallest n (sums of 1 with itself) for which n = 0. If there is no such n (this can happen if k is infinite, but doesn’t always happen for infinite fields), then we say the characteristic is zero. It would probably make more sense to say the characteristic is infinite, but that’s just the way it is. Of course, for finite fields the characteristic is always positive. So what can we say about this number? We have seen lots of example where it’s prime, but is it always prime? It turns out the answer is yes!

For if ab = n = \textup{char}(k) is composite, then by the minimality of n we get a,b \neq 0, but ab = n = 0. This can’t happen by our above observation, because being a zero divisor means you have no inverse! Contradiction, sucker.

But it might happen that there are elements of k that can’t be written as 1 + 1 + \dots + 1 for any number of terms. We’ll construct examples in a minute (in fact, we’ll classify all finite fields), but we already have a lot of information about what those fields might look like. Indeed, since every field has 1 in it, we just showed that every finite field contains a smaller field (a subfield) of all the ways to add 1 to itself. Since the characteristic is prime, the subfield is a copy of \mathbb{Z}/p for p = \textup{char}(k). We call this special subfield the prime subfield of k.

The relationship between the possible other elements of k and the prime subfield is very neat. Because think about it: if k is your field and F is your prime subfield, then the elements of k can interact with F just like any other field elements. But if we separate k from F (make a separate copy of F), and just think of k as having addition, then the relationship with F is that of a vector space! In fact, whenever you have two fields k \subset k', the latter has the structure of a vector space over the former.

Back to finite fields, k is a vector space over its prime subfield, and now we can impose all the power and might of linear algebra against it. What’s it’s dimension? Finite because k is a finite set! Call the dimension m, then we get a basis v_1, \dots, v_m. Then the crucial part: every element of k has a unique representation in terms of the basis. So they are expanded in the form

\displaystyle f_1v_1 + \dots + f_mv_m

where the f_i come from F. But now, since these are all just field operations, every possible choice for the f_i has to give you a different field element. And how many choices are there for the f_i? Each one has exactly |F| = \textup{char}(k) = p. And so by counting we get that k has p^m many elements.

This is getting exciting quickly, but we have to pace ourselves! This is a constraint on the possible size of a finite field, but can we realize it for all choices of p, m? The answer is again yes, and in the next section we’ll see how.  But reader be warned: the formal way to do it requires a little bit of familiarity with ideals in rings to understand the construction. I’ll try to avoid too much technical stuff, but if you don’t know what an ideal is, you should expect to get lost (it’s okay, that’s the nature of learning new math!).

Constructing All Finite Fields

Let’s describe a construction. Take a finite field k of characteristic p, and say you want to make a field of size p^m. What we need to do is construct a field extension, that is, find a bigger field containing k so that the vector space dimension of our new field over k is exactly m.

What you can do is first form the ring of polynomials with coefficients in k. This ring is usually denoted k[x], and it’s easy to check it’s a ring (polynomial addition and multiplication are defined in the usual way). Now if I were speaking to a mathematician I would say, “From here you take an irreducible monic polynomial p(x) of degree m, and quotient your ring by the principal ideal generated by p. The result is the field we want!”

In less compact terms, the idea is exactly the same as modular arithmetic on integers. Instead of doing arithmetic with integers modulo some prime (an irreducible integer), we’re doing arithmetic with polynomials modulo some irreducible polynomial p(x). Now you see the reason I used p for a polynomial, to highlight the parallel thought process. What I mean by “modulo a polynomial” is that you divide some element f in your ring by p as much as you can, until the degree of the remainder is smaller than the degree of p(x), and that’s the element of your quotient. The Euclidean algorithm guarantees that we can do this no matter what k is (in the formal parlance, k[x] is called a Euclidean domain for this very reason). In still other words, the “quotient structure” tells us that two polynomials f, g \in k[x] are considered to be the same in k[x] / p if and only if f - g is divisible by p. This is actually the same definition for \mathbb{Z}/p, with polynomials replacing numbers, and if you haven’t already you can start to imagine why people decided to study rings in general.

Let’s do a specific example to see what’s going on. Say we’re working with k = \mathbb{Z}/3 and we want to compute a field of size 27 = 3^3. First we need to find a monic irreducible polynomial of degree 3. For now, I just happen to know one: p(x) = x^3 - x + 1. In fact, we can check it’s irreducible, because to be reducible it would have to have a linear factor and hence a root in \mathbb{Z}/3. But it’s easy to see that if you compute p(0), p(1), p(2) and take (mod 3) you never get zero.

So I’m calling this new ring

\displaystyle \frac{\mathbb{Z}/3[x]}{(x^3 - x + 1)}

It happens to be a field, and we can argue it with a whole lot of ring theory. First, we know an irreducible element of this ring is also prime (because the ring is a unique factorization domain), and prime elements generate maximal ideals (because it’s a principal ideal domain), and if you quotient by a maximal ideal you get a field (true of all rings).

But if we want to avoid that kind of argument and just focus on this ring, we can explicitly construct inverses. Say you have a polynomial f(x), and for illustration purposes we’ll choose f(x) = x^4 + x^2 - 1. Now in the quotient ring we could do polynomial long division to find remainders, but another trick is just to notice that the quotient is equivalent to the condition that x^3 = x - 1. So we can reduce f(x) by applying this rule to x^4 = x^3 x to get

\displaystyle f(x) = x^2 + x(x-1) - 1 = 2x^2 - x - 1

Now what’s the inverse of f(x)? Well we need a polynomial g(x) = ax^2 + bx + c whose product with f gives us something which is equivalent to 1, after you reduce by x^3 - x + 1. A few minutes of algebra later and you’ll discover that this is equivalent to the following polynomial being identically 1

\displaystyle (a-b+2c)x^2 + (-3a+b-c)x + (a - 2b - 2c) = 1

In other words, we get a system of linear equations which we need to solve:

\displaystyle \begin{aligned} a & - & b & + & 2c & = 0 \\ -3a & + & b & - & c &= 0 \\ a & - & 2b & - & 2c &= 1 \end{aligned}

And from here you can solve with your favorite linear algebra techniques. This is a good exercise for working in fields, because you get to abuse the prime subfield being characteristic 3 to say terrifying things like -1 = 2 and 6b = 0. The end result is that the inverse polynomial is 2x^2 + x + 1, and if you were really determined you could write a program to compute these linear systems for any input polynomial and ensure they’re all solvable. We prefer the ring theoretic proof.

In any case, it’s clear that taking a polynomial ring like this and quotienting by a monic irreducible polynomial gives you a field. We just control the size of that field by choosing the degree of the irreducible polynomial to our satisfaction. And that’s how we get all finite fields!

One Last Word on Irreducible Polynomials

One thing we’ve avoided is the question of why irreducible monic polynomials exist of all possible degrees m over any \mathbb{Z}/p (and as a consequence we can actually construct finite fields of all possible sizes).

The answer requires a bit of group theory to prove this, but it turns out that the polynomial x^{p^m} - x has all degree m monic irreducible polynomials as factors. But perhaps a better question (for computer scientists) is how do we work over a finite field in practice? One way is to work with polynomial arithmetic as we described above, but this has some downsides: it requires us to compute these irreducible monic polynomials (which doesn’t sound so hard, maybe), to do polynomial long division every time we add, subtract, or multiply, and to compute inverses by solving a linear system.

But we can do better for some special finite fields, say where the characteristic is 2 (smells like binary) or we’re only looking at F_{p^2}. The benefit there is that we aren’t forced to use polynomials. We can come up with some other kind of structure (say, matrices of a special form) which happens to have the same field structure and makes computing operations relatively painless. We’ll see how this is done in the future, and see it applied to cryptography when we continue with our series on elliptic curve cryptography.

Until then!

The Discrete Fourier Transform — A Primer

So here we are. We have finally made it to a place where we can transition with confidence from the classical continuous Fourier transform to the discrete version, which is the foundation for applications of Fourier analysis to programming. Indeed, we are quite close to unfurling the might of the Fast Fourier Transform algorithm, which efficiently computes the discrete Fourier transform. But because of its focus on algorithmic techniques, we will save it for a main content post and instead focus here on the intuitive connections between the discrete and continuous realms.

The goal has roughly three parts:

  1. Find a reasonable discrete approximation to a continuous function.
  2. Find a reasonable discrete approximation to the Fourier transform of a continuous function.
  3. Find a way to transition between the two discrete representations.

We should also note that there will be some notational clashes in the sequel. Rigorously, item 3 will result in an operation which we will denote by \mathscr{F}, but will be distinct from the classical Fourier transform on continuous functions. Indeed, they will have algebraic similarities, but one operates on generalized functions, and the other on finite sequences. We will keep the distinction clear from context to avoid adding superfluous notation. Moreover, we will avoid the rigor from the last post on tempered distributions. Instead we simply assume all functions are understood to be distributions and use the classical notation. Again, this is safe because our dabbling with the classical transform will be solely for intuition.

Of course, the point of this entire post is that all of the facts we proved about the continuous Fourier transform have direct translations into the discrete case. Up to a constant factor (sometimes) and up to notational conventions, the two theories will be identical. This is part of the beauty of the subject; sit back and enjoy it.


There is a very nice theorem about classical Fourier transforms that has to do with reconstructing a function from a discrete sample of its points. Since we do not actually need this theorem for anything other than motivation, we will not prove it here. Instead, we’ll introduce the definitions needed to state it, and see why it motivates a good definition for the discrete “approximation” to a function. For a much more thorough treatment of the sampling theorem and the other issues we glaze over in this post, see these lecture notes.

Definition: A function f is time limited if it has bounded support. A function f is bandlimited if its Fourier transform has bounded support. That is, if there is some B so that the Fourier transform of f is only nonzero when |s|<B. We call B the bandwidth of f.

To be honest, before seeing a mathematical treatment of signal processing, this author had no idea what bandwidth actually referred to. It’s nice to finally solve those life mysteries.

In any case, it often occurs that one has a signal f for which one can only measure values, but one doesn’t have access to an exact description of f. The sampling theorem allows us to reconstruct f exactly by choosing certain sample points. In a simplified form, the theorem is as follows:

TheoremSuppose f is a function of bandwidth B. Then one can reconstruct f exactly from the set of points (k/2B, f(k/2B)) as k ranges over \mathbb{Z} (that is, the sampling rate is at least 1/2B).

Unsurprisingly, the proof uses the Fourier transform in a nontrivial way. Moreover, there is a similar theorem for the Fourier transform \mathscr{F}f, which can be reconstructed exactly from its samples if the sampling rate is at least 1/L, where L/2 bounds the support of f. Note that the sampling rate in one domain is determined by the limitations on the other domain.

What’s more, if we are daring enough to claim f is both time limited and bandlimited (and we sample as little as possible in each domain), then the number of sample points is the same for both domains. In particular, suppose f(t) is only nonzero on 0 \leq t \leq L and its Fourier transform on 0 \leq s \leq 2B. Note that since f is both timelimited and bandlimited, there is no fault in shifting both so their smallest value is at the origin. Then, if n, m are the respective numbers of sampled points in the time and frequency domains, then m/L = 2B, n/2B = L, and so m = n = 2BL.

Now this gives us a good reason to define the discrete approximation to a signal as

\displaystyle f_{\textup{sampled}} = (f(t_0), f(t_1), \dots, f(t_{n-1})),

where t_k = k/2B. This accomplishes our first task. Now, in order to determine what the discrete Fourier transform should look like, we can represent this discrete approximation as a distribution using shifted deltas:

\displaystyle f_{\textup{sampled}}(t) = \sum_{k=0}^{n-1}f(t_k)\delta(t-t_k)

Here the deltas ensure the function is zero at points not among the t_k, capturing the essence of the finite sequence above. Taking the Fourier transform of this (recalling that the Fourier transform of the shifted delta is a complex exponential and using linearity), we get

\displaystyle \mathscr{F}f_{\textup{sampled}}(s) = \sum_{k=0}^{n-1}f(t_k)e^{-2 \pi i s t_k}

Denote the above function by F. Now F is unfortunately still continuous, so we take the same approach we did for f and sample it via the samples in the frequency domain at s_j = j/L. This gives the following list of values for the discrete approximation to the transformed version of f:

\displaystyle F(s_0) = \sum_{k=0}^{n-1} f(t_k)e^{-2 \pi i s_0t_k}

\displaystyle F(s_1) = \sum_{k=0}^{n-1} f(t_k)e^{-2 \pi i s_1t_k}


\displaystyle F(s_n) = \sum_{k=0}^{n-1} f(t_k)e^{-2 \pi i s_nt_k}

And now the rest falls together. We can denote by (\mathscr{F}f)_{\textup{sampled}} the tuple of values F(s_j), and the list of formulas above gives a way to transition from one domain to the other.

A Discrete Life

Now we move completely away from the continuous realm. The above discussion was nice in giving us an intuition for why the following definitions are reasonable. However, in order to actually use these ideas on discrete sequences, we can’t be restricted to sampling continuous functions. In fact, most of our planned work on this blog will not use discrete approximations to continuous functions, but just plain old sequences. So we just need to redefine the above ideas in a purely discrete realm.

The first step is to get rid of any idea of the sampling rate, and instead write everything in terms of the indices. This boils down to the a convenient coincidence. If t_k = k/2B, s_j = j/L, then by our earlier remarks that 2BL = n (the number of sample points taken), then t_ks_j = kj/n. This allows us to rewrite the complex exponentials as e^{-2 \pi i kj/n}, which is entirely in terms of the number of points used and the indices of the summation.

In other words, we can finally boil all of this discussion down to the following definition. Before we get to it, we should note that we use the square bracket notation for lists. That is, if L is a list, then L[i] is the i-th element of that list. Usually one would use subscripts to denote this, but we’d like to stay close to our notation for the continuous case, at least to make the parallels between the two theories more obvious.

DefinitionLet f = (f[1], \dots f[n]) be a vector in \mathbb{R}^n. Then the discrete Fourier transform of f is defined by the vector (\mathscr{F}f[1], \dots, \mathscr{F}f[n]), where

\displaystyle \mathscr{F}f[j] = \sum_{k=0}^{n-1} f[k]e^{-2 \pi i jk/n}

The phrase “discrete Fourier transform” is often abbreviated to DFT.

Now although we want to showcase the connections between the discrete and continuous Fourier transforms, we should note that they are completely disjoint. If one so desired (and many books follow this route) one could start with the discrete Fourier transform and later introduce the continuous version. We have the advantage of brevity here, in that we know what the theorems should say before we actually prove them. The point is that while the two transforms have connections and similarities, their theory is largely independent, and moreover the definition of the Fourier transform can be given without any preface. Much as we claimed with the continuous Fourier transform, the discrete Fourier transform is simply an operation on lists of data (i.e., on vectors).

Pushing the discrete to the extreme, we can represent the list of complex exponentials as a discrete signal too.

Definition: The discrete complex exponential of period n is the list

\displaystyle \omega_n = (1, e^{2 \pi i/n}, \dots, e^{2 \pi i (n-1)/n}).

We will omit the subscript n when it is understood (at least, for the rest of this primer). And hence \omega[k] = e^{2 \pi i k/n} and moreover \omega^m[k] = \omega^k[m] = e^{2\pi imk/n}. If we denote by \omega^n the vector of entry-wise powers of \omega, then we can write the discrete Fourier transform in its most succinct form as:

\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{n-1} f[k]\omega^{-m}[k]

or, recognizing everything without indexing in “operator notation,”

\displaystyle \mathscr{F}f = \sum_{k=0}^{n-1}f\omega^{-m}.

There are other ways to represent the discrete Fourier transform as an operation. In fact, since we know the classical Fourier transform is a linear operator, we should be able to come up with a matrix representation of the discrete transform. We will do just this, and as a result we will find the inverse discrete Fourier transform, but first we should look at some examples.

The Transforms of Discrete Signals

Perhaps the simplest possible signal one could think of is the delta function, whose discretized form is

\displaystyle \delta = (1,0, \dots, 0)

with the corresponding shifted form

\displaystyle \delta_k = (0, 0, \dots 0, 1, 0, \dots 0)

where the 1 appears in the k-th spot. The reader should be convinced that this is indeed the right definition because it’s continuous cousin’s defining properties translate nicely. Specifically, \sum_{k}\delta_m[k] f[k] = f[m] for all m, and \sum_{k}\delta[k] = 1.

Now to find its Fourier transform, we simply use the definition:

\displaystyle \mathscr{F}\delta[m] = \sum_{k=0}^{n-1}\delta[k] \omega^{-m}[k]

The only time that \delta[k] is nonzero is for k=0, so this is simply the vector \delta[0] \omega^{0}, which is the vector consisting entirely of 1’s. Hence, as in the continuous case (or at least, for the continuous definition of the 1 distribution),

\displaystyle \mathscr{F}\delta = 1

In an identical manner, one can prove that \mathscr{F}\delta_k = \omega^{-k}, just as it was in for the continuous transform.

Now let’s do an example which deviates slightly from the continuous case. Recall that the continuous Fourier transform of the complex exponential was a delta function (to convince the reader, simply see that a single complex exponential can only have a single angular variable, and hence a singular frequency). In the discrete case, we get something similar.

\displaystyle \mathscr{F}\omega^{d} = \sum_{k=0}^{n-1} \omega^d\omega^{-m},

so looking at the m-th entry of this vector, we get

\displaystyle \mathscr{F}\omega^d[m] = \sum_{k=0}^{n-1} \omega^d[k] \omega^{-m}[k]

but because \omega^{-m}[k] = e^{-2 \pi i km/n} = \overline{\omega^m[k]}, we see that the sum is just the usual complex inner product \left \langle \omega^d, \omega^m \right \rangle. Further, as the differing powers of \omega are orthogonal (hint: their complex inner product forms a geometric series), we see that they’re only nonzero when d =m. In this case, it is easy to show that the inner product is precisely n. Summarizing,

This is naturally n \delta_d, so our final statement is just \mathscr{F}\omega^d = n\delta_d. Note that here we have the extra factor of n floating around. The reason for this boils down to the fact that the norm of the complex exponential \omega is \sqrt{n}, and not 1. That is, the powers of \omega do not form an orthonormal basis of \mathbb{C}^n. There are various alternative definitions that try to compensate for this, but to the best of this author’s knowledge a factor of n always shows up in some of the resulting formulas. In other words, we should just accept this deviation from the continuous theory as collateral damage.

The mischievous presence of n also shows up in an interesting way in the inverse discrete Fourier transform.

The DFT and its Inverse, as a Matrix

It’s a trivial exercise to check by hand that the discrete Fourier transform is a linear operation on vectors. i.e., \mathscr{F}(v_1 + v_2)[m] = \mathscr{F}v_1[m] + \mathscr{F}v_2[m] for all vectors v_1, v_2 and all m. As we know from our primer on linear algebra, all such mappings are expressible as matrix multiplication by a fixed matrix.

The “compact” form we represented the discrete Fourier transform above sheds light onto this question. Specifically, the formula

\displaystyle \mathscr{F}f[m] = \sum_{k=0}^{n-1}f[k]\omega^{-m}[k]

Is basically just the definition of a matrix multiplication. Viewing f as the column vector

\displaystyle \begin{pmatrix} f[0]\\ f[1]\\ \vdots\\ f[n] \end{pmatrix}

It is easy to see that the correct matrix to act on this vector is

A perhaps easier way to digest this matrix is by viewing each row of the matrix as the vector \omega^{-m}.

\displaystyle \mathscr{F} = \begin{pmatrix} \textbf{---} & \omega^0 & \textbf{---} \\ \textbf{---} & \omega^{-1} & \textbf{---} \\ & \vdots & \\ \textbf{---} & \omega^{-(n-1)} & \textbf{---} \end{pmatrix} 

Now let’s just admire this matrix for a moment. What started many primers ago as a calculus computation requiring infinite integrals and sometimes divergent answers is now trivial to compute. This is our first peek into how to compute discrete Fourier transforms in a program, but unfortunately we are well aware of the fact that computing this naively requires O(n^2) computations. Our future post on the Fast Fourier Transform algorithm will take advantage of the structure of this matrix to improve this by an order or magnitude to O(n \log n).

But for now, let’s find the inverse Fourier transform as a matrix. From the first of the two matrices above, it is easy to see that the matrix for \mathscr{F} is symmetric. Indeed, the roles of k,m are identical in e^{-2\pi i km/n}. Furthermore, looking at the second matrix, we see that \mathscr{F} is almost unitary. Recall that a matrix A is unitary if AA^* = I, where I is the identity matrix and A^* is the conjugate transpose of A. For the case of \mathscr{F}, we have that \mathscr{FF^*} = nI. We encourage the reader to work this out by hand, noting that each entry in the matrix resulting from \mathscr{FF^*} is an inner product of powers of \omega.

In other words, we have \mathscr{F}\cdot \frac{1}{n}\mathscr{F^*} = I, so that \mathscr{F}^{-1} = \frac{1}{n}\mathscr{F^*}. Since \mathscr{F} is symmetric, this simplifies to \frac{1}{n}\overline{\mathscr{F}}.

Expanding this out to a summation, we get what we might have guessed was the inverse transform:

\displaystyle \mathscr{F}^{-1}f[m] = \frac{1}{n} \sum_{k=0}^{n-1} f[k]\omega^m[k].

The only difference between this formula and our intuition is the factor of 1/n.


The last thing we’d like to mention in this primer is that the wonderful results on duality for the continuous Fourier transform translate to the discrete (again, with a factor of n). While we leave the details as an exercise to the reader,

In order to do this, we need some additional notation. We can think of \omega as an infinite sequence which would repeat itself every n entries (by Euler’s identity). Then we can “index” \omega higher than n, and get the identity \omega^m[k] = \omega[mk]. Similarly, we could “periodize” a discrete signal f so that f[m] is defined by f[m \mod n] and we allow m \in \mathbb{Z}.

This periodization allows us to define f^- naturally as f^-[m] = f[-m \mod n]. Then it becomes straightforward to check that \mathscr{F}f^- = (\mathscr{F}f)^-, and as a corollary \mathscr{FF}f = nf^-. This recovers some of our most powerful tools from the classical case for computing Fourier transforms of specific functions.

Next time (and this has been a long time coming) we’ll finally get to the computational meat of Fourier analysis. We’ll derive and implement the Fast Fourier Transform algorithm, which computes the discrete Fourier transform efficiently. Then we’ll take our first foray into playing with real signals, and move on to higher-dimensional transforms for image analysis.

Until then!

Inner Product Spaces – A Primer

This primer is a precursor to a post on eigenfaces, a technique for facial recognition. For a more basic introduction to linear algebra, see our first primer on the subject.


Vector spaces alone are not enough to do a lot of the interesting things we’d like them to do. Since a vector space is a generalization of Euclidean space, it is natural for us to investigate more specific types of vector spaces which are more akin to Euclidean space. In particular, we want to include the notion of a dot product. By admitting additional structure to a vector space, we may perform more computations, and hopefully get more interesting results.

Again, since we are developing mathematics for use in computing, all vector spaces in this post are finite-dimensional, and the field under question is always \mathbb{R} or \mathbb{C}, but usually the former. It suffices to recognize the vast amount of work and results in infinite-dimensional spaces, but we will not need it here.

Inner Product Spaces

The standard dot product operation in Euclidean space is defined as

\left \langle (x_1, \dots , x_n), (y_1, \dots, y_n) \right \rangle = \sum \limits_{i = 1}^n x_iy_i

So we take the dot product and generalize it to an operation on an arbitrary vector space.

Definition: An inner product on a vector space V over a field F is a function \left \langle \cdot, \cdot \right \rangle : V \times V \to F which satisfies the following properties for all x,y,z \in V, c \in F:

  • \left \langle x,y \right \rangle = \overline{ \left \langle y, x\right \rangle }, where the bar denotes complex conjugate. For real fields, this is just symmetry in the arguments.
  • \left \langle cx, y \right \rangle = c \left \langle x,y \right \rangle
  • \left \langle x+y, z \right \rangle = \left \langle x,z \right \rangle + \left \langle y,z \right \rangle
  • \left \langle x,x \right \rangle \geq 0 and \left \langle x,x \right \rangle = 0 if and only if x=0.

We recommend the novice reader invent some simple and wild operations on two vectors, and confirm or deny that they are inner products.

Notice that the second and third conditions imply that if the second argument of an inner product is fixed, then the resulting map V \to F is a linear map (since it maps vectors to the underlying field, it has a special name: a linear functional). We leave it as an exercise to the reader to investigate linearity facts about the map resulting from fixing the first argument (hint: things get conjugated).

We call a vector space with an associated inner product and inner product space. Of course, the most natural example of an inner product space is any Euclidean space \mathbb{R}^n with the dot product. However, there are many other interesting inner products, including ones which involve matrix multiplication, integration, and random variables. As interesting as they may be (and though the results we develop here hold for them), we have no need for them at this point in time. We will stick entirely to inner product spaces embedded in \mathbb{R}^n, and the standard dot product will suffice.

Now from any inner product we may induce a norm on V, which is colloquially the “distance” of a vector from the origin.

Definition: A norm on V is a function \left \| \cdot \right \| : V \to R which satisfies the following for all x,y \in V, c \in F:

  • \left \| x \right \| \geq 0, with equality if and only if x=0
  • \left \| cx \right \| = |c| \left \| x \right \|
  • \left \| x+y \right \| \leq \left \| x \right \| + \left \| y \right \| (the infamous triangle inequality)

If we recall the standard Euclidean norm, we see that it is just \left \| x \right \| = \sqrt{\left \langle x,x \right \rangle}. Indeed, for any inner product this definition satisfies the axioms of a norm, and so it is a natural generalization of “distance” between points in any inner product space.

In particular, those readers familiar with topology (or at least metric space analysis), will immediately recognize that a norm induces a metric on V, defined by d(x,y)= \left \| x-y \right \| , where of course x-y is the vector between x and y. Hence, every inner product space has a very rigid (metrized) topology and a fruitful geometry.

Additionally, any vector of norm 1 is called a unit vector, or a normal vector.

Now we look at vectors which have interesting relationships under inner products: specifically, when \left \langle x,y \right \rangle = 0.

Orthogonality and Orthonormal Bases

Definition: Two vectors x,y \in V are orthogonal if \left \langle x,y \right \rangle = 0. A set S of vectors is called orthogonal if the vectors in S are pairwise orthogonal.

Orthogonal vectors naturally generalize perpendicularity in Euclidean space. In particular, two vectors in \mathbb{R}^n are orthogonal if and only if the subspaces (lines) spanned by them are perpendicular.

The simplest examples of orthogonal vectors are the standard basis vectors (1,0, \dots, 0) \dots (0, \dots ,1), with respect to the usual dot product. Although, any scalar multiple of a basis vector \lambda e_i may replace e_i while still preserving the orthogonality of the set.

Orthogonality gives a new insight into the nature of inner products. Specifically, \left \langle x,y \right \rangle gives (almost) the length of the projection of x onto y. In other words, \left \langle x,y \right \rangle is the component of x that points in the direction of y (scaled by the length of y).

Now we may define projection maps to get “projection onto y” more faithfully than a plain inner product:

Definition: The projection of x onto y, denoted p_y(x), is the map V \to V defined by p_y(x) = \left \langle x, y \right \rangle \frac{y}{\left \| y \right \|^2}.

The reader may easily verify that the vectors p_y(x) and x-p_y(x) are indeed orthogonal, though the computation is awkward. This will come in handy later when we want to build orthogonal sets of vectors.

In addition, we may obviously decompose any vector v into its two orthogonal components with respect to another vector w via this projection: v = (v-p_w(v)) + p_w(v).

In addition to orthogonality, the standard basis vectors have norm 1. We call such a basis for an inner product space an orthonormal basis (a portmanteau of orthogonal and normal). We commonly denote an orthonormal set of vectors (e_1, \dots, e_n), perhaps a ritualistic attempt to summon the power of the standard basis vectors.

Given an orthonormal basis for an inner product space V (e_1, \dots, e_n), we may decompose any vector into its basis representation rather easily:

\displaystyle v = p_{e_1}(v) + \dots + p_{e_n}(v) = \left \langle v,e_1 \right \rangle e_1 + \dots + \left \langle v,e_n \right \rangle e_n,

Since the norm of each e_i is 1, we may skip the division by the square norm of e_i in the projection maps. Now, recalling that every vector can be written uniquely as a linear combination of the basis vectors, we see that the inner products \left \langle v, e_i \right \rangle are precisely those coefficients. The recognition of this fact leads us to an important theorem, with a necessary preliminary definition:

Definition: Two inner product spaces V, W are isometric if there exists a linear isomorphism f:V \to W which preserves the inner products in the respective spaces. i.e., \left \langle x, y \right \rangle_V = \left \langle f(x), f(y) \right \rangle_W for all x,y \in V.

Whereas, linear isomorphisms between vector spaces are the mathematically rigorous way of saying the two vector spaces are identical, isometric vector spaces give the “sameness” of the inner products. Hence, isometric vector spaces have equivalent metrics, topologies, and geometries, with merely a different choice of basis and representation of the vectors. Now, as we had that every finite-dimensional vector space was isomorphic to \mathbb{R}^n, we will soon see that every finite-dimensional inner product space is isometric to \mathbb{R}^n with the usual dot product.

Theorem: Any finite dimensional real inner product space V which has an orthonormal basis is isometric to \mathbb{R}^n with the usual dot product.

Proof: Define f: \mathbb{R}^n \to V by f((x_1, \dots, x_n)) = x_1e_1 + \dots + x_ne_n. Now, by computation, we see that

\left \langle f((a_1, \dots , a_n)), f((b_1, \dots , b_n)) \right \rangle
= \left \langle a_1e_1 + \dots + a_ne_n, b_1e_1 + \dots + b_ne_n \right \rangle
= \sum \limits_{i=1}^n a_i \left \langle e_i, b_1e_1 + \dots + b_ne_n \right \rangle
= \sum \limits_{i=1}^n \sum \limits_{j=1}^n a_i b_j \left \langle e_i, e_j \right \rangle
= a_1b_1 + \dots a_nb_n = \left \langle (a_1, \dots a_n), (b_1, \dots b_n) \right \rangle \square

Now, all we must prove is that every finite-dimensional inner product space has an orthonormal basis.

Theorem (Gram-Schmidt): Every basis of a finite-dimensional inner product space may be transformed into an orthonormal basis.

Proof: We do so by construction, using the previously introduced projection maps p_y(x). Given a basis (x_1, \dots , x_n), we compute the following algorithm:

  1. \displaystyle e_1 = \frac{x_1}{\left \| x_1 \right \|}
  2. For each i = 2, \dots , n:

    1. Let z_i = \sum \limits_{j=1}^{i-1} p_{e_j}(x_i)
    2. \displaystyle e_i = \frac{x_i - z_i}{\left \| x_i - z_i \right \|}

The reader may verify computationally that this process produces orthonormal vectors, but the argument is better phrased geometrically: each z_i is the projection of some new vector onto the subspace generated by all previously computed orthonormal vectors. By subtracting x_i - z_i, we take the part of x_i that is orthogonal to all vectors in that subspace. This is guaranteed to be nonzero because of the linear independence of our original list. And so, e_i is orthogonal to every vector preceding it in the algorithm (with indices j < i). Finally, we normalize each e_i to make them unit vectors, and we are done. \square

We note that this algorithm takes O(n^3), since we may compute the O(n^2) needed inner products ahead of time, and then there remains O(n) steps to compute each of the e_i.

This result now proves that every real finite-dimensional inner product space is isometric to \mathbb{R}^n. With this new insight, we may effectively do all our work in \mathbb{R}^n with the usual dot product, realizing that the results there hold for all relevant inner product spaces. In other words, our “simplification” at the beginning of this post, restricting our work to \mathbb{R}^n, was not at all a simplification. Proving statements about \mathbb{R}^n gives us equivalent statements about all real finite-dimensional inner product spaces. Wonderful!

Bases of Eigenvectors

There is one more important topic we wish to discuss here: the importance of bases of eigenvectors. In particular, given a linear operator A: V \to V, if one has a basis of eigenvectors for A, then A has a diagonal representation.

In particular, if A has a basis of eigenvectors (v_1, \dots , v_n), then the expansion of Av_i in terms of the basis vectors is just \lambda_i v_i, where \lambda_i is the corresponding eigenvalue. Thus, the matrix corresponding to A looks like:

\begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n  \end{pmatrix}

Here we count multiplicity of eigenvalues.

The existence of a diagonal representation for a matrix has interesting computational implications. For instance, we often wish to take high powers of a matrix, such as in counting paths in a graph, or working with graphical computations. Unfortunately, each successive matrix multiplication takes O(n^2) computations. If we wish to compute A^m, this takes O(mn^2) time. However, if the matrix has a diagonal representation, we may spend the O(n^2) it takes to convert the matrix to its diagonal form, take powers of the diagonal matrix by simply taking the powers of the diagonal entries, and then convert it back. Indeed, multiplying two diagonal matrices together is just as easy as multiplying the diagonal entries together, as the reader may verify. This optimization reduces computation to O(n^2 + mn) = O(nm), since we are assuming m is very large.

Of course, then we are left with the problem of quickly computing eigenvectors. What worries us even more is that we might not have a basis of eigenvectors (some matrices don’t have any!). We instead take a slightly different route, which serves our purposes better. Specifically, we will be using this information to compute eigenvectors of symmetric matrices (A = A^{\textup{T}}). For this, we refer to a grand theorem:

The Spectral Theorem: Every real symmetric matrix has an orthonormal basis consisting of eigenvectors.

The proof goes beyond the scope of this post (see: characteristic polynomials and self-adjoint operators), but it is very useful for us. In particular, by finding these eigenvectors we may both have a diagonal representation for our matrix, and also compute projections in a jiffy! We will see the astounding applications of this quite soon.

So Even with two primers on linear algebra, we have still only scratched the surface of this wonderful subject. In the future we may continue this series of primers by discussing the linear algebra inherent in many optimization problems. Be sure to look out for it.

Until next time!

Linear Algebra – A Primer

Story Time

Linear algebra was founded around the same time as Calculus (think Leibniz, circa 1700) solely for the purpose of solving general systems of linear equations. The coefficients of a system were written in a grid form, with rows corresponding to equations and columns to the unknown variables. Using a computational tool called the determinant (an awkward, but computable formula involving only the coefficients of the equations in a system), researchers were able to solve these systems, opening a world of information about the positions of celestial bodies and large-scale measurements (of geodesic arcs) on the surface of the earth.

By the 1850’s, Arthur Cayley was representing matrices as abstract objects. He defined matrix multiplication and nurtured matrix theory as its own field, recognizing a vast wealth of theoretical knowledge underlying the theory of determinants. Around turn of the century, a formal system of vector algebra was invented which relied heavily on interpreting matrices as so-called linear transformations. Linear transformations are intuitively those maps of everyday space (\mathbb{R}^n) which preserve “linear” things. Specifically, they send lines to lines, planes to planes, etc., and they preserve the origin (one which does not preserve the origin is very similar but has a different name; see Affine Transformation). Soon enough the mathematical focus shifted to the foundations of such an algebra, and later with the advent of computers to rapid calculations in one.


Linear algebra sits at the crossroads of many areas of mathematics. Keeping close to its roots, linear algebra is primarily a tool for computation. Unsurprisingly, a huge chunk of mathematical research has been solely to phrase things in terms of matrices and their associated linear transformations. For instance, an undirected graph on n vertices can be modeled as a matrix of integer entries, with the i,j entry containing the number of edges from vertex i to vertex j. This is called the adjacency matrix of a graph. Suddenly, a wealth of information about the graph translates to simple matrix computations. For instance, we can compute the number of paths from one vertex to another of length m as the appropriate entry of A^m. (more formally,these are walks, which are allowed to repeat edge traversals and visited vertices)

Even in advanced, purely theoretical mathematics, objects are commonly represented in terms of coordinates in some vector space, and are subsequently studied using all of the great things we know about linear transformations and their matrices. And so, without further ado, we will present the terminology and working concepts necessary for the content elsewhere in this blog.

Vector Spaces

The setting for all of linear algebra is in some vector space. Intuitively this is just a collection of objects, which we call vectors, with some rules on how you can combine vectors to get other vectors. This treatment wouldn’t do that idea justice without an axiomatic definition, so here it is.

Definition: A vector space is a quadruple (V, F, +, \cdot), where V is a set of vectors (points in our space), F is a scalar field (coefficients), +:V \times V \to V is a commutative, associative operation to combine vectors, and \cdot: F \times V \to V is an operation to “scale” vectors. In addition, we need the following properties to hold:

  • Addition and multiplication distribute (as we are used to with traditional algebra).
  • There must be an additive identity, which we call 0, giving 0 + v = v for all v \in V.
  • Every vector must have an additive inverse (every v has some w with v + w = 0).

This is a lot to swallow at first, but it is general for a good reason: there are tons of different kinds of vector spaces! Many of these are surprising and counter-intuitive. For our purposes, however, we may stick with the nice, small vector spaces. So here is a simplified definition that will suffice:

Definition: vector space is a set V of vectors which are fixed-length lists of real numbers (v_1, v_2, \dots , v_n) \in \mathbb{R}^n, where addition between vectors is componentwise, we may scale vectors by any real number, and the following properties hold:

  • Addition and multiplication distribute (as above).
  • (0,0,0, \dots, 0) is the additive identity.
  • (-v_1, -v_2, \dots , -v_n) is the unique additive inverse of (v_1, v_2, \dots , v_n).

Hopefully this is much more familiar to what we think of as “vectors,” and with the understanding that we are viewing it as a vector space, we just call it \mathbb{R}^n. The closure of operations gives us a nice way to characterize “any combination” of vectors in a vector space.

Definition: A linear combination of vectors in a vector space V is the vector

a_1v_1 + a_2v_2 + \dots + a_kv_k

for some positive integer k, scalars a_i, and vectors v_i.

We may speak of the span of a set of vectors as the set of all possible linear combinations of those vectors. Furthermore, we call a set of vectors linearly independent if no vector in the list is in the span of the others. For example, (1,0,0), (0,1,0), and (0,0,1) are linearly independent in \mathbb{R}^3. Specifically, (1,0,0) cannot be written as a(0,1,0) + b(0,0,1) = (0,a,b) for any scalars a,b \in F, and the other two vectors are similarly so.

As usual, we may describe subspaces of a vector space, which are just subsets of V which are themselves vector spaces with the inherited operations. The simplest examples of these are lines, planes, and hyperplanes through the origin in \mathbb{R}^n. Consequently, we may identify \mathbb{R}^n as a subspace of \mathbb{R}^m for any n \leq m.

One of the first things we want to ask about a vector space is “how big is it?” While most instances of vector spaces we will see have uncountably many elements, we can characterize “size” in terms of a different metric: the size of a basis.

Definition: A list of vectors (v_1, v_2, \dots v_n) is a basis for V if its elements are linearly independent, and their span is V. The dimension of a vector space is the length of any basis.

For \mathbb{R}^n, and similarly all finite-dimensional vector spaces, it is easy to prove that all bases have the same length, and hence dimension is well-defined. Further, \mathbb{R}^n admits a very natural basis, often called the standard basis:

e_1 = (1,0, \dots, 0)
e_2 = (0,1, \dots, 0)
e_n = (0,0, \dots, 1)

These are best visualized as the coordinate axes in \mathbb{R}^n, and it strokes our intuition as to what a basis should be, because any vector in \mathbb{R}^n can be broken down uniquely into a sum of scalar multiples of these unit coordinates. Indeed, this is true of any basis (due to linear independence). Given a fixed basis for V, every vector v \in V may be uniquely written as a linear combination of basis vectors.

Linear Transformations and their Matrix Representations

Moving quickly toward the heart of linear algebra, we may speak of linear transformations (interchangeably, linear maps) between two vector spaces:

Definition: A function f : V \to W is a linear map if it preserves the operations of addition and scalar multiplication. In other words, for all v, w \in V, c \in F, f(v+w) = f(v)+f(w) and f(cv) = cf(v).

Examples are bountiful; some geometrically inspired ones include rotations about the origin, shears, and scalings. These are functions you’d likely see in an image manipulation program like photoshop. From this we can prove a few basic facts, like that every linear map sends 0 to 0 and additive inverses to additive inverses (try it as an exercise).

One remarkable fact that helps us characterize linear maps is that every linear map is determined completely by what it does to a basis. Since every vector x \in V is a linear combination of basis elements, say x=a_1v_1 + \dots + a_nv_n, we see that a linear map plays nicely:

f(x) = f(a_1v_1 + \dots + a_nv_n) = a_1f(v_1) + \dots + a_nf(v_n)

In other words, if we know what f does to a basis, then we know everything about f. In order to aid our computations, we write what f does to each basis vector in a tabular form. To elaborate on the vague word “does,” we need to also fix a basis of our target vector space W, say (w_1, \dots , w_m), and describe each f(v_i) in terms of this basis. We write it in tabular form, as follows:

\begin{pmatrix} | & | & \mathbf{ } & | \\ f(v_1) & f(v_2) & \dots & f(v_n) \\ | & | & \mathbf{ } & | \end{pmatrix}

The jth column corresponds to f(v_j), and the ith row corresponds to the ith coefficient in the expansion of f(v_j) in terms of the basis for W. Here the vertical bars indicate that each element is a column of scalars. We will do an extended example to make this clear.

Consider the map f on \mathbb{R}^3 defined as (x,y,z) \mapsto (y,x,2z+y). It is easy to check this map is linear, and using the standard basis we see that

f(1,0,0) = (0,1,0),
f(0,1,0) = (1,0,1), and
f(0,0,1) = (0,0,2).


f(e_1) = e_2f(e_2) = e_1 + e_3, and f(e_3) = 2e_3.

Hence, the matrix representation of f with respect to the standard basis is

A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 2 \end{pmatrix}

Now we see that if we take a (column) vector x, and multiply it on the left by our matrix A, the resulting vector is precisely the coordinate representation of f(x) with respect to the basis for W. In fact, the rules for matrix multiplication were constructed very particularly so that this would be the case. In this way, we may arbitrarily switch between viewing f as a transformation and a vector computation. Compositions of linear maps translate to multiplication of two matrices, and matrix inversion (if it exists) is precisely function inversion.

Of course, there are many different bases we could have chosen. Even though we are going from \mathbb{R}^3 \to \mathbb{R}^3, the column basis could be different from the row basis. Fortunately for our purposes, we are not going to consider what basis is appropriate to choose. All that matters is that fixing a basis, the matrix representation of a linear map is unique, and so we may interchange the notation freely. Even so, the truly interesting things about matrices are those properties which are true no matter which basis we prefer to use.

Eigenvectors and Eigenvalues

Definition: A scalar \lambda \in F is an eigenvalue for the linear map A if there exists a non-zero vector v \in V with Av = \lambda v. Any such vector v which satisfies this equation is said to be an eigenvector of A corresponding to \lambda.

Eigenvectors and eigenvalues have a huge number of applications, including facial recognition software, geology, quantum mechanics, and web search. So being able to find them quickly is of great significance to researchers and engineers. What’s interesting is that while eigenvectors depend on a choice of basis, eigenvalues do not. We prove this now:

Proposition: If A and B are different representations of the same linear map, then any eigenvalue of B is an eigenvalue of A.

Proof. It turns out that the process of “changing a basis” can be boiled down to matrix multiplication. Specifically, if A and B are two different matrix representations of the same linear map, we have the existence of some invertible matrix P such that A = PBP^{-1}, or AP = PB. As a result, if v is an eigenvector for B corresponding to the eigenvalue \lambda, then for some APv = PBv = P \lambda v = \lambda Pv and so A(Pv) = \lambda(Pv), and Pv is an eigenvector for A corresponding to \lambda as well. This proves that eigenvalues are invariant with respect to a change of basis, as desired. \square

The point of this is that we can choose whatever basis we want to work with, and compute the eigenvalues where we’re most comfortable. For instance, if we choose a basis that gives the following diagonal representation,

A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}

then we can just eyeball that the eigenvalues are 1, 2, and 3. In fact, there are some very deep theorems in linear algebra that concern the existence and uniqueness of certain matrix representations. For a more in-depth treatment, see Axler, Linear Algebra Done Right. We will cover all the necessary information in the relevant posts, but until then, we are absolutely pooped from typing. Until next time!