# (Finite) Fields — A Primer

So far on this blog we’ve given some introductory notes on a few kinds of algebraic structures in mathematics (most notably groups and rings, but also monoids). Fields are the next natural step in the progression.

If the reader is comfortable with rings, then a field is extremely simple to describe: they’re just commutative rings with 0 and 1, where every nonzero element has a multiplicative inverse. We’ll give a list of all of the properties that go into this “simple” definition in a moment, but an even more simple way to describe a field is as a place where “arithmetic makes sense.” That is, you get operations for $+,-, \cdot , /$ which satisfy the expected properties of addition, subtraction, multiplication, and division. So whatever the objects in your field are (and sometimes they are quite weird objects), they behave like usual numbers in a very concrete sense.

So here’s the official definition of a field. We call a set $F$ a field if it is endowed with two binary operations addition ($+$) and multiplication ($\cdot$, or just symbol juxtaposition) that have the following properties:

• There is an element we call 0 which is the identity for addition.
• Addition is commutative and associative.
• Every element $a \in F$ has a corresponding additive inverse $b$ (which may equal $a$) for which $a + b = 0$.

These three properties are just the axioms of a (commutative) group, so we continue:

• There is an element we call 1 (distinct from 0) which is the identity for multiplication.
• Multiplication is commutative and associative.
• Every nonzero element $a \in F$ has a corresponding multiplicative inverse $b$ (which may equal $a$) for which $ab = 1$.
• Multiplication distributes over addition as we expect, i.e. $a(b + c) = ab + ac$.

If we exclude the existence of multiplicative inverses, these properties make $F$ a commutative ring, and so we have the following chain of inclusions that describes it all

$\displaystyle \textup{Fields} \subset \textup{Commutative Rings} \subset \textup{Rings} \subset \textup{Commutative Groups} \subset \textup{Groups}$

The standard examples of fields are the real numbers $\mathbb{R}$, the rationals $\mathbb{Q}$, and the complex numbers $\mathbb{C}$. But of course there are many many more. The first natural question to ask about fields is: what can they look like?

For example, can there be any finite fields? A field $F$ which as a set has only finitely many elements?

As we saw in our studies of groups and rings, the answer is yes! The simplest example is the set of integers modulo some prime $p$. We call them $\mathbb{Z} / p \mathbb{Z},$ or sometimes just $\mathbb{Z}/p$ for short, and let’s rederive what we know about them now.

As a set, $\mathbb{Z}/p$ consists of the integers $\left \{ 0, 1, \dots, p-1 \right \}$. The addition and multiplication operations are easy to define, they’re just usual addition and multiplication followed by a modulus. That is, we add by $a + b \mod p$ and multiply with $ab \mod p$. This thing is clearly a commutative ring (because the integers form a commutative ring), so to show this is a field we need to show that everything has a multiplicative inverse.

There is a nice fact that allows us to do this: in a finite commutative ring, a nonzero element $a$ has an inverse if and only if the only way for it to divide zero is the trivial way $0a = 0$. Here’s a proof. For one direction, suppose $a$ divides zero nontrivially, that is there is some $c \neq 0$ with $ac = 0$. Then if $a$ had an inverse $b$, we would get $0 = b(ac) = (ba)c = c$, but that’s very embarrassing for $c$ because it claimed to be nonzero. For the other direction, suppose $a$ only divides zero in the trivial way, and look at all possible ways to multiply $a$ by nonzero elements of the ring. Every such product is nonzero, and no two can give you the same result because if $ax = ay$ then (without using multiplicative inverses) $a(x-y) = 0$, but we know that $a$ can only divide zero in the trivial way so $x=y$. In other words, the map “multiplication by $a$” is an injective map from the set of nonzero elements to itself. Because that set is finite you have to hit everything (the map is in fact a bijection), and in particular some $x$ will give you $ax = 1$.

Now let’s use this fact on $\mathbb{Z}/p$ in the obvious way. Since $p$ is a prime, it can only divide a product $ab$ of integers if it divides $a$ or $b$, so there are no two numbers $0 < a, b < p$ whose product is a multiple of $p$. But in $\mathbb{Z}/p$ the multiples of $p$ are exactly the numbers equivalent to zero (mod $p$)! So $\mathbb{Z}/p$ has no nontrivial zero divisors, and so every nonzero element has an inverse, and so it’s a finite field with $p$ elements.
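To make the zero-divisor criterion concrete, here is a small brute-force sketch. The helper names `units` and `zero_divisors` are made up for this illustration; they simply try every product modulo $n$.

```python
def units(n):
    """Return {a: inverse of a} for every a in Z/n that has a multiplicative inverse."""
    inverses = {}
    for a in range(1, n):
        for b in range(1, n):
            if (a * b) % n == 1:
                inverses[a] = b
                break
    return inverses

def zero_divisors(n):
    """Return the nonzero a in Z/n for which some nonzero c gives a*c = 0 (mod n)."""
    return [a for a in range(1, n)
            if any((a * c) % n == 0 for c in range(1, n))]

print(units(7))          # every nonzero element of Z/7 shows up as a key
print(zero_divisors(7))  # empty, since 7 is prime
print(zero_divisors(6))  # nonempty, since 6 = 2 * 3
```

For a prime modulus the two lists are complementary in the extreme way: everything nonzero is a unit and nothing is a zero divisor. For a composite modulus like 6, the elements missing from `units(6)` are exactly the ones returned by `zero_divisors(6)`.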

The next question is obvious: can we get finite fields of other sizes? The answer turns out to be yes, but you can’t get finite fields of any size. Let’s see why.

## Characteristics and Vector Spaces

Say you have a finite field $k$ (lower-case k is the standard letter for a field, so let’s forget about $F$). Because the field is finite, if you take 1 and keep adding it to itself you’ll eventually run out of field elements. That is, $n = 1 + 1 + \dots + 1 = 0$ at some point. How do I know it’s zero and doesn’t keep cycling never hitting zero? Well if at two different points $n = m \neq 0$, then $n-m = 0$ is a time where you hit zero, contradicting the claim.

Now we define $\textup{char}(k)$, the characteristic of $k$, to be the smallest positive $n$ (sums of 1 with itself) for which $n = 0$. If there is no such $n$ (this can happen if $k$ is infinite, but doesn’t always happen for infinite fields), then we say the characteristic is zero. It would probably make more sense to say the characteristic is infinite, but that’s just the way it is. Of course, for finite fields the characteristic is always positive. So what can we say about this number? We have seen lots of examples where it’s prime, but is it always prime? It turns out the answer is yes!

For if $ab = n = \textup{char}(k)$ is composite, then by the minimality of $n$ we get $a,b \neq 0$, but $ab = n = 0$. This can’t happen by our above observation, because being a zero divisor means you have no inverse! Contradiction, sucker.

But it might happen that there are elements of $k$ that can’t be written as $1 + 1 + \dots + 1$ for any number of terms. We’ll construct examples in a minute (in fact, we’ll classify all finite fields), but we already have a lot of information about what those fields might look like. Indeed, since every field has 1 in it, we just showed that every finite field contains a smaller field (a subfield) of all the ways to add 1 to itself. Since the characteristic is prime, the subfield is a copy of $\mathbb{Z}/p$ for $p = \textup{char}(k)$. We call this special subfield the prime subfield of $k$.

The relationship between the possible other elements of $k$ and the prime subfield is very neat. Because think about it: if $k$ is your field and $F$ is your prime subfield, then the elements of $k$ can interact with $F$ just like any other field elements. But if we separate $k$ from $F$ (make a separate copy of $F$), and just think of $k$ as having addition, then the relationship with $F$ is that of a vector space! In fact, whenever you have two fields $k \subset k'$, the latter has the structure of a vector space over the former.

Back to finite fields, $k$ is a vector space over its prime subfield, and now we can impose all the power and might of linear algebra against it. What’s its dimension? Finite because $k$ is a finite set! Call the dimension $m$, then we get a basis $v_1, \dots, v_m$. Then the crucial part: every element of $k$ has a unique representation in terms of the basis. So they are expanded in the form

$\displaystyle f_1v_1 + \dots + f_mv_m$

where the $f_i$ come from $F$. But now, since these are all just field operations, every possible choice for the $f_i$ has to give you a different field element. And how many choices are there for the $f_i$? Each one has exactly $|F| = \textup{char}(k) = p$ possible values. And so by counting we get that $k$ has $p^m$ many elements.
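The counting argument is easy to mirror in code. This is only a sketch of the bookkeeping, not a construction of the field: `coordinate_vectors` is a hypothetical helper that lists every choice of coefficients $(f_1, \dots, f_m)$ with entries in $\{0, \dots, p-1\}$.

```python
from itertools import product

def coordinate_vectors(p, m):
    """All m-tuples (f_1, ..., f_m) with entries in Z/p = {0, ..., p-1}."""
    return list(product(range(p), repeat=m))

vectors = coordinate_vectors(3, 3)
print(len(vectors))  # the number of coefficient choices, which is p**m
```

Each tuple stands for one element $f_1v_1 + \dots + f_mv_m$, and uniqueness of the representation is what guarantees that different tuples are different field elements.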

This is getting exciting quickly, but we have to pace ourselves! This is a constraint on the possible size of a finite field, but can we realize it for all choices of $p, m$? The answer is again yes, and in the next section we’ll see how.  But reader be warned: the formal way to do it requires a little bit of familiarity with ideals in rings to understand the construction. I’ll try to avoid too much technical stuff, but if you don’t know what an ideal is, you should expect to get lost (it’s okay, that’s the nature of learning new math!).

## Constructing All Finite Fields

Let’s describe a construction. Take a finite field $k$ with $p$ elements (that is, $k = \mathbb{Z}/p$), and say you want to make a field of size $p^m$. What we need to do is construct a field extension, that is, find a bigger field containing $k$ so that the vector space dimension of our new field over $k$ is exactly $m$.

What you can do is first form the ring of polynomials with coefficients in $k$. This ring is usually denoted $k[x]$, and it’s easy to check it’s a ring (polynomial addition and multiplication are defined in the usual way). Now if I were speaking to a mathematician I would say, “From here you take an irreducible monic polynomial $p(x)$ of degree $m$, and quotient your ring by the principal ideal generated by $p$. The result is the field we want!”

In less compact terms, the idea is exactly the same as modular arithmetic on integers. Instead of doing arithmetic with integers modulo some prime (an irreducible integer), we’re doing arithmetic with polynomials modulo some irreducible polynomial $p(x)$. Now you see the reason I used $p$ for a polynomial, to highlight the parallel thought process. What I mean by “modulo a polynomial” is that you divide some element $f$ in your ring by $p$ as much as you can, until the degree of the remainder is smaller than the degree of $p(x)$, and that’s the element of your quotient. The Euclidean algorithm guarantees that we can do this no matter what $k$ is (in the formal parlance, $k[x]$ is called a Euclidean domain for this very reason). In still other words, the “quotient structure” tells us that two polynomials $f, g \in k[x]$ are considered to be the same in $k[x] / p$ if and only if $f - g$ is divisible by $p$. This is actually the same definition for $\mathbb{Z}/p$, with polynomials replacing numbers, and if you haven’t already you can start to imagine why people decided to study rings in general.

Let’s do a specific example to see what’s going on. Say we’re working with $k = \mathbb{Z}/3$ and we want to compute a field of size $27 = 3^3$. First we need to find a monic irreducible polynomial of degree $3$. For now, I just happen to know one: $p(x) = x^3 - x + 1$. In fact, we can check it’s irreducible, because to be reducible it would have to have a linear factor and hence a root in $\mathbb{Z}/3$. But it’s easy to see that if you compute $p(0), p(1), p(2)$ and take (mod 3) you never get zero.
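The root check is quick to do by machine as well. In this sketch a polynomial is a list of coefficients with the constant term first (a convention chosen here, not something from the text), and `evaluate` and `roots` are illustrative helper names. Keep in mind that "no roots implies irreducible" is only valid for degrees 2 and 3.

```python
def evaluate(coeffs, x, p):
    """Evaluate a polynomial at x modulo p; coeffs[i] is the coefficient of x**i."""
    return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p

def roots(coeffs, p):
    """All roots of the polynomial in Z/p, found by trying every element."""
    return [x for x in range(p) if evaluate(coeffs, x, p) == 0]

# x^3 - x + 1, lowest degree first
cubic = [1, -1, 0, 1]
print([evaluate(cubic, x, 3) for x in range(3)])  # values p(0), p(1), p(2) mod 3
print(roots(cubic, 3))                            # an empty list means no linear factor
```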

So I’m calling this new ring

$\displaystyle \frac{\mathbb{Z}/3[x]}{(x^3 - x + 1)}$

It happens to be a field, and we can argue it with a whole lot of ring theory. First, we know an irreducible element of this ring is also prime (because the ring is a unique factorization domain), and prime elements generate maximal ideals (because it’s a principal ideal domain), and if you quotient by a maximal ideal you get a field (true of all rings).

But if we want to avoid that kind of argument and just focus on this ring, we can explicitly construct inverses. Say you have a polynomial $f(x)$, and for illustration purposes we’ll choose $f(x) = x^4 + x^2 - 1$. Now in the quotient ring we could do polynomial long division to find remainders, but another trick is just to notice that the quotient is equivalent to the condition that $x^3 = x - 1$. So we can reduce $f(x)$ by applying this rule to $x^4 = x^3 x$ to get

$\displaystyle f(x) = x^2 + x(x-1) - 1 = 2x^2 - x - 1$
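The same reduction can be sketched in a few lines of code. Polynomials are again lists of coefficients with the constant term first, and `reduce_mod` is a hypothetical helper that repeatedly cancels the top-degree term using a monic modulus, which is exactly the rule $x^3 = x - 1$ applied until the degree drops below 3.

```python
P = 3  # characteristic of the base field Z/3

def reduce_mod(f, modulus, p=P):
    """Remainder of f modulo a monic polynomial, coefficients taken mod p.

    Polynomials are lists of coefficients, lowest degree first.
    """
    f = [c % p for c in f]
    d = len(modulus) - 1
    while len(f) > d:
        lead = f.pop()              # coefficient of the current top-degree term
        shift = len(f) - d          # lead * x^shift * modulus cancels that term
        for i in range(d):
            f[shift + i] = (f[shift + i] - lead * modulus[i]) % p
    return f + [0] * (d - len(f))   # pad so the result always has d coefficients

modulus = [1, -1, 0, 1]        # x^3 - x + 1
f = [-1, 0, 1, 0, 1]           # x^4 + x^2 - 1
print(reduce_mod(f, modulus))  # coefficients of the remainder, constant term first
```

The output uses representatives in $\{0, 1, 2\}$, so a coefficient of $-1$ is printed as $2$.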

Now what’s the inverse of $f(x)$? Well we need a polynomial $g(x) = ax^2 + bx + c$ whose product with $f$ gives us something which is equivalent to 1, after you reduce by $x^3 - x + 1$. A few minutes of algebra later and you’ll discover that this is equivalent to the following polynomial being identically 1

$\displaystyle (a-b+2c)x^2 + (-3a+b-c)x + (a - 2b - c) = 1$

In other words, we get a system of linear equations which we need to solve:

$\displaystyle \begin{aligned} a - b + 2c &= 0 \\ -3a + b - c &= 0 \\ a - 2b - c &= 1 \end{aligned}$

And from here you can solve with your favorite linear algebra techniques. This is a good exercise for working in fields, because you get to abuse the prime subfield being characteristic 3 to say terrifying things like $-1 = 2$ and $3a = 0$. The second equation gives $b = c$, the first then gives $a = -b$, and the third becomes $-b = 1$, so $b = c = 2$ and $a = 1$. The end result is that the inverse polynomial is $x^2 + 2x + 2$, and if you were really determined you could write a program to compute these linear systems for any input polynomial and ensure they’re all solvable. We prefer the ring theoretic proof.
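Since the field only has 27 elements, a lazier way to check an inverse is to try every candidate. The sketch below is not the linear-system approach described above; it swaps in a brute-force search, with `multiply` and `inverse` as illustrative names and elements stored as three coefficients, constant term first.

```python
from itertools import product

P = 3
MODULUS = [1, -1, 0, 1]  # x^3 - x + 1, lowest degree first

def multiply(f, g):
    """Multiply two elements of (Z/3)[x] / (x^3 - x + 1), given as 3 coefficients each."""
    prod = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            prod[i + j] = (prod[i + j] + a * b) % P
    d = len(MODULUS) - 1
    while len(prod) > d:            # cancel top-degree terms using the monic modulus
        lead = prod.pop()
        shift = len(prod) - d
        for i in range(d):
            prod[shift + i] = (prod[shift + i] - lead * MODULUS[i]) % P
    return prod

def inverse(f):
    """Brute-force search over all 27 elements for g with f * g = 1."""
    for g in product(range(P), repeat=3):
        if multiply(f, list(g)) == [1, 0, 0]:
            return list(g)
    return None

f = [2, 2, 2]  # 2x^2 - x - 1, written with coefficients in {0, 1, 2}
g = inverse(f)
print(g, multiply(f, g))  # the inverse found, and the product as a sanity check
```

Running `inverse` over every nonzero triple is also a cheap way to convince yourself that the quotient really is a field.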

In any case, it’s clear that taking a polynomial ring like this and quotienting by a monic irreducible polynomial gives you a field. We just control the size of that field by choosing the degree of the irreducible polynomial to our satisfaction. And that’s how we get all finite fields!

## One Last Word on Irreducible Polynomials

One thing we’ve avoided is the question of why irreducible monic polynomials exist of all possible degrees $m$ over any $\mathbb{Z}/p$ (and as a consequence we can actually construct finite fields of all possible sizes).

Proving this requires a bit of group theory, but it turns out that the polynomial $x^{p^m} - x$ has all degree $m$ monic irreducible polynomials as factors. But perhaps a better question (for computer scientists) is how do we work over a finite field in practice? One way is to work with polynomial arithmetic as we described above, but this has some downsides: it requires us to compute these irreducible monic polynomials (which doesn’t sound so hard, maybe), to do polynomial long division every time we multiply, and to compute inverses by solving a linear system.

But we can do better for some special finite fields, say where the characteristic is 2 (smells like binary) or we’re only looking at $F_{p^2}$. The benefit there is that we aren’t forced to use polynomials. We can come up with some other kind of structure (say, matrices of a special form) which happens to have the same field structure and makes computing operations relatively painless. We’ll see how this is done in the future, and see it applied to cryptography when we continue with our series on elliptic curve cryptography.

Until then!

# Fixing Bugs in “Computing Homology”

A few awesome readers have posted comments in Computing Homology to the effect of, “Your code is not quite correct!” And they’re right! Despite it being almost a year since that post’s publication, I haven’t bothered to test it for more complicated simplicial complexes, or even the basic edge cases! When I posted it the mathematics just felt so solid to me that it had to be right (the irony is rich, I know).

As such I’m apologizing for my lack of rigor and explaining what went wrong, the fix, and giving some test cases. As of the publishing of this post, the Github repository for Computing Homology has been updated with the correct code, and some more examples.

The main subroutine was the simultaneousReduce function which I’ll post in its incorrectness below

def simultaneousReduce(A, B):
    if A.shape[1] != B.shape[0]:
        raise Exception("Matrices have the wrong shape.")

    numRows, numCols = A.shape # col reduce A

    i,j = 0,0
    while True:
        if i >= numRows or j >= numCols:
            break

        if A[i][j] == 0:
            nonzeroCol = j
            while nonzeroCol < numCols and A[i,nonzeroCol] == 0:
                nonzeroCol += 1

            if nonzeroCol == numCols:
                j += 1
                continue

            colSwap(A, j, nonzeroCol)
            rowSwap(B, j, nonzeroCol)

        pivot = A[i,j]
        scaleCol(A, j, 1.0 / pivot)
        scaleRow(B, j, 1.0 / pivot)

        for otherCol in range(0, numCols):
            if otherCol == j:
                continue
            if A[i, otherCol] != 0:
                scaleAmt = -A[i, otherCol]
                colCombine(A, otherCol, j, scaleAmt)
                rowCombine(B, j, otherCol, -scaleAmt)

        i += 1; j+= 1

    return A,B


It’s a beast of a function, and the persnickety detail was just as beastly: this snippet should have an $i += 1$ instead of a $j$.

if nonzeroCol == numCols:
    j += 1
    continue


This is simply what happens when we’re looking for a nonzero entry in a row to use as a pivot for the corresponding column, but we can’t find one and have to move to the next row. A stupid error on my part that would be easily caught by proper test cases.

The next mistake is a mathematical misunderstanding. In short, the simultaneous column/row reduction process is not enough to get the $\partial_{k+1}$ matrix into the right form! Let’s see this with a nice example, a triangulation of the Mobius band. There are a number of triangulations we could use, many of which are seen in these slides. The one we’ll use is the following.

Its first and second boundary maps are as follows (in code, because latex takes too much time to type out)

mobiusD1 = numpy.array([
[-1,-1,-1,-1, 0, 0, 0, 0, 0, 0],
[ 1, 0, 0, 0,-1,-1,-1, 0, 0, 0],
[ 0, 1, 0, 0, 1, 0, 0,-1,-1, 0],
[ 0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
])

mobiusD2 = numpy.array([
[ 1, 0, 0, 0, 1],
[ 0, 0, 0, 1, 0],
[-1, 0, 0, 0, 0],
[ 0, 0, 0,-1,-1],
[ 0, 1, 0, 0, 0],
[ 1,-1, 0, 0, 0],
[ 0, 0, 0, 0, 1],
[ 0, 1, 1, 0, 0],
[ 0, 0,-1, 1, 0],
[ 0, 0, 1, 0, 0],
])


And if we were to run the above code on it we’d get a first Betti number of zero (which is incorrect, its first homology group has rank 1). Here are the reduced matrices.

>>> A1, B1 = simultaneousReduce(mobiusD1, mobiusD2)
>>> A1
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
>>> B1
array([[ 0,  0,  0,  0,  0],
[ 0,  0,  0,  0,  0],
[ 0,  0,  0,  0,  0],
[ 0,  0,  0,  0,  0],
[ 0,  1,  0,  0,  0],
[ 1, -1,  0,  0,  0],
[ 0,  0,  0,  0,  1],
[ 0,  1,  1,  0,  0],
[ 0,  0, -1,  1,  0],
[ 0,  0,  1,  0,  0]])


The first reduced matrix looks fine; there’s nothing we can do to improve it. But the second one is not quite fully reduced! Notice that rows 5, 8 and 10 are not linearly independent. So we need to further row-reduce the nonzero part of this matrix before we can read off the true rank in the way we described last time. This isn’t so hard (we just need to reuse the old row-reduce function we’ve been using), but why is this allowed? It’s just because the corresponding column operations for those row operations are operating on columns of all zeros! So we need not worry about screwing up the work we did in column reducing the first matrix, as long as we only work with the nonzero rows of the second.
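The dependence among those three rows is easy to confirm numerically. This is just a quick check on the rows copied from the output above, using numpy’s built-in rank computation rather than any of the post’s own helpers.

```python
import numpy

# Rows 5, 8 and 10 (1-indexed) of the reduced matrix B1 shown above.
row5 = numpy.array([0, 1, 0, 0, 0])
row8 = numpy.array([0, 1, 1, 0, 0])
row10 = numpy.array([0, 0, 1, 0, 0])

# Row 8 is the sum of the other two, so the three rows only span a plane.
print(numpy.array_equal(row8, row5 + row10))
print(numpy.linalg.matrix_rank(numpy.array([row5, row8, row10])))
```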

Of course, nothing is stopping us from ignoring the “corresponding” column operations, since we know we’re already done there. So we just have to finish row reducing this matrix.

This changes our bettiNumber function by adding a single call to a row-reduce function which we name so as to be clear what’s happening. The resulting function is

def bettiNumber(d_k, d_kplus1):
    A, B = numpy.copy(d_k), numpy.copy(d_kplus1)
    simultaneousReduce(A, B)
    finishRowReducing(B)

    dimKChains = A.shape[1]
    kernelDim = dimKChains - numPivotCols(A)
    imageDim = numPivotRows(B)

    return kernelDim - imageDim


And running this on our Mobius band example gives:

>>> bettiNumber(mobiusD1, mobiusD2)
1


As desired. Just to make sure things are going swimmingly under the hood, we can check to see how finishRowReducing does after calling simultaneousReduce

>>> simultaneousReduce(mobiusD1, mobiusD2)
(array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]), array([[ 0,  0,  0,  0,  0],
[ 0,  0,  0,  0,  0],
[ 0,  0,  0,  0,  0],
[ 0,  0,  0,  0,  0],
[ 0,  1,  0,  0,  0],
[ 1, -1,  0,  0,  0],
[ 0,  0,  0,  0,  1],
[ 0,  1,  1,  0,  0],
[ 0,  0, -1,  1,  0],
[ 0,  0,  1,  0,  0]]))
>>> finishRowReducing(mobiusD2)
array([[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])


Indeed, finishRowReducing finishes row reducing the second boundary matrix. Note that it doesn’t preserve how the rows of zeros lined up with the pivot columns of the reduced version of $\partial_1$ as it did in the previous post, but since in the end we’re only counting pivots it doesn’t matter how we switch rows. The “zeros lining up” part is just for a conceptual understanding of how the image lines up with the kernel for a valid simplicial complex.
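Because only the pivot counts matter, the answer can be cross-checked without any of the reduction code. The sketch below is an independent sanity check, not the method of this post: it reads both dimensions off `numpy.linalg.matrix_rank` (a floating-point rank over the reals), and `bettiNumberByRank` is a name invented here.

```python
import numpy

mobiusD1 = numpy.array([
    [-1,-1,-1,-1, 0, 0, 0, 0, 0, 0],
    [ 1, 0, 0, 0,-1,-1,-1, 0, 0, 0],
    [ 0, 1, 0, 0, 1, 0, 0,-1,-1, 0],
    [ 0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
])

mobiusD2 = numpy.array([
    [ 1, 0, 0, 0, 1],
    [ 0, 0, 0, 1, 0],
    [-1, 0, 0, 0, 0],
    [ 0, 0, 0,-1,-1],
    [ 0, 1, 0, 0, 0],
    [ 1,-1, 0, 0, 0],
    [ 0, 0, 0, 0, 1],
    [ 0, 1, 1, 0, 0],
    [ 0, 0,-1, 1, 0],
    [ 0, 0, 1, 0, 0],
])

def bettiNumberByRank(d_k, d_kplus1):
    """dim ker(d_k) - dim im(d_{k+1}), with both dimensions read off from ranks."""
    dimKChains = d_k.shape[1]
    kernelDim = dimKChains - numpy.linalg.matrix_rank(d_k)
    imageDim = numpy.linalg.matrix_rank(d_kplus1)
    return int(kernelDim - imageDim)

print(bettiNumberByRank(mobiusD1, mobiusD2))
```

For small integer matrices like these the floating-point rank is reliable, but on large or badly scaled inputs an exact method is the safer choice.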

In fixing this issue we’ve also fixed an issue another commenter mentioned, that you couldn’t blindly plug in the zero matrix for $\partial_0$ and get zeroth homology (which is the same thing as connected components). After our fix you can.

Of course there still might be bugs, but I have so many drafts lined up on this blog (and research papers to write, experiments to run, theorems to prove), that I’m going to put off writing a full test suite. I’ll just have to update this post with new bug fixes as they come. There’s just so much math and so little time. But extra kudos to my amazing readers who were diligent enough to run examples and spot my error. I’m truly blessed to have you on my side.

Also note that this isn’t the most efficient way to represent the simplicial complex data, or the most efficient row reduction algorithm. If you’re going to run the code on big inputs, I suggest you take advantage of sparse matrix algorithms for doing this sort of stuff. You can represent the simplices as entries in a dictionary and do all sorts of clever optimizations to make the algorithm effectively linear time in the number of simplices.

Until next time!

# How to Conquer Tensorphobia

A professor at Stanford once said,

If you really want to impress your friends and confound your enemies, you can invoke tensor products… People run in terror from the $\otimes$ symbol.

He was explaining some aspects of multidimensional Fourier transforms, but this comment is only half in jest; people get confused by tensor products. It’s often for good reason. People who really understand tensors feel obligated to explain it using abstract language (specifically, universal properties). And the people who explain it in elementary terms don’t really understand tensors.

This post is an attempt to bridge the gap between the elementary and advanced understandings of tensors. We’ll start with the elementary (axiomatic) approach, just to get a good feel for the objects we’re working with and their essential properties. Then we’ll transition to the “universal” mode of thought, with the express purpose of enlightening us as to why the properties are both necessary and natural.

But above all, we intend to be sufficiently friendly so as to not make anybody run in fear. This means lots of examples and preferring words over symbols. Unfortunately, we simply can’t get by without the reader knowing the very basics of linear algebra (the content of our first two primers on linear algebra (1) (2), though the only important part of the second is the definition of an inner product).

So let’s begin.

## Tensors as a Bunch of Axioms

Before we get into the thick of things I should clarify some basic terminology. Tensors are just vectors in a special vector space. We’ll see that such a vector space comes about by combining two smaller vector spaces via a tensor product. So the tensor product is an operation combining vector spaces, and tensors are the elements of the resulting vector space.

Now the use of the word product is quite suggestive, and it may lead one to think that a tensor product is similar or related to the usual direct product of vector spaces. In fact they are related (in a very precise sense), but they are far from similar. If you were pressed, however, you could start with the direct product of two vector spaces and take a mathematical machete to it until it’s so disfigured that you have to give it a new name (the tensor product).

With that image in mind let’s see how that is done. For the sake of generality we’ll talk about two arbitrary finite-dimensional vector spaces $V, W$ of dimensions $n, m$. Recall that the direct product $V \times W$ is the vector space of pairs $(v,w)$ where $v$ comes from $V$ and $w$ from $W$. Recall that addition in this vector space is defined componentwise ($(v_1,w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2)$) and scalar multiplication scales both components $\lambda (v,w) = (\lambda v, \lambda w)$.

To get the tensor product space $V \otimes W$, we make the following modifications. First, we redefine what it means to do scalar multiplication. In this brave new tensor world, scalar multiplication of the whole vector-pair is declared to be the same as scalar multiplication of any component you want. In symbols,

$\displaystyle \lambda (v, w) = (\lambda v, w) = (v, \lambda w)$

for all choices of scalars $\lambda$ and vectors $v, w$. Second, we change the addition operation so that it only works if one of the two components is the same. In symbols, we declare that

$(v, w) + (v', w) = (v + v', w)$

only works because $w$ is the same in both pieces, and with the same rule applying if we switch the positions of $v,w$ above. All other additions are simply declared to be new vectors. I.e. $(x,y) + (z,w)$ is simply itself. It’s a valid addition (we need to be able to add stuff to be a vector space), but you just can’t combine it any further unless you can use the scalar multiplication to factor out some things so that $y=w$ or $x=z$. To say it still one more time, a general element of the tensor product $V \otimes W$ is a sum of these pairs that can or can’t be combined by addition (in general things can’t always be combined).

Finally, we rename the pair $(v,w)$ to $v \otimes w$, to distinguish it from the old vector space $V \times W$ that we’ve totally butchered and reanimated, and we call the tensor product space as a whole $V \otimes W$. Those familiar with this kind of abstract algebra will recognize quotient spaces at work here, but we won’t use that language except to note that we cover quotients and free spaces elsewhere on this blog, and that’s the formality we’re ignoring.

As an example, say we’re taking the tensor product of two copies of $\mathbb{R}$. This means that our space $\mathbb{R} \otimes \mathbb{R}$ is comprised of vectors like $3 \otimes 5$, and moreover that the following operations are completely legitimate.

$3 \otimes 5 + 1 \otimes (-5) = 3 \otimes 5 + (-1) \otimes 5 = 2 \otimes 5$

$6 \otimes 1 + 3\pi \otimes \pi = 3 \otimes 2 + 3 \otimes \pi^2 = 3 \otimes (2 + \pi^2)$

Cool. This seemingly innocuous change clearly has huge implications on the structure of the space. We’ll get to specifics about how different tensors are from regular products later in this post, but for now we haven’t even proved this thing is a vector space. It might not be obvious, but if you go and do the formalities and write the thing as a quotient of a free vector space (as we mentioned we wouldn’t do) then you know that quotients of vector spaces are again vector spaces. So we get that one for free. But even without that it should be pretty obvious: we’re essentially just declaring that all the axioms of a vector space hold when we want them to. So if you were wondering whether

$\lambda (a \otimes b + c \otimes d) = \lambda(a \otimes b) + \lambda(c \otimes d)$

the answer is yes, by force of will.

So just to recall, the axioms of a tensor space $V \otimes W$ are

1. The “basic” vectors are $v \otimes w$ for $v \in V, w \in W$, and they’re used to build up all other vectors.
2. Addition is symbolic, unless one of the components is the same in both addends, in which case $v_1 \otimes w + v_2 \otimes w = (v_1+ v_2) \otimes w$ and $v \otimes w_1 + v \otimes w_2 = v \otimes (w_1 + w_2)$.
3. You can freely move scalar multiples around the components of $v \otimes w$.
4. The rest of the vector space axioms (distributivity, additive inverses, etc) are assumed with extreme prejudice.
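If you want something concrete to poke at, one standard model (an assumption of this sketch, not something we have justified yet) is to represent $v \otimes w$ for coordinate vectors $v \in \mathbb{R}^n, w \in \mathbb{R}^m$ as the outer product, an $n \times m$ array. The helper name `tensor` is made up here. Under that model the axioms above become checkable identities.

```python
import numpy

def tensor(v, w):
    """Model v ⊗ w for coordinate vectors as the outer product, an n-by-m array."""
    return numpy.outer(v, w)

v, v2 = numpy.array([1.0, 2.0]), numpy.array([3.0, -1.0])
w, w2 = numpy.array([0.5, 4.0, -2.0]), numpy.array([1.0, 1.0, 0.0])
lam = 3.0

# Axiom 3: scalars move freely between the components.
assert numpy.allclose(tensor(lam * v, w), tensor(v, lam * w))
assert numpy.allclose(tensor(lam * v, w), lam * tensor(v, w))

# Axiom 2: addition combines when one component is shared.
assert numpy.allclose(tensor(v, w) + tensor(v2, w), tensor(v + v2, w))
assert numpy.allclose(tensor(v, w) + tensor(v, w2), tensor(v, w + w2))

# A sum with no shared component is generally not a single v ⊗ w:
# outer products have rank at most 1, while this sum has rank 2.
print(numpy.linalg.matrix_rank(tensor(v, w) + tensor(v2, w2)))
```

The last line is the matrix version of "addition is symbolic": a rank 2 array cannot be written as one outer product, so that sum cannot be collapsed into a single basic vector.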

Naturally, one can extend this definition to $d$-fold tensor products, like $V_1 \otimes V_2 \otimes \dots \otimes V_d$. Here we write the vectors as sums of things like $v_1 \otimes \dots \otimes v_d$, and we enforce that addition can only be combined if all but one coordinate are the same in the addends, and scalar multiples move around to all coordinates equally freely.

## So where does it come from?!

By now we have this definition and we can play with tensors, but any sane mathematically minded person would protest, “What the hell would cause anyone to come up with such a definition? I thought mathematics was supposed to be elegant!”

It’s an understandable position, but let me now try to convince you that tensor products are very natural. The main intrinsic motivation for the rest of this section will be this:

We have all these interesting mathematical objects, but over the years we have discovered that the maps between objects are the truly interesting things.

A fair warning: although we’ll maintain a gradual pace and informal language in what follows, by the end of this section you’ll be reading more or less mature 20th-century mathematics. It’s quite alright to stop with the elementary understanding (and skip to the last section for some cool notes about computing), but we trust that the intrepid readers will push on.

So with that understanding we turn to multilinear maps. Of course, the first substantive thing we study in linear algebra is the notion of a linear map between vector spaces. That is, a map $f: V \to W$ that respects addition and scalar multiplication (i.e. $f(v + v') = f(v) + f(v')$ and $f(\lambda v) = \lambda f(v)$).

But it turns out that lots of maps we work with have much stronger properties worth studying. For example, if we think of matrix multiplication as an operation, call it $m$, then $m$ takes in two matrices and spits out their product

$m(A,B) = AB$

Now what would be an appropriate notion of linearity for this map? Certainly it is linear in the first coordinate, because if we fix $B$ then

$m(A+C, B) = (A+C)B = AB + CB = m(A,B) + m(C,B)$

And for the same reason it’s linear in the second coordinate. But it is most definitely not linear in both coordinates simultaneously. In other words,

$m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD \neq AC + BD = m(A,C) + m(B,D)$

In fact, the only operation satisfying linearity in its two coordinates separately and also this kind of linearity is the zero map! (Try to prove this as an exercise.) So the strongest kind of linearity we could reasonably impose is that $m$ is linear in each coordinate when all else is fixed. Note that this property allows us to shift around scalar multiples, too. For example,

$\displaystyle m(\lambda A, B) = \lambda AB = A (\lambda B) = m(A, \lambda B) = \lambda m(A,B)$

Starting to see the wispy strands of a connection to tensors? Good, but hold it in for a bit longer. This single-coordinate-wise-linear property is called bilinearity when we only have two coordinates, and multilinearity when we have more.
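Here is a small numerical sketch of bilinearity for matrix multiplication. The specific matrices are arbitrary choices made for this illustration; $A, C$ play the role of first arguments and $B, D$ of second arguments.

```python
import numpy

A = numpy.array([[1, 2], [3, 4]])
C = numpy.array([[0, 1], [1, 0]])
B = numpy.array([[2, 0], [1, 1]])
D = numpy.array([[1, -1], [0, 3]])
lam = 7

def m(X, Y):
    """Matrix multiplication viewed as a map of two arguments."""
    return X @ Y

# Linear in each coordinate separately.
assert numpy.array_equal(m(A + C, B), m(A, B) + m(C, B))
assert numpy.array_equal(m(A, B + D), m(A, B) + m(A, D))
assert numpy.array_equal(m(lam * A, B), m(A, lam * B))
assert numpy.array_equal(m(lam * A, B), lam * m(A, B))

# Not linear in both at once: the cross terms AD + CB are left over.
leftover = m(A + C, B + D) - (m(A, B) + m(C, D))
assert numpy.array_equal(leftover, m(A, D) + m(C, B))
print(leftover)
```

Integer entries keep every comparison exact, so the assertions test the algebra rather than floating-point luck.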

Here are some examples of nice multilinear maps that show up everywhere:

• If $V$ is an inner product space over $\mathbb{R}$, then the inner product is bilinear.
• The determinant of a matrix is a multilinear map if we view the columns of the matrix as vector arguments.
• The cross product of vectors in $\mathbb{R}^3$ is bilinear.

There are many other examples, but you should have at least passing familiarity with these notions, and it’s enough to convince us that multilinearity is worth studying abstractly.
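The three examples above can be spot-checked numerically. This is a sketch on a handful of arbitrarily chosen vectors in $\mathbb{R}^3$, so it illustrates the identities rather than proving them; `det_of_columns` is a helper name invented here.

```python
import numpy

u = numpy.array([1.0, 2.0, 0.0])
v = numpy.array([0.0, 1.0, 3.0])
w = numpy.array([2.0, -1.0, 1.0])
x = numpy.array([1.0, 1.0, 1.0])
lam = 2.5

def det_of_columns(a, b, c):
    """Determinant of the 3-by-3 matrix whose columns are a, b, c."""
    return numpy.linalg.det(numpy.column_stack([a, b, c]))

# Inner product: linear in the first slot with the second fixed.
assert numpy.isclose(numpy.dot(u + lam * v, w),
                     numpy.dot(u, w) + lam * numpy.dot(v, w))

# Determinant: linear in the first column with the others fixed.
assert numpy.isclose(det_of_columns(u + lam * x, v, w),
                     det_of_columns(u, v, w) + lam * det_of_columns(x, v, w))

# Cross product: linear in the second slot with the first fixed.
assert numpy.allclose(numpy.cross(u, v + lam * w),
                      numpy.cross(u, v) + lam * numpy.cross(u, w))
```

Each check fixes all but one argument, which is precisely the "linear in each coordinate when all else is fixed" condition.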

And so what tensors do is give a sort of classification of multilinear maps. The idea is that every multilinear map $f$ from a product vector space $U_1 \times \dots \times U_d$ to any vector space $Y$ can be written first as a multilinear map to the tensor space

$\displaystyle \alpha : U_1 \times \dots \times U_d \to U_1 \otimes \dots \otimes U_d$

Followed by a linear map to $Y$,

$\displaystyle \hat{f} : U_1 \otimes \dots \otimes U_d \to Y$

And the important part is that $\alpha$ doesn’t depend on the original $f$ (but $\hat{f}$ does). One usually draws this as a single diagram:

And to say this diagram commutes is to say that all possible ways to get from one point to another are equivalent (the compositions of the corresponding maps you follow are equal, i.e. $f = \hat{f} \alpha$).

In fuzzy words, the tensor product is like the gatekeeper of all multilinear maps, and $\alpha$ is the gate. Yet another way to say this is that $\alpha$ is the most general possible multilinear map that can be constructed from $U_1 \times \dots \times U_d$. Moreover, the tensor product itself is uniquely defined by having a “most-general” $\alpha$ (up to isomorphism). This notion is often referred to by mathematicians as the “universal property” of the tensor product. And they might say something like “the tensor product is initial with respect to multilinear mappings from the standard product.” We discuss language like this in detail in this blog’s series on category theory, but it’s essentially a super-compact (and almost too vague) way of saying what the diagram says.

Let’s explore this definition when we specialize to a tensor of two vector spaces, and it will give us a good understanding of $\alpha$ (which is really incredibly simple, but people like to muck it up with choices of coordinates and summations). So fix $V, W$ as vector spaces and look at the diagram

What is $\alpha$ in this case? Well it just sends $(v,w) \mapsto v \otimes w$. Is this map multilinear? Well if we fix $w$ then

$\displaystyle \alpha(v_1 + v_2, w) = (v_1 + v_2) \otimes w = v_1 \otimes w + v_2 \otimes w = \alpha(v_1, w) + \alpha (v_2, w)$

and

$\displaystyle \alpha(\lambda v, w) = (\lambda v) \otimes w = (\lambda) (v \otimes w) = \lambda \alpha(v,w)$

And our familiarity with tensors now tells us that the other side holds too. Actually, rather than say this is a result of our “familiarity with tensors,” the truth is that this is how we know that we need to define the properties of tensors as we did. It’s all because we designed tensors to be the gatekeepers of multilinear maps!

So now let’s prove that all multilinear maps $f : V \times W \to Y$ can be decomposed into an $\alpha$ part and a $\hat{f}$ part. To do this we need to know what data uniquely defines a multilinear map. For usual linear maps, all we had to do was define the effect of the map on each element of a basis (the rest was uniquely determined by the linearity property). We know what a basis of $V \times W$ is: it’s just the union of the bases of the pieces. Say that $V$ has a basis $v_1, \dots, v_n$ and $W$ has $w_1, \dots, w_m$; then a basis for the product is just $((v_1, 0), \dots, (v_n,0), (0,w_1), \dots, (0,w_m))$.

But multilinear maps are more nuanced, because they have two arguments. In order to say “what they do on a basis” we really need to know how they act on all possible pairs of basis elements. For how else could we determine $f(v_1 + v_2, w_1)$? If there are $n$ of the $v_i$’s and $m$ of the $w_j$’s, then there are $nm$ such pairs $f(v_i, w_j)$.

Uncoincidentally, as $V \otimes W$ is a vector space, its basis can also be constructed in terms of the bases of $V$ and $W$. You simply take all possible tensors $v_i \otimes w_j$. Since every $v \in V, w \in W$ can be written in terms of their bases, it’s clear that any tensor $\sum_{k} a_k \otimes b_k$ can also be written in terms of the basis tensors $v_i \otimes w_j$ (by simply expanding each $a_k, b_k$ in terms of their respective bases, and getting a larger sum of more basic tensors).

Just to drive this point home, if $(e_1, e_2, e_3)$ is a basis for $\mathbb{R}^3$, and $(g_1, g_2)$ a basis for $\mathbb{R}^2$, then the tensor space $\mathbb{R}^3 \otimes \mathbb{R}^2$ has basis

$(e_1 \otimes g_1, e_1 \otimes g_2, e_2 \otimes g_1, e_2 \otimes g_2, e_3 \otimes g_1, e_3 \otimes g_2)$

It’s a theorem that finite-dimensional vector spaces of equal dimension are isomorphic, so the length of this basis (6) tells us that $\mathbb{R}^3 \otimes \mathbb{R}^2 \cong \mathbb{R}^6$.
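One concrete coordinate model of this (an illustration, not the only possible one) is numpy’s Kronecker product: if we represent $e_i \otimes g_j$ by `np.kron(e[i], g[j])`, the six basis tensors become the six standard basis vectors of $\mathbb{R}^6$.

```python
import numpy as np

e = np.eye(3)  # rows e[0], e[1], e[2] are a basis of R^3
g = np.eye(2)  # rows g[0], g[1] are a basis of R^2

# Model the tensor e_i (x) g_j as the Kronecker product of coordinate vectors.
basis = [np.kron(e[i], g[j]) for i in range(3) for j in range(2)]
B = np.array(basis)

print(B.shape)                    # six vectors, each of length six
print(np.linalg.matrix_rank(B))   # they are linearly independent
```

In this ordering the stacked basis vectors form the $6 \times 6$ identity matrix, which is one way to see the isomorphism with $\mathbb{R}^6$.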

So fine, back to decomposing $f$. All we have left to do is use the data given by $f$ (the effect on pairs of basis elements) to define $\hat{f} : V \otimes W \to Y$. The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space ($v_i \otimes w_j$) and the definition of $f$ ($f(v_i, w_j)$) are essentially the same.

That is, just take $\hat{f}(v_i \otimes w_j) = f(v_i, w_j)$. Note that this is just defined on the basis elements, and so we extend to all other vectors in the tensor space by imposing linearity (defining $\hat{f}$ to split across sums of tensors as needed). Is this well defined? Well, multilinearity of $f$ forces it to be so. For if we had two equal tensors, say, $\lambda v \otimes w = v \otimes \lambda w$, then we know that $f$ has to respect their equality, because $f(\lambda v, w) = f(v, \lambda w)$, so $\hat{f}$ will take the same value on equal tensors regardless of which representative we pick (where we decide to put the $\lambda$). The same idea works for sums, so everything checks out, and $f$ is equal to $\hat{f} \alpha$, as desired. Moreover, we didn’t make any choices in constructing $\hat{f}$. If you retrace our steps in the argument, you’ll see that everything was essentially decided for us once we fixed a choice of a basis (by our wise decisions in defining $V \otimes W$). Since the construction would be isomorphic if we changed the basis, our choice of $\hat{f}$ is unique.
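Here is a minimal numerical sketch of the decomposition, under the assumption that $V = \mathbb{R}^3$, $W = \mathbb{R}^2$, $Y = \mathbb{R}^4$ with standard bases, and that we model $v \otimes w$ by `np.kron(v, w)`. The array `T` of values $f(v_i, w_j)$ is random, and all the names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m_dim, y_dim = 3, 2, 4

# The data defining a bilinear map: T[i, j] is the vector f(v_i, w_j) in Y.
T = rng.standard_normal((n, m_dim, y_dim))

def f(v, w):
    """The bilinear map determined by its values on pairs of basis vectors."""
    return np.einsum("i,j,ijk->k", v, w, T)

def alpha(v, w):
    """The universal bilinear map (v, w) -> v (x) w, in coordinates."""
    return np.kron(v, w)

# f_hat is an honest linear map on the 6-dimensional tensor space:
# its matrix just lists the same values f(v_i, w_j), one row per basis tensor.
F_hat = T.reshape(n * m_dim, y_dim)

def f_hat(t):
    return t @ F_hat

v1, v2 = rng.standard_normal(n), rng.standard_normal(n)
w1, w2 = rng.standard_normal(m_dim), rng.standard_normal(m_dim)
lam = 3.0

factors = np.allclose(f(v1, w1), f_hat(alpha(v1, w1)))
same_tensor = np.allclose(alpha(lam * v1, w1), alpha(v1, lam * w1))
splits_sums = np.allclose(f_hat(alpha(v1, w1) + alpha(v2, w2)),
                          f(v1, w1) + f(v2, w2))
print(factors, same_tensor, splits_sums)
```

The point of the sketch is that `f` and `f_hat` are built from exactly the same $nm$ pieces of data; only the domain they are considered on changes.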

There is a lot more to say about tensors, and indeed there are some other useful ways to think about tensors that we’ve completely ignored. But this discussion should make it clear why we define tensors the way we do. Hopefully it eliminates most of the mystery in tensors, although there is still a lot of mystery in trying to compute stuff using tensors. So we’ll wrap up this post with a short discussion about that.

## Computability and Stuff

It should be clear by now that plain product spaces $V \times W$ and tensor product spaces $V \otimes W$ are extremely different. In fact, they’re only related in that their underlying sets of vectors are built from pairs of vectors in $V$ and $W$. Avid readers of this blog will also know that operations involving matrices (like row reduction, eigenvalue computations, etc.) are generally efficient, or at least they run in polynomial time so they’re not crazy impractically slow for modest inputs.

On the other hand, it turns out that almost every question you might want to ask about tensors is difficult to answer computationally. As with the definition of the tensor product, this is no mere coincidence. There is something deep going on with tensors, and it has serious implications regarding quantum computing. More on that in a future post, but for now let’s just focus on one hard problem to answer for tensors.

As you know, the most general way to write an element of a tensor space $U_1 \otimes \dots \otimes U_d$ is as a sum of the basic-looking tensors.

$\displaystyle \sum_{k} a_{1,k} \otimes a_{2,k} \otimes \dots \otimes a_{d,k}$

where the $a_{i,k}$ may be sums of vectors from $U_i$ themselves. But as we saw with our examples over $\mathbb{R}$, there can be lots of different ways to write a tensor. If you’re lucky, you can write the entire tensor as a one-term sum, that is just a tensor $a \otimes b$. If you can do this we call the tensor a pure tensor, or a rank 1 tensor. We then have the following natural definition and problem:

Definition: The rank of a tensor $x \in U_1 \otimes \dots \otimes U_d$ is the minimum number of terms in any representation of $x$ as a sum of pure tensors. The one exception is the zero element, which has rank zero by convention.

Problem: Given a tensor $x \in k^{n_1} \otimes k^{n_2} \otimes k^{n_3}$ where $k$ is a field, compute its rank.

Of course this isn’t possible in standard computing models unless you can represent the elements of the field (and hence the elements of the vector space in question) in a computer program. So we restrict $k$ to be either the rational numbers $\mathbb{Q}$ or a finite field $\mathbb{F}_{q}$.

Even though the problem is simple to state, it was proved in 1990 (a result of Johan Håstad) that tensor rank is hard to compute. Specifically, the theorem is that

Theorem: Computing tensor rank is NP-hard when $k = \mathbb{Q}$ and NP-complete when $k$ is a finite field.

The details are given in Håstad’s paper, but the important work that followed essentially showed that most problems involving tensors are hard to compute (many of them by reduction from computing rank). This is unfortunate, but also displays the power of tensors. In fact, tensors are so powerful that many believe understanding them better will lead to insight in some very important problems, like finding faster matrix multiplication algorithms or proving circuit lower bounds (which is closely related to P vs NP). Finding low-rank tensor approximations is also a key technique in a lot of recent machine learning and data mining algorithms.
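To make the definition of rank concrete, it helps to look at the easy case of only two factors, which is not the hard problem above. There a tensor in $\mathbb{R}^3 \otimes \mathbb{R}^2$ can be modeled as a $3 \times 2$ matrix (a sum of outer products), and its tensor rank agrees with the usual matrix rank. The sketch below assumes that model and uses numpy’s `matrix_rank`.

```python
import numpy as np

e = np.eye(3)
g = np.eye(2)

# A four-term sum that collapses to the single pure tensor (e1+e2) (x) (g1+g2).
expanded = sum(np.outer(e[i], g[j]) for i in (0, 1) for j in (0, 1))
pure = np.outer(e[0] + e[1], g[0] + g[1])

# A two-term sum that cannot be written with fewer terms.
two_terms = np.outer(e[0], g[0]) + np.outer(e[1], g[1])

rank_expanded = np.linalg.matrix_rank(expanded)
rank_two_terms = np.linalg.matrix_rank(two_terms)
rank_zero = np.linalg.matrix_rank(np.zeros((3, 2)))
print(rank_expanded, rank_two_terms, rank_zero)
```

With three or more factors there is no analogue of row reduction to fall back on, which is where the hardness comes from.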

With this in mind, the enterprising reader will probably agree that understanding tensors is both valuable and useful. In the future of this blog we’ll hope to see some of these techniques, but at the very least we’ll see the return of tensors when we delve into quantum computing.

Until next time!

# The Two-Dimensional Fourier Transform and Digital Watermarking

We’ve studied the Fourier transform quite a bit on this blog: with four primers and the Fast Fourier Transform algorithm under our belt, it’s about time we opened up our eyes to higher dimensions.

Indeed, in the decades since Cooley & Tukey’s landmark paper, the most interesting applications of the discrete Fourier transform have occurred in dimensions greater than 1. But for all our work we haven’t yet discussed what it means to take an “n-dimensional” Fourier transform. Our past toiling and troubling will pay off, though, because the higher Fourier transform and its 1-dimensional cousin are quite similar. Indeed, the shortest way to describe the $n$-dimensional transform is as the 1-dimensional transform with inner products of vector variables replacing regular products of variables.

In this post we’ll flesh out these details. We’ll define the multivariable Fourier transform and its discrete partner, implement an algorithm to compute it (FFT-style), and then apply the transform to the problem of digitally watermarking images.

As usual, all the code, images, and examples used in this post are available on this blog’s Github page.

## Sweeping Some Details Under the Rug

We spent our first and second primers on Fourier analysis describing the Fourier series in one variable, and taking a limit of the period to get the Fourier transform in one variable. By all accounts, it was a downright mess of notation and symbol manipulation that culminated in the realization that the Fourier series looks a lot like a Riemann sum. So it was in one dimension, it is in arbitrary dimension, but to save our stamina for the applications we’re going to treat the $n$-dimensional transform differently. We’ll use the 1-dimensional transform as a model, and magically generalize it to operate on a vector-valued variable. Then the reader will take it on faith that we could achieve the same end as a limit of some kind of multidimensional Fourier series (and all that nonsense with Schwartz functions and tempered distributions is left to the analysts), or if not we’ll provide external notes with the full details.

So we start with a real-valued (or complex-valued) function $f : \mathbb{R}^n \to \mathbb{R}$, and we write the variable as $x = (x_1, \dots, x_n)$, so that we can stick to using the notation $f(x)$. Rather than think of the components of $x$ as “time variables” as we did in the one-dimensional case, we’ll usually think of $x$ as representing physical space. And so the periodic behavior of the function $f$ represents periodicity in space. On the other hand our transformed variables will be “frequency” in space, and this will correspond to a vector variable $\xi = (\xi_1, \dots, \xi_n)$. We’ll come back to what the heck “periodicity in space” means momentarily.

Remember that in one dimension the Fourier transform was defined by

$\displaystyle \mathscr{F}f(s) = \int_{-\infty}^\infty e^{-2\pi ist}f(t) dt$.

And its inverse transform was

$\displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^\infty e^{2\pi ist}g(s) ds$.

Indeed, with the vector $x$ replacing $t$ and $\xi$ replacing $s$, we have to figure out how to make an analogous definition. The obvious thing to do is to take the place where $st$ is multiplied and replace it with the inner product of $x$ and $\xi$, which for this post I’ll write $x \cdot \xi$ (usually I write $\left \langle x, \xi \right \rangle$). This gives us the $n$-dimensional transform

$\displaystyle \mathscr{F}f(\xi) = \int_{\mathbb{R}^n} e^{-2\pi i x \cdot \xi}f(x) dx$,

and its inverse

$\displaystyle \mathscr{F}^{-1}g(x) = \int_{\mathbb{R}^n} e^{2\pi i x \cdot \xi}g( \xi ) d \xi$

Note that the integral is over all of $\mathbb{R}^n$. To give a clarifying example, if we are in two dimensions we can write everything out in coordinates: $x = (x_1, x_2), \xi = (\xi_1, \xi_2)$, and the formula for the transform becomes

$\displaystyle \mathscr{F}f(\xi_1, \xi_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-2 \pi i (x_1 \xi_1 + x_2 \xi_2)} f(x_1, x_2) dx_1 dx_2$.

Now that’s a nasty integral if I’ve ever seen one. But for our purposes in this post, this will be as nasty as it gets, for we’re primarily concerned with image analysis. So representing things as vectors of arbitrary dimension is more compact, and we don’t lose anything for it.

## Periodicity in Space? It’s All Mostly the Same

Because arithmetic with vectors and arithmetic with numbers is so similar, it turns out that most of the properties of the 1-dimensional Fourier transform hold in arbitrary dimension. For example, the duality of the Fourier transform and its inverse holds, because for vectors $e^{-2 \pi i x \cdot (-\xi)} = e^{2 \pi i x \cdot \xi}$. So just like in one dimension, we have

$\mathscr{F}f(-\xi) = \mathscr{F}^{-1}f(\xi)$

And again we have correspondences between algebraic operations: convolution in the spatial domain corresponds to pointwise multiplication in the frequency domain, the spectrum of a real-valued function is symmetric about the origin, etc.

At a more geometric level, though, the Fourier transform does the same sort of thing as it did in the one-dimensional case. Again the complex exponentials form the building blocks of any function we want, and performing a Fourier transform on an $n$-dimensional function decomposes that function into its frequency components. So a function that is perfectly periodic corresponds to a Fourier spectrum that’s perfectly concentrated at a point.

But what the hell, the reader might ask, is ‘periodicity in space’? Since we’re talking about images anyway, the variables we care about (the coordinates of a pixel) are spatial variables. You could, if you were so inclined, have a function of multiple time variables, and to mathematicians a physical interpretation of dimension is just that, an interpretation. But as confusing as it might sound, it’s actually not so hard to understand the Fourier transform when it’s specialized to image analysis. The idea is that complex exponentials $e^{\pm 2 \pi i x \cdot \xi}$ oscillate in the $x$ variable for a fixed $\xi$ (and since $\mathscr{F}$ has $\xi$ as its input, we do want to fix $\xi$). The brief mathematical analysis goes like this: if we fix $\xi$ then the complex exponential is periodic with magnitudinal peaks along parallel lines spaced out at a distance of $1/ \left \| \xi \right \|$ apart. In particular any image is a sum of a bunch of these “complex exponential with a fixed $\xi$” images that look like stripes with varying widths and orientations (what you see here is just the real part of a particular complex exponential).

Any image can be made from a sum of a whole lot of images like the ones on top. They correspond to single points in the Fourier spectrum (and their symmetries), as on bottom.

What you see on top is an image, and on bottom its Fourier spectrum. That is, each brightly colored pixel corresponds to a point $[x_1, x_2]$ with a large magnitude for that frequency component $|\mathscr{F}f[x_1, x_2]|$.

It might be a bit surprising that every image can be constructed as a sum of stripey things, but so was it that any sound can be constructed as a sum of sines and cosines. It’s really just a statement about a basis of some vector space of functions. The long version of this story is laid out beautifully in pages 4 – 7 of these notes. The whole set of notes is wonderful, but this section is mathematically tidy and needs no background; the remainder of the notes outline the details about multidimensional Fourier series mentioned earlier, as well as a lot of other things. In higher dimensions the “parallel lines” idea is much the same, but with lines replaced by hyperplanes normal to the given vector.
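The claim about stripes spaced $1/ \left \| \xi \right \|$ apart is easy to test numerically. In this sketch the frequency vector $\xi = (3, 4)$ and the test point are arbitrary choices, and `wave` is a hypothetical name for the real part of the complex exponential.

```python
import numpy as np

xi = np.array([3.0, 4.0])             # a fixed frequency vector with norm 5
spacing = 1.0 / np.linalg.norm(xi)    # claimed distance between peaks

def wave(x):
    """Real part of exp(2 pi i x . xi) at a point x in the plane."""
    return np.cos(2 * np.pi * np.dot(x, xi))

normal = xi / np.linalg.norm(xi)                        # across the stripes
along = np.array([-xi[1], xi[0]]) / np.linalg.norm(xi)  # along a stripe

x0 = np.array([0.13, -0.4])

# Moving one spacing across the stripes returns to the same value...
repeats_across = np.isclose(wave(x0), wave(x0 + spacing * normal))
# ...while moving any distance along a stripe changes nothing.
constant_along = np.isclose(wave(x0), wave(x0 + 0.37 * along))

print(spacing, repeats_across, constant_along)
```

Evaluating `wave` on a grid of points and plotting the result gives exactly the kind of striped image described above.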

## Discretizing the Transform

Recall that for a continuous function $f$ of one variable, we spent a bit of time figuring out how to find a good discrete approximation of $f$, how to find a good discrete approximation of the Fourier transform $\mathscr{F}f$, and how to find a quick way to transition between the two. In brief: we approximated $f$ by a vector of samples $(f[0], f[1], \dots, f[N-1])$, reconstructed the original function from the samples (a reconstruction which was only correct at the sampled points), and computed the Fourier transform of that, calling it the discrete Fourier transform, or DFT. We got to this definition, using square brackets to denote list indexing (or vector indexing, whatever):

Definition: Let $f = (f[0], \dots, f[N-1])$ be a vector in $\mathbb{R}^N$. Then the discrete Fourier transform of $f$ is defined by the vector $(\mathscr{F}f[0], \dots, \mathscr{F}f[N-1])$, where

$\displaystyle \mathscr{F}f[j] = \sum_{k=0}^{N-1} f[k]e^{-2 \pi i jk/N}$
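Read literally, the definition is a short (and slow, quadratic-time) program. The sketch below is a direct transcription; the sample values are arbitrary, and we compare against `numpy.fft.fft`, which uses the same sign and normalization convention.

```python
import cmath
import numpy as np

def dft(f):
    """Naive DFT straight from the definition: O(N^2) operations."""
    N = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * freq * k / N) for k in range(N))
            for freq in range(N)]

samples = [1.0, 2.0, 0.0, -1.0]
result = dft(samples)
reference = np.fft.fft(samples)

print(np.allclose(result, reference))
```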

Just as with the one-dimensional case, we can do the same analysis and arrive at a discrete approximation of an $n$-dimensional function. Instead of a vector it would be an $N \times N \times \dots \times N$ matrix (really an $n$-dimensional array), where there are $n$ factors in the product, one for each variable. In two dimensions, this means the discrete approximation of a function is a matrix of samples taken at evenly-spaced intervals in both directions.

Sticking with two dimensions, the Fourier transform is then a linear operator taking matrices to matrices (which is called a tensor if you want to scare people). It has its own representation like the one above, where each term is a double sum. In terms of image analysis, we can imagine that each entry of the transform requires us to look at every pixel of the original image.

Definition: Let $f = (f[s,t])$ be a matrix in $\mathbb{R}^{N \times M}$, where $s$ ranges from $0, \dots, N-1$ and $t$ from $0, \dots, M-1$. Then the discrete Fourier transform of $f$ is defined by the matrix $(\mathscr{F}f[x_1, x_2])$, where each entry is given by

$\displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s, t] e^{-2 \pi i (s x_1 / N + t x_2 / M)}$

In the one-dimensional case the inverse transform had a sign change in the exponent and an extra $1/N$ normalization factor. Similarly, in two dimensions the inverse transform has a normalization factor of $1/NM$ (1 over the total number of samples). Again we use a capital $F$ to denote the transformed version of $f$. The higher dimensional transforms are analogous: you get $n$ sums, one for each component, and the normalization factor is the inverse of the total number of samples.

$\displaystyle \mathscr{F}^{-1}F[x_1, x_2] = \frac{1}{NM} \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} F[s,t] e^{2 \pi i (sx_1 / N + tx_2 / M)}$

Unfortunately, the world of the DFT disagrees a lot on the choice of normalization factor. It turns out that all that really matters is that the exponent is negated in the inverse, and that the product of the constant terms on both the transform and its inverse is $1/NM$. So some people will normalize both the Fourier transform and its inverse by $1/ \sqrt{NM}$. The reason for this is that it makes the transform and its inverse more similar-looking (it’s just that, cosmetic). The choice of normalization isn’t particularly important for us, but beware: non-canonical choices are out there, and they do affect formulas by adding multiplicative constants.
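For reference, here is a naive transcription of the two-dimensional transform and its inverse with the $1/NM$ factor placed on the inverse. The names `dft2` and `idft2` are ours, the test matrix is arbitrary, and the quadruple loop is only meant to mirror the formulas, not to be fast. We compare against numpy’s `fft2`, which follows the same convention.

```python
import numpy as np

def dft2(f):
    """Naive 2-D DFT: every output entry looks at every input entry."""
    f = np.asarray(f, dtype=complex)
    N, M = f.shape
    F = np.zeros((N, M), dtype=complex)
    for x1 in range(N):
        for x2 in range(M):
            for s in range(N):
                for t in range(M):
                    F[x1, x2] += f[s, t] * np.exp(-2j * np.pi * (s * x1 / N + t * x2 / M))
    return F

def idft2(F):
    """Naive inverse: positive exponent and a 1/(NM) normalization."""
    F = np.asarray(F, dtype=complex)
    N, M = F.shape
    f = np.zeros((N, M), dtype=complex)
    for x1 in range(N):
        for x2 in range(M):
            for s in range(N):
                for t in range(M):
                    f[x1, x2] += F[s, t] * np.exp(2j * np.pi * (s * x1 / N + t * x2 / M))
    return f / (N * M)

image = np.arange(12, dtype=float).reshape(3, 4)
spectrum = dft2(image)
recovered = idft2(spectrum)

print(np.allclose(spectrum, np.fft.fft2(image)), np.allclose(recovered, image))
```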

## The Fast Fourier Transform, Revisited

Now one might expect that there is another clever algorithm to drastically reduce the runtime of the 2-dimensional DFT, akin to the fast Fourier transform algorithm (FFT). But actually there is almost no additional insight required to understand the “fast” higher dimensional Fourier transform algorithm, because all the work was done for us in the one dimensional case.

All that we do is realize that each of the inner summations is a 1-dimensional DFT. That is, if we write the inner-most sum as a function of two parameters

$\displaystyle g(s, x_2) = \sum_{t=0}^{M-1} f(s,t) e^{-2 \pi i (tx_2 / M)}$

then the 2-dimensional DFT is simply

$\displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} g(s, x_2) e^{-2 \pi i (sx_1/N)}$

But now notice that we can forget that $g(s,x_2)$ was ever a separate, two-dimensional function. Indeed, for a fixed $x_2$ the outer sum is precisely the formula for a 1-dimensional DFT of the values $g(0, x_2), \dots, g(N-1, x_2)$, and for a fixed $s$ the inner sum defining $g(s, x_2)$ is a 1-dimensional DFT as well. And so if we want to compute the 2-dimensional DFT using the 1-dimensional FFT algorithm, we can first compute the matrix of 1-dimensional DFT entries for all choices of $s, x_2$ by fixing each value of $s$ in turn and running FFT on the resulting row of values. If you followed the program from our last FFT post, then the only difficulty is in understanding how the data is shuffled around and which variables are fixed during the computation of the sub-DFTs.

To remedy the confusion, we give an example. Say we have the following 3×3 matrix whose DFT we want to compute. Remember, these values are the sampled values of a 2-variable function.

$\displaystyle \begin{pmatrix} f[0,0] & f[0,1] & f[0,2] \\ f[1,0] & f[1,1] & f[1,2] \\ f[2,0] & f[2,1] & f[2,2] \end{pmatrix}$

The first step in the algorithm is to fix a choice of row, $s$, and compute the DFT of the resulting row. So let’s fix $s = 0$, and then we have the resulting row

$\displaystyle f_0 = (f[0,0], f[0,1], f[0,2])$

Its DFT is computed (intentionally using the same notation as the inner summation above), as

$\displaystyle g[0,x_2] = (\mathscr{F}f_0)[x_2] = \sum_{t=0}^{M-1} f_0[t] e^{- 2 \pi i (t x_2 / M)}$

Note that $f_0[t] = f[s,t]$ for our fixed choice of $s=0$. And so if we do this for all $N$ rows (all 3 rows, in this example), we’ll have performed $N$ FFT’s of size $M$ to get a matrix of values

$\displaystyle \begin{pmatrix} g[0,0] & g[0,1] & g[0,2] \\ g[1,0] & g[1,1] & g[1,2] \\ g[2,0] & g[2,1] & g[2,2] \end{pmatrix}$

Now we want to compute the rest of the 2-dimensional DFT to the end, and it’s easy: now each column consists of the terms in the outermost sum above (since $s$ is the iterating variable). So if we fix a value of $x_2$, say $x_2 = 1$, we get the resulting column

$\displaystyle g_1 = (g[0, 1], g[1,1], g[2,1])$

and computing a DFT on this column gives

$\displaystyle \mathscr{F}f[x_1, 1] = \sum_{s=0}^{N-1} g_1[s] e^{-2 \pi i sx_1 / N}$.

Expanding the definition of $g$ as a DFT gets us back to the original formula for the 2-dimensional DFT, so we know we did it right. In the end we get a matrix of the computed DFT values for all $x_1, x_2$.
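The two rounds can be sketched in a couple of lines if we let numpy’s 1-dimensional `fft` stand in for our own (an assumption made purely for brevity). The `axis` argument selects whether each row or each column is transformed.

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.standard_normal((3, 5))    # N = 3 rows, M = 5 columns

g = np.fft.fft(f, axis=1)          # g[s, x2]: one length-M DFT per row
F = np.fft.fft(g, axis=0)          # F[x1, x2]: one length-N DFT per column

# Doing the columns first and the rows second gives the same answer.
F_other_order = np.fft.fft(np.fft.fft(f, axis=0), axis=1)

print(np.allclose(F, np.fft.fft2(f)), np.allclose(F, F_other_order))
```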

Let’s analyze the runtime of this algorithm: in the first round of DFT’s we computed $N$ DFT’s of size $M$, requiring a total of $O(N M \log M)$, since we know FFT takes time $O(M \log M)$ for a list of length $M$. In the second round we did it the other way around, computing $M$ DFT’s of size $N$ each, giving a total of

$O(NM \log M + NM \log N) = O(NM (\log N + \log M)) = O(NM \log (NM))$

In other words, if the size of the image is $n = NM$, then we are achieving an $O(n \log n)$-time algorithm, which was precisely the speedup that the FFT algorithm gave us in one dimension. We also know a lower bound on this problem: we can’t do better than $NM$ since we have to look at every pixel at least once. So we know that we’re only a logarithmic factor away from a trivial lower bound. And indeed, all other known DFT algorithms have the same runtime. Without any assumptions on the input data (or any parallelization), nobody knows of a faster algorithm.
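To put rough numbers on this, here is a back-of-the-envelope count that ignores constant factors and uses a hypothetical $512 \times 512$ image as the example. Evaluating the double sum directly touches all $NM$ pixels for each of the $NM$ output entries, while the row-column approach needs on the order of $NM \log_2 (NM)$ operations:

$\displaystyle (NM)^2 = (2^{18})^2 = 2^{36} \approx 6.9 \times 10^{10}, \qquad NM \log_2 (NM) = 2^{18} \cdot 18 \approx 4.7 \times 10^{6}$

So for an image of this size the gap is a factor of $2^{18} / 18$, or roughly 14,500.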

Now let’s turn to the code. If we use our FFT algorithm from last time, the pure Python one (read: very slow), then we can implement the 2D Fourier transform in just two lines of Python code. Full disclosure: we left out some numpy stuff in this code for readability. You can view the entire source file on this blog’s Github page.

    # fft is the 1-dimensional FFT from the previous post; transpose swaps rows and columns.
    def fft2d(matrix):
        fftRows = [fft(row) for row in matrix]
        return transpose([fft(row) for row in transpose(fftRows)])


And we can test it on a simple matrix with one nonzero value in it:

    A = [[0,0,0,0], [0,1,0,0], [0,0,0,0], [0,0,0,0]]
    for row in fft2d(A):
        print(', '.join(['%.3f + %.3fi' % (x.real, x.imag) for x in row]))


The output is (reformatted in LaTeX, obviously):

$\displaystyle \begin{pmatrix} 1 & -i & -1 & i \\ -i & -1 & i & 1 \\ -1 & i & 1 & -i \\ i & 1 & -i & -1 \end{pmatrix}$

The reader can verify by hand that this is correct (there’s only one nonzero term in the double sum, so it just boils down to figuring out the complex exponential $e^{-2 \pi i (x_1 + x_2) / 4}$). We leave it as an additional exercise to the reader to implement the inverse transform, as well as to generalize this algorithm to higher dimensional DFTs.
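For readers who want to run the two-line function without the rest of the source file, here is a self-contained sketch. Two things in it are assumptions on our part: `fft` is a stand-in that simply calls numpy’s 1-dimensional FFT (not the pure Python one from the last post), and `transpose` is a small list-of-lists helper. The function `ifft2d` is one possible answer to the first exercise, using the standard trick of conjugating before and after the forward transform.

```python
import numpy as np

def fft(row):
    # Stand-in (assumption): delegate to numpy instead of a hand-written FFT.
    return list(np.fft.fft(row))

def transpose(matrix):
    # Hypothetical helper: swap rows and columns of a list of lists.
    return [list(column) for column in zip(*matrix)]

def fft2d(matrix):
    fftRows = [fft(row) for row in matrix]
    return transpose([fft(row) for row in transpose(fftRows)])

def ifft2d(matrix):
    # Inverse via conjugation: conj(DFT(conj(X))) / (N * M).
    N, M = len(matrix), len(matrix[0])
    conjugated = [[x.conjugate() for x in row] for row in matrix]
    return [[y.conjugate() / (N * M) for y in row] for row in fft2d(conjugated)]

A = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
spectrum = fft2d(A)

# The single nonzero entry at [1, 1] contributes exp(-2 pi i (x1 + x2) / 4).
expected = [[(-1j) ** (x1 + x2) for x2 in range(4)] for x1 in range(4)]
roundtrip = ifft2d(spectrum)

print(np.allclose(spectrum, expected), np.allclose(roundtrip, A))
```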

## Some Experiments and Animations

As we did with the 1-dimensional FFT, we’re now going to switch to using an industry-strength FFT algorithm for the applications. We’ll be using the numpy library and its “fft2” function, along with scipy’s ndimage module for image manipulation. Getting all of this set up was a nightmare (thank goodness for people who guide users like me through this stuff, but even then the headache seemed unending!). As usual, all of the code and images used in the making of this post are available on this blog’s Github page.

And so we can start playing with a sample image, a still from one of my favorite television shows:

The Fourier transform of this image (after we convert it to grayscale) can be computed in python:

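    import numpy
    from scipy import misc, ndimage
    def fourierSpectrumExample(filename):
        # Load the image and flatten it to grayscale (imread and imsave were removed from recent SciPy releases).
        A = ndimage.imread(filename, flatten=True)
        unshiftedfft = numpy.fft.fft2(A)
        spectrum = numpy.log10(numpy.absolute(unshiftedfft) + numpy.ones(A.shape))
        misc.imsave("%s-spectrum-unshifted.png" % (filename.split('.')[0]), spectrum)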


With the result:

The Fourier spectrum of Sherlock and Watson (and London).

A few notes: we use the ndimage library to load the image and flatten the colors to grayscale. Then, after we compute the spectrum, we add 1 to every magnitude (so the logarithm is never taken at zero) and take a logarithm. This is because the raw spectrum values are too massive; plotting them without modification makes the image contrast too high.

Something is odd, though, because the brightest regions are on the edges of the image, where we might expect the highest-frequency elements to be. Actually, it turns out that a raw DFT (as computed by numpy, anyhow) is “shifted.” That is, the indices are much like they were in our original FFT post, so that the “center” of the spectrum (the lowest frequency component) is actually in the corner of the image array.

The numpy folks have a special function designed to alleviate this called fftshift. Applying it before we plot the image gives the following spectrum:

Now that’s more like it. For more details on what’s going on with shifting and how to use the shifting functions, see this matlab thread. (As a side note, the “smudges” in this image are interesting. We wonder what property of the original image contributes to the smudges.)
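To see what the shift does on something small enough to read, here is a toy example: a made-up $4 \times 4$ “spectrum” whose only nonzero entry is the lowest-frequency one in the corner.

```python
import numpy as np

spectrum = np.zeros((4, 4))
spectrum[0, 0] = 1.0    # the lowest-frequency entry sits in the corner

shifted = np.fft.fftshift(spectrum)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(shifted), shifted.shape))

# ifftshift undoes the shift, which matters before calling an inverse transform.
restored = np.fft.ifftshift(shifted)

# The same reordering applied to the frequencies each index represents.
freqs = np.fft.fftshift(np.fft.fftfreq(4))

print(peak)
print(freqs)
```

After the shift the nonzero entry lands at index `(2, 2)`, and the frequencies along each axis run in increasing order with zero in the middle.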

Shifted or unshifted, this image represents the frequency spectrum of the image. In other words, we could take the inverse DFT of each pixel (and its symmetric partner) of this image separately, add them all together, and get back to our original image! We did just that using a different image (one of size 266 x 189, requiring a mere 25137 frequency components), to produce this video of the process:

Many thanks to James Hance for his relentlessly cheerful art (I have a reddish version of this particular masterpiece on my bedroom wall).

For the interested reader, I followed this youtube video’s recommended workflow to make the time-lapsed movie, along with some additional steps to make the videos play side by side. It took quite a while to generate and process the images, and the frames take up a lot of space. So instead of storing all the frames, the interested reader may find the script used to generate the frames on this blog’s Github page (along with all of the rest of the code used in this blog post).
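The reconstruction itself fits in a few lines. This is not the script used for the video, just a minimal sketch on a small random “image” (our stand-in for real pixel data) showing that the single-frequency pieces really do add back up to the original.

```python
import numpy as np

rng = np.random.default_rng(4)
image = rng.random((6, 5))
spectrum = np.fft.fft2(image)
N, M = image.shape

# Add up the inverse DFT of each frequency component taken on its own.
total = np.zeros(image.shape, dtype=complex)
for s in range(N):
    for t in range(M):
        single = np.zeros_like(spectrum)
        single[s, t] = spectrum[s, t]
        total += np.fft.ifft2(single)    # one striped pattern

# A component together with its symmetric partner gives a real-valued image.
pair = np.zeros_like(spectrum)
pair[1, 2] = spectrum[1, 2]
pair[(-1) % N, (-2) % M] = spectrum[(-1) % N, (-2) % M]
stripes = np.fft.ifft2(pair)

print(np.allclose(total, image), np.allclose(stripes.imag, 0))
```

Saving the running value of `total` after each step would give the frames of an animation like the one above.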

## Digital Watermarking

Now we turn to the main application of Fourier transforms to this post, the task of adding an invisible digital watermark to an image. Just in case the reader lives in a cave, a watermark is a security device used to protect the ownership or authenticity of a particular good. Usually they’re used on money to prevent counterfeits, but they’re often applied to high-resolution images on the web to protect copyrights. But perhaps more than just protect existing copyrights, watermarks as they’re used today are ugly, and mostly prevent people from taking the image (paid for or not) in the first place. Here’s an example from a big proponent of ugly watermarks, Shutterstock.com.

Now if you were in the business of copyright litigation, you’d make a lot of money by suing people who took your clients’ images without permission. So rather than prevent people from stealing in the first place, you could put in an invisible watermark into all of your images and then crawl the web looking for stolen images with your watermark. It would be easy enough to automate (Google already did most of the work for you, if you just want to use Google’s search by image feature).

Now I’m more on the side of Fair Use For All, so I wouldn’t hope for a company to actually implement this and make using the internet that much scarier of a place. But the idea makes for an interesting thought experiment and blog post. The idea is simply to modify the spectrum of an image by adding in small, artificial frequency components. That is, the watermarked image will look identical to the original image to a human, but the Fourier spectrum will contain suspicious entries that we can extract if we know where to look.

Implementing the watermarking feature is quite easy, so let’s do that first. Let’s work again with James Hance’s fine artwork.

Let’s call our image’s pixel matrix $A$ and say we’re working with grayscale images for simplicity (for color, we just do the same thing to all three color channels). Then we can define a watermark matrix $W$ by the following procedure:

1. Pick a radius $r$, a length $L$, a watermark strength $\alpha$, and a secret key $k$.
2. Using $k$ as a seed to a random number generator, define a random binary vector $v$ of length $L$.
3. Pick a subset $S$ of the circle of coordinates centered at the image’s center of radius $r$, chosen or rejected based on the entries of $v$.
4. Let $W$ be the matrix of all zeros (of the same dimension as $A$) with 1’s in the entries of $S$.
5. Compute the watermarked image as $\mathscr{F}^{-1}(\mathscr{F}(A) + \alpha W)$. That is, compute the DFT of $A$, add $\alpha W$ to it, and then compute the inverse Fourier transform of the result.

The code for this is simple enough. To create a random vector:

import random

def randomVector(seed, length):
    # Seeding with the secret key makes the same key always produce the same vector.
    random.seed(seed)
    return [random.choice([0, 1]) for _ in range(length)]


To make the watermark (and flesh out all of the technical details of how it’s done):

import math
import numpy

def makeWatermark(imageShape, radius, secretKey, vectorLength=50):
    watermark = numpy.zeros(imageShape)
    center = (int(imageShape[0] / 2) + 1, int(imageShape[1] / 2) + 1)

    # The secret key determines which points of the circle get a 1.
    vector = randomVector(secretKey, vectorLength)

    # Sample vectorLength evenly spaced points on the circle of the given radius.
    x = lambda t: center[0] + int(radius * math.cos(t * 2 * math.pi / vectorLength))
    y = lambda t: center[1] + int(radius * math.sin(t * 2 * math.pi / vectorLength))
    indices = [(x(t), y(t)) for t in range(vectorLength)]

    for i, location in enumerate(indices):
        watermark[location] = vector[i]

    return watermark


We use the usual parameterization of the circle as $t \mapsto (\cos(2 \pi t / n), \sin(2 \pi t / n))$ scaled to the appropriate radius. Here’s what the watermark looks like as a spectrum:

It’s hard to see the individual pixels, so click it to enlarge.

And then applying a given watermark to an image is super simple.

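from numpy.fft import fft2, ifft2, fftshift, ifftshift

def applyWatermark(imageMatrix, watermarkMatrix, alpha):
    # Shift the DFT so the zero frequency sits at the center, like the watermark.
    shiftedDFT = fftshift(fft2(imageMatrix))
    watermarkedDFT = shiftedDFT + alpha * watermarkMatrix
    # Undo the shift before inverting; the result is a complex-valued array.
    watermarkedImage = ifft2(ifftshift(watermarkedDFT))

    return watermarkedImage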


And that’s all there is to it! One might wonder how the choice of $\alpha$ affects the intensity of the watermark, and indeed here we show a few example values of this method applied to Hance’s piece:

Click to enlarge. The effects are most visible in the rightmost image where alpha = 1,000,000

It appears that it’s not until $\alpha$ becomes egregiously large (over 10,000) that we visibly notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background). But it’s good to keep in mind the range of acceptable values when designing a decoding mechanism.
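
One way to make sense of this threshold: the inverse DFT is linear, so $\mathscr{F}^{-1}(\mathscr{F}(A) + \alpha W) = A + \alpha \mathscr{F}^{-1}(W)$, and the change to the image does not depend on $A$ at all. With numpy’s normalization of the inverse transform (a factor of $1/(MN)$ for an $M \times N$ image), each pixel changes by at most $\alpha |S| / (MN)$. Here is a minimal sketch that checks this numerically. The 512×512 size, the 25 nonzero entries, and the helper name `maxPixelChange` are assumptions made for illustration, not values taken from the experiment above.

```python
import numpy as np
from numpy.fft import ifft2, ifftshift

def maxPixelChange(watermark, alpha):
    """Largest per-pixel change from adding alpha * watermark to the shifted DFT."""
    # By linearity the change is ifft2(ifftshift(alpha * W)), independent of the image.
    delta = ifft2(ifftshift(alpha * watermark))
    return np.abs(delta).max()

# Hypothetical stand-in watermark: 25 ones scattered over a 512x512 spectrum.
rng = np.random.default_rng(0)
W = np.zeros((512, 512))
W.flat[rng.choice(W.size, size=25, replace=False)] = 1

for alpha in [100, 10_000, 1_000_000]:
    bound = alpha * W.sum() / W.size  # alpha * |S| / (M * N)
    print(alpha, maxPixelChange(W, alpha), bound)
```

Under these assumptions the bound is roughly 0.95 at $\alpha = 10{,}000$ and roughly 95 at $\alpha = 1{,}000{,}000$, in the same units as the pixel values. For 8-bit grayscale values that would be consistent with the watermark only becoming visible for very large $\alpha$, though the exact numbers depend on the image size and on how many entries of $W$ are nonzero.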

Indeed, a decoding mechanism is conceptually much messier; it’s the art to the encoding mechanism’s science. This paper details one possible way to do it, which is essentially to scale everything up or down to 512×512 pixels and try circles of every possible radius until you find one (or don’t) which is statistically similar to your random vector. And note that since we have the secret key we can generate the exact same random vector. So what the author of that paper suggests is to extract each circle of pixels from the Fourier spectrum, treating it as a single vector with first entry at angle 0. Then you do some statistical magic (compute cross-correlation or some other similarity measure) between the extracted pixels and your secret-key-generated random vector. If they’re sufficiently similar, then you’ve found your watermark, and otherwise there’s no watermark present.
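
To make the similarity step concrete, here is a minimal sketch for a single, already-known radius. It is not the method from the paper: the helper names `circleIndices`, `extractCircle`, and `watermarkScore` are hypothetical, the use of the real part of the spectrum and of a plain correlation coefficient are assumptions, and the test image is random noise rather than a real picture. The search over radii, the rescaling to 512×512, and the choice of a detection threshold are all left out.

```python
import math
import random
import numpy as np
from numpy.fft import fft2, ifft2, fftshift, ifftshift

def circleIndices(shape, radius, n):
    # Same sampling of the circle as makeWatermark in the article.
    cx, cy = int(shape[0] / 2) + 1, int(shape[1] / 2) + 1
    return [(cx + int(radius * math.cos(t * 2 * math.pi / n)),
             cy + int(radius * math.sin(t * 2 * math.pi / n)))
            for t in range(n)]

def extractCircle(image, radius, n):
    # Read the real part of the shifted spectrum along the circle, starting at angle 0.
    spectrum = fftshift(fft2(image))
    return np.array([spectrum[loc].real for loc in circleIndices(image.shape, radius, n)])

def watermarkScore(image, radius, secretKey, n=50):
    # Regenerate the key's vector and correlate it with the extracted entries.
    random.seed(secretKey)
    vector = [random.choice([0, 1]) for _ in range(n)]
    return np.corrcoef(extractCircle(image, radius, n), vector)[0, 1]

# Hypothetical test data: a 64x64 noise "image" watermarked with key 1234 at radius 20.
rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(64, 64))
random.seed(1234)
bits = [random.choice([0, 1]) for _ in range(50)]
W = np.zeros(image.shape)
for loc, bit in zip(circleIndices(image.shape, 20, 50), bits):
    W[loc] = bit
marked = ifft2(ifftshift(fftshift(fft2(image)) + 1e6 * W))  # complex, as in applyWatermark

print("correct key:", watermarkScore(marked, 20, 1234))
print("wrong key:  ", watermarkScore(marked, 20, 4321))
print("unmarked:   ", watermarkScore(image, 20, 1234))
```

The demo keeps the complex-valued output of the inverse transform, which is what `applyWatermark` returns. If only the real part is kept (as it would be in a saved image), the spectrum becomes conjugate-symmetric, so each watermark entry appears at half strength together with a mirrored copy, and a detector has to tolerate that.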

Doing this requires only a few extra lines beyond the code we’ve already presented in this article (numpy does cross-correlation for you), so we leave it as an exercise to the reader: write a program that determines if an image contains our watermark, and test the algorithm on various $\alpha$ and with modifications of the image like rotation, scaling, cropping, and JPEG compression. Part of the benefit of Fourier-based techniques is the resilience of the spectrum to mild applications of these transformations.

Next time we’ll use the Fourier transform to do other cool things to images, like designing filters and combining images in interesting ways.

Until then!