# Learning to Love Complex Numbers

This post is intended for people with a little bit of programming experience and no prior mathematical background.

Numbers are curious things. On one hand, they represent one of the most natural things known to humans, which is quantity. It’s so natural to humans that even newborn babies are in tune with the difference between quantities of objects between 1 and 3, in that they notice when quantity changes much more vividly than other features like color or shape.

But our familiarity with quantity doesn’t change the fact that numbers themselves (as an idea) are a human invention. And they’re not like most human inventions, the kinds where you have to tinker with gears or circuits to get a machine that makes your cappuccino. No, these are mathematical inventions. These inventions exist only in our minds.

Numbers didn’t always exist. A long time ago, back when the Greeks philosophers were doing their philosophizing, negative numbers didn’t exist! In fact, it wasn’t until 1200 AD that the number zero was first considered in Europe. Zero, along with negative numbers and fractions and square roots and all the rest, were invented primarily to help people solve more problems than they could with the numbers they had available. That is, numbers were invented primarily as a way for people to describe their ideas in a useful way. People simply  wondered “is there a number whose square gives you 2?” And after a while they just decided there was and called it $\sqrt{2}$ because they didn’t have a better name for it.

But with these new solutions came a host of new problems. You see, although I said mathematical inventions only exist in our minds, once they’re invented they gain a life of their own. You start to notice patterns in your mathematical objects and you have to figure out why they do the things they do. And numbers are a perfectly good example of this: once I notice that I can multiply a number by itself, I can ask how often these “perfect squares” occur. That is, what’s the pattern in the numbers $1^2, 2^2, 3^2, 4^2, \dots$? If you think about it for a while, you’ll find that square numbers have a very special relationship with odd numbers.

Other times, however, the things you invent turn out to make no sense at all, and you can prove they never existed in the first place! It’s an odd state of affairs, but we’re going to approach the subject of complex numbers from this mindset. We’re going to come up with a simple idea, the idea that negative numbers can be perfect squares, and explore the world of patterns it opens up. Along the way we’ll do a little bit of programming to help explore, give some simple proofs to solidify our intuition, and by the end we’ll see how these ideas can cause wonderful patterns like this one:

## The number i

Let’s bring the story back around to squares. One fact we all remember about numbers is that squaring a number gives you something non-negative. $7^2 = 49, (-2)^2 = 4, 0^2 = 0$, and so on. But it certainly doesn’t have to be this way. What if we got sick of that stupid fact and decided to invent a new number whose square was negative? Which negative, you ask? Well it doesn’t really matter, because I can always stretch it larger or smaller so that it’s square is -1.

Let’s see how: if you say that your made-up number $x$ makes $x^2 = -7$, then I can just use $\frac{x}{\sqrt{7}}$ to get a number whose square is -1. If you’re going to invent a number that’s supposed to interact with our usual numbers, then you have to be allowed to add, subtract, and multiply $x$ with regular old real numbers, and the usual properties would have to still work. So it would have to be true that $(x / \sqrt{7})^2 = x^2 / \sqrt{7}^2 = -7/7 = -1$.

So because it makes no difference (this is what mathematicians mean by, “without loss of generality”) we can assume that the number we’re inventing will have a square of negative one. Just to line up with history, let’s call the new number $i$. So there it is: $i$ exists and $i^2 = -1$. And now that we are “asserting” that $i$ plays nicely with real numbers, we get these natural rules for adding and subtracting and multiplying and dividing. For example

• $1 + i$ is a new number, which we’ll just call $1+i$. And if we added two of these together, $(1+ i) + (1+i)$, we can combine the real parts and the $i$ parts to get $2 + 2i$. Same goes for subtraction. In general a complex number looks like $a + bi$, because as we’ll see in the other points you can simplify every simple arithmetic expression down to just one “real number” part and one “real number times $i$” part.
• We can multiply $3 \cdot i$, and we’ll just call it $3i$, and we require that multiplication distributes across addition (that the FOIL rule works). So that, for example, $(2 - i)(1 + 3i) = (2 + 6i - i - 3i^2) = (2 + 3) + (6i - i) = (5 + 5i)$.
• Dividing is a significantly more annoying. Say we want to figure out what $1 / (1+i)$ is (in fact, it’s not even obvious that this should look like a regular number! But it does). The $1 / a$ notation just means we’re looking for a number which, when we multiply by the denominator $a$, we get back to 1. So we’re looking to find out when $(a + bi)(1 + i) = 1 + 0i$ where $a$ and $b$ are variables we’re trying to solve for. If we multiply it out we get $(a-b) + (a + b)i = 1 + 0i$, and since the real part and the $i$ part have to match up, we know that $a - b = 1$ and $a + b = 0$. If we solve these two equations, we find that $a = 1/2, b = -1/2$ works great. If we want to figure out something like $(2 + 3i) / (1 - i)$, we just find out what $1 / (1- i)$ is first, and then multiply the result by $(2+3i)$.

So that was tedious and extremely boring, and we imagine you didn’t even read it (that’s okay, it really is boring!). All we’re doing is establishing ground rules for the game, so if you come across some arithmetic that doesn’t make sense, you can refer back to this list to see what’s going on. And once again, for the purpose of this post, we’re asserting that all these laws hold. Maybe some laws follow from others, but as long as we don’t come up with any nasty self-contradictions we’ll be fine.

And now we turn to the real questions: is $i$ the only square root of -1? Does $i$ itself have a square root? If it didn’t, we’d be back to where we started, with some numbers (the non-$i$ numbers) having square roots while others don’t. And so we’d feel the need to make all the $i$ numbers happy by making up more numbers to be their square roots, and then worrying what if these new numbers don’t have square roots and…gah!

I’ll just let you in on the secret to save us from this crisis. It turns out that $i$ does have a square root in terms of other $i$ numbers, but in order to find it we’ll need to understand $i$ from a different angle, and that angle turns out to be geometry.

Geometry? How is geometry going to help me understand numbers!?

It’s a valid question and part of why complex numbers are so fascinating. And I don’t mean geometry like triangles and circles and parallel lines (though there will be much talk of angles), I mean transformations in the sense that we’ll be “stretching,” “squishing,” and “rotating” numbers. Maybe another time I can tell you why for me “geometry” means stretching and rotating; it’s a long but very fun story.

The clever insight is that you can represent complex numbers as geometric objects in the first place. To do it, you just think of $a + bi$ as a pair of numbers $(a,b)$, (the pair of real part and $i$ part), and then plot that point on a plane. For us, the $x$-axis will be the “real” axis, and the $y$-axis will be the $i$-axis. So the number $(3 - 4i)$ is plotted 3 units in the positive $x$ direction and 4 units in the negative $y$ direction. Like this:

The “j” instead of “i” is not a typo, but a disappointing fact about the programming language we used to make this image. We’ll talk more about why later.

We draw it as an arrow for a good reason. Stretching, squishing, rotating, and reflecting will all be applied to the arrow, keeping its tail fixed at the center of the axes. Sometimes the arrow is called a “vector,” but we won’t use that word because here it’s synonymous with “complex number.”

So let’s get started squishing stuff.

## Stretching, Squishing, Rotating

Before we continue I should clear up some names. We call a number that has an $i$ in it a complex number, and we call the part without the $i$ the real part (like 2 in $2-i$) and the part with $i$ the complex part.

Python is going to be a great asset for us in exploring complex numbers, so let’s jump right into it. It turns out that Python natively supports complex numbers, and I wrote a program for drawing complex numbers. I used it to make the plot above. The program depends on a library I hate called matplotlib, and so the point of the program is to shield you from as much pain as possible and focus on complex numbers. You can use the program by downloading it from this blog’s Github page, along with everything else I made in writing this post. All you need to know how to do is call a function, and I’ve done a bit of window dressing removal to simplify things (I really hate matplotlib).

# plotComplexNumbers : [complex] -> None
# display a plot of the given list of complex numbers
def plotComplexNumbers(numbers):
...


Before we show some examples of how to use it, we have to understand how to use complex numbers in Python. It’s pretty simple, except that Python was written by people who hate math, and so they decided the complex number would be represented by $j$ instead of $i$ (people who hate math are sometimes called “engineers,” and they use $j$ out of spite. Not really, though).

So in Python it’s just like any other computation. For example:

>>> (1 + 1j)*(4 - 2j) == (6+2j)
True
>>> 1 / (1+1j)
(0.5-0.5j)

And so calling the plotting function with a given list of complex numbers is as simple as importing the module and calling the function

from plotcomplex import plot
plot.plotComplexNumbers([(-1+1j), (1+2j), (-1.5 - 0.5j), (.6 - 1.8j)])


Here’s the result

So let’s use plots like this one to explore what “multiplication by $i$” does to a complex number. It might not seem exciting at first, but I promise there’s a neat punchline.

Even without plotting it’s pretty easy to tell what multiplying by $i$ does to some numbers. It takes 1 to $i$, moves $i$ to $i^2 = -1$, it takes -1 to $-i$, and $-i$ to $-i \cdot i = 1$.

What’s the pattern in these? well if we plot all these numbers, they’re all at right angles in counter-clockwise order. So this might suggest that multiplication by $i$ does some kind of rotation. Is that always the case? Well lets try it with some other more complicated numbers. Click the plots below to enlarge.

Well, it looks close but it’s hard to tell. Some of the axes are squished and stretched, so it might be that our images don’t accurately represent the numbers (the real world can be such a pain). Well when visual techniques fail, we can attempt to prove it.

Clearly multiplying by $i$ does some kind of rotation, maybe with other stuff too, and it shouldn’t be so hard to see that multiplying by $i$ does the same thing no matter which number you use (okay, the skeptical readers will say that’s totally hard to see, but we’ll prove it super rigorously in a minute). So if we take any number and multiply it by $i$ once, then twice, then three times, then four, and if we only get back to where we started at four multiplications, then each rotation had to be a quarter turn.

Indeed,

$\displaystyle (a + bi) i^4 = (ai - b) i^3 = (-a - bi) i^2 = (-ai + b) i = a + bi$

This still isn’t all that convincing, and we want to be 100% sure we’re right. What we really need is a way to arithmetically compute the angle between two complex numbers in their plotted forms. What we’ll do is find a way to measure the angle of one complex number with the $x$-axis, and then by subtraction we can get angles between arbitrary points. For example, in the figure below $\theta = \theta_1 - \theta_2$.

One way to do this is with trigonometry: the geometric drawing of $a + bi$ is the hypotenuse of a right triangle with the $x$-axis.

And so if $r$ is the length of the arrow, then by the definition of sine and cosine, $\cos(\theta) = a/r, \sin(\theta) = b/r$. If we have $r, \theta$, and $r > 0$, we can solve for a unique $a$ and $b$, so instead of representing a complex number in terms of the pair of numbers $(a,b)$, we can represent it with the pair of numbers $(r, \theta)$. And the conversion between the two is just

$a + bi = r \cos(\theta) + (r \sin(\theta)) i$

The $(r, \theta)$ representation is called the polar representation, while the $(a,b)$ representation is called the rectangular representation or the Cartesian representation. Converting between polar and Cartesian coordinates fills the pages of many awful pre-calculus textbooks (despite the fact that complex numbers don’t exist in classical calculus). Luckily for us Python has built-in functions to convert between the two representations for us.

>>> import cmath
>>> cmath.polar(1 + 1j)
(1.4142135623730951, 0.7853981633974483)
>>> z = cmath.polar(1 + 1j)
>>> cmath.rect(z[0], z[1])
(1.0000000000000002+1j)


It’s a little bit inaccurate on the rounding, but it’s fine for our purposes.

So how do we compute the angle between two complex numbers? Just convert each to the polar form, and subtract the second coordinates. So if we get back to our true goal, to figure out what multiplication by $i$ does, we can just do everything in polar form. Here’s a program that computes the angle between two complex numbers.

def angleBetween(z, w):
zPolar, wPolar = cmath.polar(z), cmath.polar(w)
return wPolar[1] - zPolar[1]

print(angleBetween(1 + 1j, (1 + 1j) * 1j))
print(angleBetween(2 - 3j, (2 - 3j) * 1j))
print(angleBetween(-0.5 + 7j, (-0.5 + 7j) * 1j))


Running it gives

1.5707963267948966
1.5707963267948966
-4.71238898038469


Note that the decimal form of $\pi/2$ is 1.57079…, and that the negative angle is equivalent to $\pi/2$ if you add a full turn of $2\pi$ to it. So programmatically we can see that for every input we try multiplying by $i$ rotates 90 degrees.

But we still haven’t proved it works. So let’s do that now. To say what the angle is between $r \cos (\theta) + ri \sin (\theta)$ and $i \cdot [r \cos (\theta) + ri \sin(\theta)] = -r \sin (\theta) + ri \cos(\theta)$, we need to transform the second number into the usual polar form (where the $i$ is on the sine part and not the cosine part). But we know, or I’m telling you now, this nice fact about sine and cosine:

$\displaystyle \sin(\theta + \pi/2) = cos(\theta)$
$\displaystyle \cos(\theta + \pi / 2) = -\sin(\theta)$

This fact is maybe awkward to write out algebraically, but it’s just saying that if you shift the whole sine curve a little bit you get the cosine curve, and if you keep shifting it you get the opposite of the sine curve (and if you kept shifting it even more you’d eventually get back to the sine curve; they’re called periodic for this reason).

So immediately we can rewrite the second number as $r \cos(\theta + \pi/2) + i r \sin (\theta + \pi/2)$. The angle is the same as the original angle plus a right angle of $\pi/2$. Neat!

Applying this same idea to $(a + bi) \cdot (c + di)$, it’s not much harder to prove that multiplying two complex numbers in general multiplies their lengths and adds their angles. So if a complex number $z$ has its magnitude $r$ smaller than 1, multiplying by $z$ squishes and rotates whatever is being multiplied. And if the magnitude is greater than 1, it stretches and rotates. So we have a super simple geometric understanding of how arithmetic with complex numbers works. And as we’re about to see, all this stretching and rotating results in some really weird (and beautifully mysterious!) mathematics and programs.

But before we do that we still have one question to address, the question that started this whole geometric train of thought: does $i$ have a square root? Indeed, I’m just looking for a number such that, when I square its length and double its angle, I get $i = \cos(\pi/2) + i \sin(\pi/2)$. Indeed, the angle we want is $\pi/4$, and the length we want is $r = 1$, which means $\sqrt{i} = \cos(\pi/4) + i \sin(\pi/4)$. Sweet! There is another root if you play with the signs, see if you can figure it out.

In fact it’s a very deeper and more beautiful theorem (“theorem” means “really important fact”) called the fundamental theorem of algebra. And essentially it says that the complex numbers are complete. That is, we can always find square roots, cube roots, or anything roots of numbers involving $i$. It actually says a lot more, but it’s easier to appreciate the rest of it after you do more math than we’re going to do in this post.

On to pretty patterns!

## The Fractal

So here’s a little experiment. Since every point in the plane is the end of some arrow representing a complex number, we can imagine transforming the entire complex plane by transforming each number by the same rule. The most interesting simple rule we can think of: squaring! So though it might strain your capacity for imagination, try to visualize the idea like this. Squaring a complex number is the same as squaring it’s length and doubling its angle. So imagine: any numbers whose arrows are longer than 1 will grow much bigger, arrows shorter than 1 will shrink, and arrows of length exactly one will stay the same length (arrows close to length 1 will grow/shrink much more slowly than those far away from 1). And complex numbers with small positive angles will increase their angle, but only a bit, while larger angles will grow faster.

Here’s an animation made by Douglas Arnold showing what happens to the set of complex numbers $a + bi$ with $0 \leq a, b \leq 1$ or $-1 < a,b < 0$. Again, imagine every point is the end of a different arrow for the corresponding complex number. The animation is for a single squaring, and the points move along the arc they would travel if one rotated/stretched them smoothly.

So that’s pretty, but this is by all accounts a well-behaved transformation. It’s “predictable,” because for example we can always tell which complex numbers will get bigger and bigger (in length) and which will get smaller.

What if, just for the sake of tinkering, we changed the transformation a little bit? That is, instead of sending $z = a+bi$ to $z^2$ (I’ll often write this $z \mapsto z^2$), what if we sent

$\displaystyle z \mapsto z^2 + 1$

Now it’s not so obvious: which vectors will grow and which will shrink? Notice that it’s odd because adding 1 only changes the real part of the number. So a number whose length is greater than 1 can become small under this transformation. For example, $i$ is sent to $0$, so something slightly larger would also be close to zero. Indeed, $5i/4 \mapsto -9/16$.

So here’s an interesting question: are there any complex numbers that will stay small even if I keep transforming like this forever? Specifically, if I call $f(z) = z^2$, $f^2(z) = f(f(z))$, and likewise $f^k(z)$ for $k$ repeated transformations of $z$, is there a number $z$ so that for every $k$ $f^k(z) < 2$? “Obvious” choices like $z=0$ don’t work, and neither do random guesses like $z=i$ or $z=1$. So should we guess the answer is no?

Before we jump to conclusions let’s write a program to see what happens for more than our random guesses. The program is simple: we’ll define the “square plus one” function, and then repeatedly apply that function to a number for some long number of times (say, 250 times). If the length of the number stays under 2 after so many tries, we’ll call it “small forever,” and otherwise we’ll call it “not small forever.”

def squarePlusOne(z):
return z*z + 1

def isSmallForever(z, f):
k = 0

while abs(z) < 2:
z = f(z)
k += 1

if k > 250:
return True

return False


This isSmallForever function is generic: you can give it any function $f$ and it will repeatedly call $f$ on $z$ until the result grows bigger than 2 in length. Note that the abs function is a built-in Python function for computing the length of a complex number.

Then I wrote a classify function, which you can give a window and a small increment, and it will produce a grid of zeros and ones marking the results of isSmallForever. The details of the function are not that important. I also wrote a function that turns the grid into a picture. So here’s an example of how we’d use it:

from plotcomplex.plot import gridToImage

def classifySquarePlusOne(z):
return isSmallForever(z, squarePlusOne)

grid = classify(classifySquarePlusOne) # the other arguments are defaulted to [-2,2], [-2,2], 0.1
gridToImage(grid)


And here’s the result. Points colored black grow beyond 2, and white points stay small for the whole test.

Looks like they’ll always grow big.

So it looks like repeated squaring plus one will always make complex numbers grow big. That’s not too exciting, but we can always make it more exciting. What happens if we replace the 1 in $z^2 + 1$ with a different complex number? For example, if we do $z^2 - 1$ then will things always grow big?

You can randomly guess and see that 0 will never grow big, because $0^2 - 1 = -1$ and $(-1)^2 - 1 = 0$. It will just oscillate forever. So with -1 some numbers will grow and some will not! Let’s use the same routine above to see which:

def classifySquareMinusOne(z):
return isSmallForever(z, squareMinusOne)

grid = classify(classifySquareMinusOne)
gridToImage(grid)


And the result:

Now that’s a more interesting picture! Let’s ramp up the resolution

grid = classify(classifySquareMinusOne, step=0.001)
gridToImage(grid)


Gorgeous. If you try this at home you’ll notice, however, that this took a hell of a long time to run. Speeding up our programs is very possible, but it’s a long story for another time. For now we can just be patient.

Indeed, this image has a ton of interesting details! It looks almost circular in the middle, but if we zoom in we can see that it’s more like a rippling wave

It’s pretty incredible, and a huge question is jumping out at me: what the heck is causing this pattern to occur? What secret does -1 know that +1 doesn’t that makes the resulting pattern so intricate?

But an even bigger question is this. We just discovered that some values of $c$ make $z \mapsto z^2 + c$ result in interesting patterns, and which values do not? Even if we just, say, fix the starting point to zero: what is the pattern in the complex numbers that would tell me when this transformation makes zero blow up, and when it keeps zero small?

Sounds like a job for another program. This time we’ll use a nice little Python feature called a closure, which we define a function that saves the information that exists when it’s created for later. It will let us write a function that takes in $c$ and produces a function that transforms according to $z \mapsto z^2+c$.

def squarePlusC(c):
def f(z):
return z*z + c

return f


And we can use the very same classification/graphing function from before to do this.

def classifySquarePlusC(c):
return isSmallForever(0, squarePlusC(c))

grid = classify(classifySquarePlusC, xRange=(-2, 1), yRange=(-1, 1), step=0.005)
gridToImage(grid)


And the result:

Stunning. This wonderful pattern, which is still largely not understood today, is known as the Mandelbrot set. That is, the white points are the points in the Mandlebrot set, and the black points are not in it. The detail on the border of this thing is infinitely intricate. For example, we can change the window in our little program to zoom in on a particular region.

And if you keep zooming in you keep getting more and more detail. This was true of the specific case of $z^2 - 1$, but somehow the patterns in the Mandelbrot set are much more varied and interesting. And if you keep going down eventually you’ll see patterns that look like the original Mandelbrot set. We can already kind of see that happening above. The name for this idea is a fractal, and the $z^2 - 1$ image has it too. Fractals are a fascinating and mysterious subject studied in a field called discrete dynamical systems. Many people dedicate their entire lives to studying these things, and it’s for good reason. There’s a lot to learn and even more that’s unknown!

So this is the end of our journey for now. I’ve posted all of the code we used in the making of this post so you can continue to play, but here are some interesting ideas.

• The Mandelbrot set (and most fractals) are usually colored. The way they’re colored is as follows. Rather than just say true or false when zero blows up beyond 2 in length, you return the number of iterations $k$ that happened. Then you pick a color based on how big $k$ is. There’s a link below that lets you play with this. In fact, adding colors shows that there is even more intricate detail happening outside the Mandelbrot set that’s too faint to see in our pictures above. Such as this.
• Some very simple questions about fractals are very hard to answer. For example, is the Mandelbrot set connected? That is, is it possible to “walk” from every point in the Mandelbrot set to every other point without leaving the set? Despite the scattering of points in the zoomed in picture above that suggest the answer is no, the answer is actually yes! This is a really difficult thing to prove, however.
• The patterns in many fractals are often used to generate realistic looking landscapes and generate pseudo randomness. So fractals are not just mathematical curiosities.
• You should definitely be experimenting with this stuff! What happens if you change the length threshold from 2 to some bigger number? What about a smaller number? What if you do powers different than $2$? There’s so much to explore!
• The big picture thing to take away from this is that it’s not the numbers themselves that are particularly interesting, it’s the transformations of the numbers that generate these patterns! The interesting questions are what kinds of things are the same under these transformations, and what things are different. This is a very general idea in mathematics, and the more math you do the more you’ll find yourself wondering about useful and bizarre transformations.

For the chance to keep playing with the Mandelbrot set, check out this Mandelbrot grapher that works in your browser. It lets you drag rectangles to zoom further in on regions of interest. It’s really fun.

Until next time!

# A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

## 3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $\wedge$ denotes a logical AND and the vee $\vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $x_i$ or the negation $\overline{x_i}$, and connectives $\wedge$ and $\vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $\vee$‘s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $G$, we’ll construct a set $S_G$ of labeled truth assignments, and we’ll define the distribution $D$ to be the uniform distribution over those truth assignments used in $S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $S_G$ exactly how we labeled them, and we set the allowed error $\varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

## The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $G$ with $n$ nodes $v_1, \dots, v_n$ and a set of $m$ undirected edges $E$, we construct a set of examples with positive labels $S^+$ and one with negative examples $S^-$. The examples are truth assignments to $n$ variables, which we label $x_1, \dots, x_n$, and we identify a truth assignment to the $\left \{ 0,1 \right \}$-valued vector $(x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $S^+$ are simple: for each $v_i$ add a truth assignment $x_i = T, x_j = F$ for $j \neq i$. I.e., the binary vector is $(1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $i$-th position.

The negative examples $S^-$ come from the edges. For each edge $(v_i, v_j) \in E$, we add the example with a zero in the $i$-th and $j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

Claim: $G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $\varphi$.

Again, consistent just means that $\varphi$ is satisfied by every truth assignment in $S^+$ and unsatisfied by every example in $S^-$. Since we chose our distribution to be uniform over $S^+ \cup S^-$, we don’t care what $\varphi$ does elsewhere.

Indeed, if $G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $T_R$ be the AND of all the literals $x_i$ for which vertex $v_i$ is not red. For each such $i$, the corresponding example in $S^+$ will satisfy $T_R$, because we put a zero in the $i$-th position and ones everywhere else. Similarly, no example in $S^-$ will make $T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $(v_1,v_2)$. Then the corresponding negative example is $(0,0,1)$. Unless both $v_1$ and $v_2$ are colored red, one of $x_1, x_2$ will have to be ANDed as part of $T_R$. But the example has a zero for both $x_1$ and $x_2$, so $T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $\varphi$. We need to construct a three coloring of $G$. It goes in largely the same way: label the clauses $\varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $v_i$ the color of the clause that is satisfied by the corresponding example in $S^+$. There must be some clause that does this because $\varphi$ is consistent with $S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $(v_i, v_j)$, and both $v_i$ and $v_j$ are colored, say, blue. Look at the clause $T_B$: since $v_i$ and $v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $T_B$ true. Since those two positive examples differ in both their $i$-th and $j$-th positions, $T_B$ can’t have any of the literals $x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $T_B$ because it has 1’s everywhere except $i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $\delta < 1/2$ and $\varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $1 / |S^+ \cup S^-|$, which is too big.

## Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $\mathsf{C}$ over a set $X$ is efficiently PAC-learnable using the hypothesis class $\mathsf{H}$ if there exists an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all $0 < \delta , \varepsilon < 1/2$, the probability that $A$ produces a hypothesis $h \in \mathsf{H}$ with error at most $\varepsilon$ is at least $1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

# Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

## The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability $p$. If $X$ denotes the number of trials needed to get the first success, then clearly $\Pr(X = k) = (1-p)^{k-1} p$ (since first we need $k-1$ failures which occur independently with probability $1-p$, then we need one success which happens with probability $p$). Thus the expected value of $X$ is

$\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p$

by basic calculus. In particular, if success is defined as getting a six, then $p=1/6$ thus the expected time is $1/p=6$.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is $\frac{1}{26}$, thus the probability of a random eleven-letter word being ABRACADABRA is exactly $\left(\frac{1}{26}\right)^{11}$. So if typing 11 letters is one trial, the expected number of trials is

$\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}$

which means $11\cdot 26^{11}$ keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that $11\cdot 26^{11}$ is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

## Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses$1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive$26 if he wins, since the probability of getting the next letter right is exactly $\frac{1}{26}$ (thus the expected value of the change in the gambler’s fortune is $\frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0$.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as $T$. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly $T$ dollars. How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins$26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins $\26^4$. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be $\26^{11}$. Thus our casino will have to give out $26^{11}+26^4+26$ dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is $T$ dollars, the expected value of our expenses is $26^{11}+26^4+26$ dollars, thus $E(T)=26^{11}+26^4+26$. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

## Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. $X_0$ will denote the gambler’s fortune before the game starts, $X_1$ the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of $X_n$, given $X_1, X_2, \ldots, X_{n-1}$ is the same as $X_{n-1}$. This can be written as $E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1}$ or, equivalently, $E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0$.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

## The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let $X'_n = X_n$ if $n < T$ and $X'_n = X_T$ if $n \ge T$ where $T$ denotes the stopping time, i.e. the time at which the desired event occurs. Notice that $T$ itself is a random variable.

We require our stopping time $T$ to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let $X_n$ be a martingale stopped at step $T$, and suppose one of the following three conditions hold:

1. The stopping time $T$ is almost surely bounded by some constant;
2. The stopping time $T$ is almost surely finite and every step of the stopped martingale $X_n$ is almost surely bounded by some constant; or
3. The expected stopping time $E(T)$ is finite and the absolute value of the martingale increments $|X_n-X_{n-1}|$ are almost surely bounded by a constant.

Then $E(X_T) = E(X_0).$

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than $26^{12}$) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by $26^{11}+26^4+26$. This shows that our solution is indeed correct.

## Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has $a$ dollars, the second player has $b$ dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let $X_n$ denote the change in the second player’s fortune, and set $X_0 = 0$. Let $T_k$ denote the first time $s$ when $X_s = k$. Then our first question can be formalized as trying to determine $\Pr(T_{-b} < T_a)$. Let $t = \min \{ T_{-b}, T_a\}$. Clearly $t$ is a stopping time. By the optional stopping theorem we have that

$\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))$

thus $\Pr(T_{-b} < T_a)=\frac{a}{a+b}$.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: $X_n^2-n$ is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

## A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with $k$-SAT for some $k \ne 3$ instead? Clearly the hardness of the problem is monotone increasing in $k$ since $k$-SAT is a special case of $(k+1)$-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as $k$-SAT for any $k>3$. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals $x$ and $y$ are connected by a directed edge if there is a clause $(\bar x \lor y)$. Recall that $\bar x \lor y$ is equivalent to $x \implies y$, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths $x \to \bar x$ and $\bar x \to x$ (since $x \Leftrightarrow \bar x$ is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after $O(n^2)$ rounds where $n$ denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in $O(n^2)$ steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by $\pm 1$ in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least $\frac{1}{2}$. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond $n$. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number $n$, on the other hand, is called a reflecting barrier: we cannot reach $n+1$, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in $O(n^2)$ steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

# (Finite) Fields — A Primer

So far on this blog we’ve given some introductory notes on a few kinds of algebraic structures in mathematics (most notably groups and rings, but also monoids). Fields are the next natural step in the progression.

If the reader is comfortable with rings, then a field is extremely simple to describe: they’re just commutative rings with 0 and 1, where every nonzero element has a multiplicative inverse. We’ll give a list of all of the properties that go into this “simple” definition in a moment, but an even more simple way to describe a field is as a place where “arithmetic makes sense.” That is, you get operations for $+,-, \cdot , /$ which satisfy the expected properties of addition, subtraction, multiplication, and division. So whatever the objects in your field are (and sometimes they are quite weird objects), they behave like usual numbers in a very concrete sense.

So here’s the official definition of a field. We call a set $F$ a field if it is endowed with two binary operations addition ($+$) and multiplication ($\cdot$, or just symbol juxtaposition) that have the following properties:

• There is an element we call 0 which is the identity for addition.
• Addition is commutative and associative.
• Every element $a \in F$ has a corresponding additive inverse $b$ (which may equal $a$) for which $a + b = 0$.

These three properties are just the axioms of a (commutative) group, so we continue:

• There is an element we call 1 (distinct from 0) which is the identity for multiplication.
• Multiplication is commutative and associative.
• Every nonzero element $a \in F$ has a corresponding multiplicative inverse $b$ (which may equal $a$) for which $ab = 1$.
• Addition and multiplication distribute across each other as we expect.

If we exclude the existence of multiplicative inverses, these properties make $F$ a commutative ring, and so we have the following chain of inclusions that describes it all

$\displaystyle \textup{Fields} \subset \textup{Commutative Rings} \subset \textup{Rings} \subset \textup{Commutative Groups} \subset \textup{Groups}$

The standard examples of fields are the real numbers $\mathbb{R}$, the rationals $\mathbb{Q}$, and the complex numbers $\mathbb{C}$. But of course there are many many more. The first natural question to ask about fields is: what can they look like?

For example, can there be any finite fields? A field $F$ which as a set has only finitely many elements?

As we saw in our studies of groups and rings, the answer is yes! The simplest example is the set of integers modulo some prime $p$. We call them $\mathbb{Z} / p \mathbb{Z},$ or sometimes just $\mathbb{Z}/p$ for short, and let’s rederive what we know about them now.

As a set, $\mathbb{Z}/p$ consists of the integers $\left \{ 0, 1, \dots, p-1 \right \}$. The addition and multiplication operations are easy to define, they’re just usual addition and multiplication followed by a modulus. That is, we add by $a + b \mod p$ and multiply with $ab \mod p$. This thing is clearly a commutative ring (because the integers form a commutative ring), so to show this is a field we need to show that everything has a multiplicative inverse.

There is a nice fact that allows us to do this: an element $a$ has an inverse if and only if the only way for it to divide zero is the trivial way $0a = 0$. Here’s a proof. For one direction, suppose $a$ divides zero nontrivially, that is there is some $c \neq 0$ with $ac = 0$. Then if $a$ had an inverse $b$, then $0 = b(ac) = (ba)c = c$, but that’s very embarrassing for $c$ because it claimed to be nonzero. Now suppose $a$ only divides zero in the trivial way. Then look at all possible ways to multiply $a$ by other nonzero elements of $F$. No two can give you the same result because if $ax = ay$ then (without using multiplicative inverses) $a(x-y) = 0$, but we know that $a$ can only divide zero in the trivial way so $x=y$. In other words, the map “multiplication by $a$” is injective. Because the set of nonzero elements of $F$ is finite you have to hit everything (the map is in fact a bijection), and some $x$ will give you $ax = 1$.

Now let’s use this fact on $\mathbb{Z}/p$ in the obvious way. Since $p$ is a prime, there are no two smaller numbers $a, b < p$ so that $ab = p$. But in $\mathbb{Z}/p$ the number $p$ is equivalent to zero (mod $p$)! So $\mathbb{Z}/p$ has no nontrivial zero divisors, and so every element has an inverse, and so it’s a finite field with $p$ elements.

The next question is obvious: can we get finite fields of other sizes? The answer turns out to be yes, but you can’t get finite fields of any size. Let’s see why.

## Characteristics and Vector Spaces

Say you have a finite field $k$ (lower-case k is the standard letter for a field, so let’s forget about $F$). Beacuse the field is finite, if you take 1 and keep adding it to itself you’ll eventually run out of field elements. That is, $n = 1 + 1 + \dots + 1 = 0$ at some point. How do I know it’s zero and doesn’t keep cycling never hitting zero? Well if at two points $n = m \neq 0$, then $n-m = 0$ is a time where you hit zero, contradicting the claim.

Now we define $\textup{char}(k)$, the characteristic of $k$, to be the smallest $n$ (sums of 1 with itself) for which $n = 0$. If there is no such $n$ (this can happen if $k$ is infinite, but doesn’t always happen for infinite fields), then we say the characteristic is zero. It would probably make more sense to say the characteristic is infinite, but that’s just the way it is. Of course, for finite fields the characteristic is always positive. So what can we say about this number? We have seen lots of example where it’s prime, but is it always prime? It turns out the answer is yes!

For if $ab = n = \textup{char}(k)$ is composite, then by the minimality of $n$ we get $a,b \neq 0$, but $ab = n = 0$. This can’t happen by our above observation, because being a zero divisor means you have no inverse! Contradiction, sucker.

But it might happen that there are elements of $k$ that can’t be written as $1 + 1 + \dots + 1$ for any number of terms. We’ll construct examples in a minute (in fact, we’ll classify all finite fields), but we already have a lot of information about what those fields might look like. Indeed, since every field has 1 in it, we just showed that every finite field contains a smaller field (a subfield) of all the ways to add 1 to itself. Since the characteristic is prime, the subfield is a copy of $\mathbb{Z}/p$ for $p = \textup{char}(k)$. We call this special subfield the prime subfield of $k$.

The relationship between the possible other elements of $k$ and the prime subfield is very neat. Because think about it: if $k$ is your field and $F$ is your prime subfield, then the elements of $k$ can interact with $F$ just like any other field elements. But if we separate $k$ from $F$ (make a separate copy of $F$), and just think of $k$ as having addition, then the relationship with $F$ is that of a vector space! In fact, whenever you have two fields $k \subset k'$, the latter has the structure of a vector space over the former.

Back to finite fields, $k$ is a vector space over its prime subfield, and now we can impose all the power and might of linear algebra against it. What’s it’s dimension? Finite because $k$ is a finite set! Call the dimension $m$, then we get a basis $v_1, \dots, v_m$. Then the crucial part: every element of $k$ has a unique representation in terms of the basis. So they are expanded in the form

$\displaystyle f_1v_1 + \dots + f_mv_m$

where the $f_i$ come from $F$. But now, since these are all just field operations, every possible choice for the $f_i$ has to give you a different field element. And how many choices are there for the $f_i$? Each one has exactly $|F| = \textup{char}(k) = p$. And so by counting we get that $k$ has $p^m$ many elements.

This is getting exciting quickly, but we have to pace ourselves! This is a constraint on the possible size of a finite field, but can we realize it for all choices of $p, m$? The answer is again yes, and in the next section we’ll see how.  But reader be warned: the formal way to do it requires a little bit of familiarity with ideals in rings to understand the construction. I’ll try to avoid too much technical stuff, but if you don’t know what an ideal is, you should expect to get lost (it’s okay, that’s the nature of learning new math!).

## Constructing All Finite Fields

Let’s describe a construction. Take a finite field $k$ of characteristic $p$, and say you want to make a field of size $p^m$. What we need to do is construct a field extension, that is, find a bigger field containing $k$ so that the vector space dimension of our new field over $k$ is exactly $m$.

What you can do is first form the ring of polynomials with coefficients in $k$. This ring is usually denoted $k[x]$, and it’s easy to check it’s a ring (polynomial addition and multiplication are defined in the usual way). Now if I were speaking to a mathematician I would say, “From here you take an irreducible monic polynomial $p(x)$ of degree $m$, and quotient your ring by the principal ideal generated by $p$. The result is the field we want!”

In less compact terms, the idea is exactly the same as modular arithmetic on integers. Instead of doing arithmetic with integers modulo some prime (an irreducible integer), we’re doing arithmetic with polynomials modulo some irreducible polynomial $p(x)$. Now you see the reason I used $p$ for a polynomial, to highlight the parallel thought process. What I mean by “modulo a polynomial” is that you divide some element $f$ in your ring by $p$ as much as you can, until the degree of the remainder is smaller than the degree of $p(x)$, and that’s the element of your quotient. The Euclidean algorithm guarantees that we can do this no matter what $k$ is (in the formal parlance, $k[x]$ is called a Euclidean domain for this very reason). In still other words, the “quotient structure” tells us that two polynomials $f, g \in k[x]$ are considered to be the same in $k[x] / p$ if and only if $f - g$ is divisible by $p$. This is actually the same definition for $\mathbb{Z}/p$, with polynomials replacing numbers, and if you haven’t already you can start to imagine why people decided to study rings in general.

Let’s do a specific example to see what’s going on. Say we’re working with $k = \mathbb{Z}/3$ and we want to compute a field of size $27 = 3^3$. First we need to find a monic irreducible polynomial of degree $3$. For now, I just happen to know one: $p(x) = x^3 - x + 1$. In fact, we can check it’s irreducible, because to be reducible it would have to have a linear factor and hence a root in $\mathbb{Z}/3$. But it’s easy to see that if you compute $p(0), p(1), p(2)$ and take (mod 3) you never get zero.

So I’m calling this new ring

$\displaystyle \frac{\mathbb{Z}/3[x]}{(x^3 - x + 1)}$

It happens to be a field, and we can argue it with a whole lot of ring theory. First, we know an irreducible element of this ring is also prime (because the ring is a unique factorization domain), and prime elements generate maximal ideals (because it’s a principal ideal domain), and if you quotient by a maximal ideal you get a field (true of all rings).

But if we want to avoid that kind of argument and just focus on this ring, we can explicitly construct inverses. Say you have a polynomial $f(x)$, and for illustration purposes we’ll choose $f(x) = x^4 + x^2 - 1$. Now in the quotient ring we could do polynomial long division to find remainders, but another trick is just to notice that the quotient is equivalent to the condition that $x^3 = x - 1$. So we can reduce $f(x)$ by applying this rule to $x^4 = x^3 x$ to get

$\displaystyle f(x) = x^2 + x(x-1) - 1 = 2x^2 - x - 1$

Now what’s the inverse of $f(x)$? Well we need a polynomial $g(x) = ax^2 + bx + c$ whose product with $f$ gives us something which is equivalent to 1, after you reduce by $x^3 - x + 1$. A few minutes of algebra later and you’ll discover that this is equivalent to the following polynomial being identically 1

$\displaystyle (a-b+2c)x^2 + (-3a+b-c)x + (a - 2b - 2c) = 1$

In other words, we get a system of linear equations which we need to solve:

\displaystyle \begin{aligned} a & - & b & + & 2c & = 0 \\ -3a & + & b & - & c &= 0 \\ a & - & 2b & - & 2c &= 1 \end{aligned}

And from here you can solve with your favorite linear algebra techniques. This is a good exercise for working in fields, because you get to abuse the prime subfield being characteristic 3 to say terrifying things like $-1 = 2$ and $6b = 0$. The end result is that the inverse polynomial is $2x^2 + x + 1$, and if you were really determined you could write a program to compute these linear systems for any input polynomial and ensure they’re all solvable. We prefer the ring theoretic proof.

In any case, it’s clear that taking a polynomial ring like this and quotienting by a monic irreducible polynomial gives you a field. We just control the size of that field by choosing the degree of the irreducible polynomial to our satisfaction. And that’s how we get all finite fields!

## One Last Word on Irreducible Polynomials

One thing we’ve avoided is the question of why irreducible monic polynomials exist of all possible degrees $m$ over any $\mathbb{Z}/p$ (and as a consequence we can actually construct finite fields of all possible sizes).

The answer requires a bit of group theory to prove this, but it turns out that the polynomial $x^{p^m} - x$ has all degree $m$ monic irreducible polynomials as factors. But perhaps a better question (for computer scientists) is how do we work over a finite field in practice? One way is to work with polynomial arithmetic as we described above, but this has some downsides: it requires us to compute these irreducible monic polynomials (which doesn’t sound so hard, maybe), to do polynomial long division every time we add, subtract, or multiply, and to compute inverses by solving a linear system.

But we can do better for some special finite fields, say where the characteristic is 2 (smells like binary) or we’re only looking at $F_{p^2}$. The benefit there is that we aren’t forced to use polynomials. We can come up with some other kind of structure (say, matrices of a special form) which happens to have the same field structure and makes computing operations relatively painless. We’ll see how this is done in the future, and see it applied to cryptography when we continue with our series on elliptic curve cryptography.

Until then!