# Methods of Proof — Contrapositive

In this post we’ll cover the second of the “basic four” methods of proof: the contrapositive implication. We will build off our material from last time and start by defining functions on sets.

## Functions as Sets

So far we have become comfortable with the definition of a set, but the most common way to use sets is to construct functions between them. As programmers we readily understand the nature of a function, but how can we define one mathematically? It turns out we can do it in terms of sets, but let us recall the desired properties of a function:

• Every input must have an output.
• Every input can only correspond to one output (the functions must be deterministic).

One might try at first to define a function in terms of subsets of size two. That is, if $A, B$ are sets then a function $f: A \to B$ would be completely specified by

$\displaystyle \left \{ \left \{ x, y \right \} : x \in A, y \in B \right \}$

where to enforce those two bullets, we must impose the condition that every $x \in A$ occurs in one and only one of those subsets. Notationally, we would say that $y = f(x)$ means $\left \{ x, y \right \}$ is a member of the function. Unfortunately, this definition fails miserably when $A = B$, because we have no way to distinguish the input from the output.

To compensate for this, we introduce a new type of object called a tuple. A tuple is just an ordered list of elements, which we write using round brackets, e.g. $(a,b,c,d,e)$.

As a quick aside, one can define ordered tuples in terms of sets. We will leave the reader to puzzle why this works, and generalize the example provided:

$\displaystyle (a,b) = \left \{ a, \left \{ a, b \right \} \right \}$

And so a function $f: A \to B$ is defined to be a list of ordered pairs where the first thing in the pair is an input and the second is an output:

$\displaystyle f = \left \{ (x, y) : x \in A, y \in B \right \}$

Subject to the same conditions, that each $x$ value from $A$ must occur in one and only one pair. And again by way of notation we say $y = f(x)$ if the pair $(x,y)$ is a member of $f$ as a set. Note that the concept of a function having “input and output” is just an interpretation. A function can be viewed independent of any computational ideas as just a set of pairs. Often enough we might not even know how to compute a function (or it might be provably uncomputable!), but we can still work with it abstractly.

It is also common to call functions “maps,” and to define “map” to mean a special kind of function (that is, with extra conditions) depending on the mathematical field one is working in. Even in other places on this blog, “map” might stand for a continuous function, or a homomorphism. Don’t worry if you don’t know these terms off hand; they are just special cases of functions as we’ve defined them here. For the purposes of this series on methods of proof, “function” and “map” and “mapping” mean the same thing: regular old functions on sets.

## Injections

One of the most important and natural properties of a function is that of injectivity.

Definition: A function $f: A \to B$ is an injection if whenever $a \neq a'$ are distinct members of $A$, then $f(a) \neq f(a')$. The adjectival version of the word injection is injective.

As a quick side note, it is often the convention for mathematicians to use a capital letter to denote a set, and a lower-case letter to denote a generic element of that set. Moreover, the apostrophe on the $a'$ is called a prime (so $a'$ is spoken, “a prime”), and it’s meant to denote a variation on the non-prime’d variable $a$ in some way. In this case, the variation is that $a' \neq a$.

So even if we had not explicitly mentioned where the $a, a'$ objects came from, the knowledgeable mathematician (which the reader is obviously becoming) would be reasonably certain that they come from $A$. Similarly, if I were to lackadaisically present $b$ out of nowhere, the reader would infer it must come from $B$.

One simple and commonly used example of an injection is the so-called inclusion function. If $A \subset B$ are sets, then there is a canonical function representing this subset relationship, namely the function $i: A \to B$ defined by $i(a) = a$. It should be clear that non-equal things get mapped to non-equal things, because the function doesn’t actually do anything except change perspective on where the elements are sitting: two nonequal things sitting in $A$ are still nonequal in $B$.

Another example is that of multiplication by two as a map on natural numbers. More rigorously, define $f: \mathbb{N} \to \mathbb{N}$ by $f(x) = 2x$. It is clear that whenever $x \neq y$ as natural numbers then $2x \neq 2y$. For one, $x, y$ must have differing prime factorizations, and so must $2x, 2y$ because we added the same prime factor of 2 to both numbers. Did you catch the quick proof by direct implication there? It was sneaky, but present.

Now the property of being an injection can be summed up by a very nice picture:

A picture example of an injective function.

The arrows above represent the pairs $(x,f(x))$, and the fact that no two arrows end in the same place makes this function an injection. Indeed, drawing pictures like this can give us clues about the true nature of a proposed fact. If the fact is false, it’s usually easy to draw a picture like this showing so. If it’s true, then the pictures will support it and hopefully make the proof obvious. We will see this in action in a bit (and perhaps we should expand upon it later with a post titled, “Methods of Proof — Proof by Picture”).

There is another, more subtle concept associated with injectivity, and this is where its name comes from. The word “inject” gives one the mental picture that we’re literally placing one set $A$ inside another set $B$ without changing the nature of $A$. We are simply realizing it as being inside of $B$, perhaps with different names for its elements. This interpretation becomes much clearer when one investigates sets with additional structure, such as groups, rings, or topological spaces. Here the word “injective mapping” much more literally means placing one thing inside another without changing the former’s structure in any way except for relabeling.

In any case, mathematicians have the bad (but time-saving) habit of implicitly identifying a set with its image under an injective mapping. That is, if $f :A \to B$ is an injective function, then one can view $A$ as the same thing as $f(A) \subset B$. That is, they have the same elements except that $f$ renames the elements of $A$ as elements of $B$. The abuse comes in when they start saying $A \subset B$ even when this is not strictly the case.

Here is an example of this abuse that many programmers commit without perhaps noticing it. Suppose $X$ is the set of all colors that can be displayed on a computer (as an abstract set; the elements are “this particular green,” “that particular pinkish mauve”). Now let $Y$ be the set of all finite hexadecimal numbers. Then there is an obvious injective map from $X \to Y$ sending each color to its 6-digit hex representation. The lazy mathematician would say “Well, then, we might as well say $X \subset Y$, for this is the obvious way to view $X$ as a set of hexadecimal numbers.” Of course there are other ways (try to think of one, and then try to find an infinite family of them!), but the point is that this is the only way that anyone really uses, and that the other ways are all just “natural relabelings” of this way.

The precise way to formulate this claim is as follows, and it holds for arbitrary sets and arbitrary injective functions. If $g, g': X \to Y$ are two such ways to inject $X$ inside of $Y$, then there is a function $h: Y \to Y$ such that the composition $hg$ is precisely the map $g'$. If this is mysterious, we have some methods the reader can use to understand it more fully: give examples for simplified versions (what if there were only three colors?), draw pictures of “generic looking” set maps, and attempt a proof by direct implication.

## Proof by Contrapositive

Often times in mathematics we will come across a statement we want to prove that looks like this:

If X does not have property A, then Y does not have property B.

Indeed, we already have: to prove a function $f: X \to Y$ is injective we must prove:

If x is not equal to y, then f(x) is not equal to f(y).

A proof by direct implication can be quite difficult because the statement gives us very little to work with. If we assume that $X$ does not have property $A$, then we have nothing to grasp and jump-start our proof. The main (and in this author’s opinion, the only) benefit of a proof by contrapositive is that one can turn such a statement into a constructive one. That is, we can write “p implies q” as “not q implies not p” to get the equivalent claim:

If Y has property B then X has property A.

This rewriting is called the “contrapositive form” of the original statement. It’s not only easier to parse, but also probably easier to prove because we have something to grasp at from the beginning.

To the beginning mathematician, it may not be obvious that “if p then q” is equivalent to “if not q then not p” as logical statements. To show that they are requires a small detour into the idea of a “truth table.”

In particular, we have to specify what it means for “if p then q” to be true or false as a whole. There are four possibilities: p can be true or false, and q can be true or false. We can write all of these possibilities in a table.

p  q
T  T
T  F
F  T
F  F

If we were to complete this table for “if p then q,” we’d have to specify exactly which of the four cases correspond to the statement being true. Of course, if the p part is true and the q part is true, then “p implies q” should also be true. We have seen this already in proof by direct implication. Next, if p is true and q is false, then it certainly cannot be the case that truth of p implies the truth of q. So this would be a false statement. Our truth table so far looks like

p  q  p->q
T  T   T
T  F   F
F  T   ?
F  F   ?

The next question is what to do if the premise p of “if p then q” is false. Should the statement as a whole be true or false? Rather then enter a belated philosophical discussion, we will zealously define an implication to be true if its hypothesis is false. This is a well-accepted idea in mathematics called vacuous truth. And although it seems to make awkward statements true (like “if 2 is odd then 1 = 0″), it is rarely a confounding issue (and more often forms the punchline of a few good math jokes). So we can complete our truth table as follows

p q  p->q
T T   T
T F   F
F T   T
F F   T

Now here’s where contraposition comes into play. If we’re interested in determining when “not q implies not p” is true, we can add these to the truth table as extra columns:

p  q  p->q  not q   not p   not q -> not p
T  T   T      F       F           T
T  F   F      T       F           F
F  T   T      F       T           T
F  F   T      T       T           T

As we can see, the two columns corresponding to “p implies q” and “not q implies not p” assume precisely the same truth values in all possible scenarios. In other words, the two statements are logically equivalent.

And so our proof technique for contrapositive becomes: rewrite the statement in its contrapositive form, and proceed to prove it by direct implication.

## Examples and Exercises

Our first example will be completely straightforward and require nothing but algebra. Let’s show that the function $f(x) = 7x - 4$ is injective. Contrapositively, we want to prove that if $f(x) = f(x')$ then $x = x'$. Assuming the hypothesis, we start by supposing $7x - 4 = 7x' - 4$. Applying algebra, we get $7x = 7x'$, and dividing by 7 shows that $x = x’$ as desired. So $f$ is injective.

This example is important because if we tried to prove it directly, we might make the mistake of assuming algebra works with $\neq$ the same way it does with equality. In fact, many of the things we take for granted about equality fail with inequality (for instance, if $a \neq b$ and $b \neq c$ it need not be the case that $a \neq c$). The contrapositive method allows us to use our algebraic skills in a straightforward way.

Next let’s prove that the composition of two injective functions is injective. That is, if $f: X \to Y$ and $g: Y \to Z$ are injective functions, then the composition $gf : X \to Z$  defined by $gf(x) = g(f(x))$ is injective.

In particular, we want to prove that if $x \neq x'$ then $g(f(x)) \neq g(f(x'))$. Contrapositively, this is the same as proving that if $g(f(x)) = g(f(x'))$ then $x=x'$. Well by the fact that $g$ is injective, we know that (again contrapositively) whenever $g(y) = g(y')$ then $y = y'$, so it must be that $f(x) = f(x')$. But by the same reasoning $f$ is injective and hence $x = x'$. This proves the statement.

This was a nice symbolic proof, but we can see the same fact in a picturesque form as well:

A composition of two injections is an injection.

If we maintain that any two arrows in the diagram can’t have the same head, then following two paths starting at different points in $X$ will never land us at the same place in $Z$. Since $f$ is injective we have to travel to different places in $Y$, and since $g$ is injective we have to travel to different places in $Z$. Unfortunately, this proof cannot replace the formal one above, but it can help us understand it from a different perspective (which can often make or break a mathematical idea).

Expanding upon this idea we give the reader a challenge: Let $A, B, C$ be finite sets of the same size. Prove or disprove that if $f: A \to B$ and $g: B \to C$ are (arbitrary) functions, and if the composition $gf$ is injective, then both of $f, g$ must be injective.

Another exercise which has a nice contrapositive proof: prove that if $A,B$ are finite sets and $f:A \to B$ is an injection, then $A$ has at most as many elements as $B$. This one is particularly susceptible to a “picture proof” like the one above. Although the formal the formal name for the fact one uses to prove this is the pigeonhole principleit’s really just a simple observation.

Aside from inventing similar exercises with numbers (e.g., if $ab$ is odd then $a$ is odd or $b$ is odd), this is all there is to the contrapositive method. It’s just a direct proof disguised behind a fact about truth tables. Of course, as is usual in more advanced mathematical literature, authors will seldom announce the use of contraposition. The reader just has to be watchful enough to notice it.

Though we haven’t talked about either the real numbers $\mathbb{R}$ nor proofs of existence or impossibility, we can still pose this interesting question: is there an injective function from $\mathbb{R} \to \mathbb{N}$? In truth there is not, but as of yet we don’t have the proof technique required to show it. This will be our next topic in the series: the proof by contradiction.

Until then!

# Why there is no Hitchhiker’s Guide to Mathematics for Programmers

For those who aren’t regular readers: as a followup to this post, there are four posts detailing the basic four methods of proof, with intentions to detail some more advanced proof techniques in the future. You can find them on this blog’s primers page.

## Do you really want to get better at mathematics?

Remember when you first learned how to program? I do. I spent two years experimenting with Java programs on my own in high school. Those two years collectively contain the worst and most embarrassing code I have ever written. My programs absolutely reeked of programming no-nos. Hundred-line functions and even thousand-line classes, magic numbers, unreachable blocks of code, ridiculous code comments, a complete disregard for sensible object orientation, negligence of nearly all logic, and type-coercion that would make your skin crawl. I committed every naive mistake in the book, and for all my obvious shortcomings I considered myself a hot-shot programmer! At least I was learning a lot, and I was a hot-shot programmer in a crowd of high-school students interested in game programming.

Even after my first exposure and my commitment to get a programming degree in college, it was another year before I knew what a stack frame or a register was, two more before I was anywhere near competent with a terminal, three more before I fully appreciated functional programming, and to this day I still have an irrational fear of networking and systems programming (the first time I manually edited the call stack I couldn’t stop shivering with apprehension and disgust at what I was doing).

I just made this function call return to a *different* place than where it was called from.

In a class on C++ programming I was programming a Checkers game, and my task at the moment was to generate a list of all possible jump-moves that could be made on a given board. This naturally involved a depth-first search and a couple of recursive function calls, and once I had something I was pleased with, I compiled it and ran it on my first non-trivial example. Low and behold (even having followed test-driven development!), I was hit hard in the face by a segmentation fault. It took hundreds of test cases and more than twenty hours of confusion before I found the error: I was passing a reference when I should have been passing a pointer. In particular (and this is the aggravating part, as most programmers know), the fix required the change of about 4 characters. Twenty hours of work for four characters! Once I begrudgingly verified it worked (of course it worked, it was so obvious in hindsight), I promptly took the rest of the day off to play Starcraft.

Of course, as every code-savvy reader will agree, all of this drama is part of the process of becoming and strong programmer. One must study the topics incrementally, make plentiful mistakes and learn from them, and spend uncountably many hours in a state of stuporous befuddlement before one can be considered an experienced coder. This gives rise to all sorts of programmer culture, unix jokes, and reverence for the masters of C that make the programming community so lovely to be a part of. It’s like a secret club where you know all the handshakes. And should you forget one, a crafty use of awk and sed will suffice.

“Semicolons of Fury” was the name of my programming team in the ACM collegiate programming contest. We placed Cal Poly third in the Southern California Regionals, and in my opinion our success was due in large part to the dynamics of our team. I (center, in blue) have since gotten a more stylish haircut.

Now imagine someone comes along and says,

“I’m really interested in learning to code, but I don’t plan to write any programs and I absolutely abhor tracing program execution. I just want to use applications that others have written, like Chrome and iTunes.”

You would laugh at them! And the first thing that would pass through your mind is either, “This person would give up programming after the first twenty minutes,” or “I would be doing the world a favor by preventing this person from ever writing a program. This person belongs in some other profession.” This lies in stark opposition to the common chorus that everyone should learn programming. After all, it’s a constructive way to think about problem solving and a highly employable skill. In today’s increasingly technological world, it literally pays to know your your computer better than a web browser. (Ironically, I’m writing this on my Chromebook, but in my defense it has a terminal with ssh. Perhaps more ironically, all of my real work is done with paper and pencil.)

Unfortunately this sentiment is mirrored among most programmers who claim to be interested in mathematics. Mathematics is fascinating and useful and doing it makes you smarter and better at problem solving. But a lot of programmers think they want to do mathematics, and they either don’t know what “doing mathematics” means, or they don’t really mean they want to do mathematics. The appropriate translation of the above quote for mathematics is:

“Mathematics is useful and I want to be better at it, but I won’t write any original proofs and I absolutely abhor reading other people’s proofs. I just want to use the theorems others have proved, like Fermat’s Last Theorem and the undecidability of the Halting Problem.”

Of course no non-mathematician is really going to understand the current proof of Fermat’s Last Theorem, just as no fledgling programmer is going to attempt to write a (quality) web browser. The point is that the sentiment is in the wrong place. Mathematics is cousin to programming in terms of the learning curve, obscure culture, and the amount of time one spends confused. And mathematics is as much about writing proofs as software development is about writing programs (it’s not everything, but without it you can’t do anything). Honestly, it sounds ridiculously obvious to say it directly like this, but the fact remains that people feel like they can understand the content of mathematics without being able to write or read proofs.

I want to devote the rest of this post to exploring some of the reasons why this misconception exists. My main argument is that the reasons have to do more with the culture of mathematics than the actual difficulty of the subject. Unfortunately as of the time of this writing I don’t have a proposed “solution.” And all I can claim is a problem is that programmers can have mistaken views of what mathematics involves. I don’t propose a way to make mathematics easier for programmers, although I do try to make the content on my blog as clear as possible (within reason). I honestly do believe that the struggle and confusion builds mathematical character, just as the arduous bug-hunt builds programming character. If you want to be good at mathematics, there is no other way.

All I want to do with this article is to detail why mathematics can be so hard for beginners, to explain a few of the secret handshakes, and hopefully to bring an outsider a step closer to becoming an insider. So read on, and welcome to the community.

## Travelling far and wide

Perhaps one of the most prominent objections to devoting a lot of time to mathematics is that it can be years before you ever apply mathematics to writing programs. On one hand, this is an extremely valid concern. If you love writing programs and designing software, then mathematics is nothing more than a tool to help you write better programs.

But on the other hand, the very nature of mathematics is what makes it so applicable, and the only way to experience nature is to ditch the city entirely. Indeed, I provide an extended example of this in my journalesque post on introducing graph theory to high school students: the point of the whole exercise is to filter out the worldly details and distill the problem into a pristine mathematical form. Only then can we see its beauty and wide applicability.

Here is a more concrete example. Suppose you were trying to encrypt the contents of a message so that nobody could read it even if they intercepted the message in transit. Your first ideas would doubtlessly be the same as those of our civilization’s past: substitution ciphers, Vigenere ciphers, the Enigma machine, etc. Regardless of what method you come up with, your first thought would most certainly not be, “prime numbers so big they’ll make your pants fall down.” Of course, the majority of encryption methods today rely on very deep facts (or rather, conjectures) about prime numbers and other mathematical objects (“group presentations so complicated they’ll orient your Mobius band,” anyone?). But it took hundreds of years of number theory to get there, and countless deviations into other fields and dead-ends.

Of course there are other examples much closer to contemporary fashionable programming techniques. One such example is boosting. While we have yet to investigate boosting on this blog, the basic idea is that one can combine a bunch of algorithms which perform just barely better than 50% accuracy, and collectively they will be arbitrarily close to perfect. In a field dominated by practical applications, this result is purely the product of mathematical analysis.

And of course boosting in turn relies on the mathematics of probability theory, which in turn relies on set theory and measure theory, which in turn relies on real analysis, and so on. One could get lost for a lifetime in this mathematical landscape! And indeed, the best way to get a good view of it all is to start at the bottom. To learn mathematics from scratch. The working programmer simply doesn’t have time for that.

## What is it really, that people have such a hard time learning?

Most of the complaints about mathematics come understandably from notation and abstraction. And while I’ll have more to say on that below, I’m fairly certain that the main obstacle is a familiarity with the basic methods of proof.

While methods of proof are semantical by nature, in practice they form a scaffolding for all of mathematics, and as such one could better characterize them as syntactical. I’m talking, of course, about the four basics: direct implication, proof by contradiction, contrapositive, and induction. These are the loops, if statements, pointers, and structs of rigorous argument, and there is simply no way to understand the mathematics without a native fluency in this language.

The “Math Major Sloth” is fluent. Why aren’t you?

So much of mathematics is built up by chaining together a multitude of absolutely trivial statements which are amendable to proof by the basic four. I’m not kidding when I say they are absolutely trivial. A professor of mine once said,

If it’s not completely trivial, then it’s probably not true.

I can’t agree more with this statement. Of course, there are many sophisticated proofs in mathematics, but an overwhelming majority of (very important) facts fall in the trivial category. That being said, trivial can be sometimes relative to one’s familiarity with a subject, but that doesn’t make the sentiment any less right. Drawing up a shopping list is trivial once you’re comfortable with a pencil and paper and you know how to write (and you know what the words mean). There are certainly works of writing that require a lot more than what it takes to write a shopping list. Likewise, when we say something is trivial in mathematics, it’s because there’s no content to the proof outside of using definitions and a typical application of the basic four methods of proof. This is the “holding a pencil” part of writing a shopping list.

And as you probably know, there are many many more methods of proof than just the basic four. Proof by construction, by exhaustion, case analysis, and even picture proofs have a place in all fields of mathematics. More relevantly for programmers, there are algorithm termination proofs, probabilistic proofs, loop invariants to design and monitor, and the ubiquitous NP-hardness proofs (I’m talking about you, Travelling Salesman Problem!). There are many books dedicated to showcasing such techniques, and rightly so. Clever proofs are what mathematicians strive for above all else, and once a clever proof is discovered, the immediate first step is to try to turn it into a general method for proving other facts. Fully flushing out such a process (over many years, showcasing many applications and extensions) is what makes one a world-class mathematician.

An entire book dedicated to the probabilistic method of proof, invented by Paul Erdős and sown into the soil of mathematics over the course of his lifetime.

Another difficulty faced by programmers new to mathematics is the inability to check your proof absolutely. With a program, you can always write test cases and run them to ensure they all pass. If your tests are solid and plentiful, the computer will catch your mistakes and you can go fix them.

There is no corresponding “proof checker” for mathematics. There is no compiler to tell you that it’s nonsensical to construct the set of all sets, or that it’s a type error to quotient a set by something that’s not an equivalence relation. The only way to get feedback is to seek out other people who do mathematics and ask their opinion. In solo, mathematics involves a lot of backtracking, revising mistaken assumptions, and stretching an idea to its breaking point to see that it didn’t even make sense to begin with. This is “bug hunting” in mathematics, and it can often completely destroy a proof and make one start over from scratch. It feels like writing a few hundred lines of code only to have the final program run “rm -rf *” on the directory containing it. It can be really. really. depressing.

It is an interesting pedagogical question in my mind whether there is a way to introduce proofs and the language of mature mathematics in a way that stays within a stone’s throw of computer programs. It seems like a worthwhile effort, but I can’t think of anyone who has sought to replace a classical mathematics education entirely with one based on computation.

## Mathematical syntax

Another major reason programmers are unwilling to give mathematics an honest effort is the culture of mathematical syntax: it’s ambiguous, and there’s usually nobody around to explain it to you. Let me start with an example of why this is not a problem in programming. Let’s say we’re reading a Python program and we see an expression like this:

foo[2]

The nature of (most) programming languages dictates that there are a small number of ways to interpret what’s going on in here:

1. foo could be a list/tuple, and we’re accessing the third element in it.
2. foo could be a dictionary, and we’re looking up value associated to the key 2.
3. foo could be a string, and we’re extracting the third character.
4. foo could be a custom-defined object, whose __getitem__ method is defined somewhere else and we can look there to see exactly what it does.

There are probably other times this notation can occur (although I’d be surprised if number 4 didn’t by default capture all possible uses), but the point is that any programmer reading this program knows enough to intuit that square brackets mean “accessing an item inside foo with identifier 2.” Part of the reasons that programs can be very easy to read is precisely because someone had to write a parser for a programming language, and so they had to literally enumerate all possible uses of any expression form.

The other extreme is the syntax of mathematics. The daunting fact is that there is no bound to what mathematical notation can represent, and much of mathematical notation is inherently ad hoc. For instance, if you’re reading a math paper and you come across an expression that looks like this

$\delta_i^j$

The possibilities of what this could represent are literally endless. Just to give the unmathematical reader a taste: $\delta_i$ could be an entry of a sequence of numbers of which we’re taking arithmetic $j^\textup{th}$ powers. The use of the letter delta could signify a slightly nonstandard way to write the Kronecker delta function, for which $\delta_i^j$ is one precisely when $i=j$ and zero otherwise. The superscript $j$ could represent dimension. Indeed, I’m currently writing an article in which I use $\delta^k_n$ to represent $k$-dimensional simplex numbers, specifically because I’m relating the numbers to geometric objects called simplices, and the letter for those is  a capital $\Delta$. The fact is that using notation in a slightly non-standard way does not invalidate a proof in the way that it can easily invalidate a program’s correctness.

What’s worse is that once mathematicians get comfortable with a particular notation, they will often “naturally extend” or even silently drop things like subscripts and assume their reader understands and agrees with the convenience! For example, here is a common difficulty that beginners face in reading math that involves use of the summation operator. Say that I have a finite set of numbers whose sum I’m interested in. The most rigorous way to express this is not far off from programming:

Let $S = \left \{ x_1, \dots, x_n \right \}$ be a finite set of things. Then their sum is finite:

$\displaystyle \sum_{i=1}^n x_i$

The programmer would say “great!” Assuming I know what “+” means for these things, I can start by adding $x_1 + x_2$, add the result to $x_3$, and keep going until I have the whole sum. This is really just a left fold of the plus operator over the list $S$.

But for mathematicians, the notation is far more flexible. For instance, I could say

Let $S$ be finite. Then $\sum_{x \in S} x$ is finite.

Things are now more vague. We need to remember that the $\in$ symbol means “in.” We have to realize that the strict syntax of having an iteration variable $i$ is no longer in effect. Moreover, the order in which the things are summed (which for a left fold is strictly prescribed) is arbitrary. If you asked any mathematician, they’d say “well of course it’s arbitrary, in an abelian group addition is commutative so the order doesn’t matter.” But realize, this is yet another fact that the reader must be aware of to be comfortable with the expression.

But it still gets worse.

In the case of the capital Sigma, there is nothing syntactically stopping a mathematician from writing

$\displaystyle \sum_{\sigma \in \Sigma} f_{\Sigma}(\sigma)$

Though experienced readers may chuckle, they will have no trouble understanding what is meant here. That is, syntactically this expression is unambiguous enough to avoid an outcry: $\Sigma$ just happens to also be a set, and saying $f_{\Sigma}$ means that the function $f$ is constructed in a way that depends on the choice of the set $\Sigma$. This often shows up in computer science literature, as $\Sigma$ is a standard letter to denote an alphabet (such as the binary alphabet $\left \{ 0,1 \right \}$).

One can even take it a step further and leave out the set we’re iterating over, as in

$\displaystyle \sum_{\sigma} f_{\Sigma}(\sigma)$

since it’s understood that the lowercase letter ($\sigma$) is usually an element of the set denoted by the corresponding uppercase letter ($\Sigma$). If you don’t know greek and haven’t seen that coincidence enough times to recognize it, you would quickly get lost. But programmers must realize: this is just the mathematician’s secret handshake. A mathematician would be just as bewildered and confused upon seeing some of the pointer arithmetic hacks C programmers invent, or the always awkward infinite for loop.

for (;;) {
;
}

And once the paper you’re reading is over, and you start reading a new paper, chances are their conventions and notation will be ever-so-slightly different, and you have to keep straight what means what. It’s as if the syntax of a programming language changed depending on who was writing the program!

Perhaps understandably, the frustration that most mathematicians feel when dealing with varying syntax across different papers and books is collectively called “technicalities.” And the more advanced the mathematics becomes, the ability to fluidly transition between high-level intuition and technical details is all but assumed.

The upshot of this whole conversation is that the reader of a mathematical proof must hold in mind a vastly larger body of absorbed (and often frivolous) knowledge than the reader of a computer program.

At this point you might see all of this as my complaining, but in truth I’m saying this notational flexibility and ambiguity is a benefit. Once you get used to doing mathematics, you realize that technical syntax can make something which is essentially simple seem much more difficult than it is. In other words, we absolutely must have a way to make things completely rigorous, but in developing and presenting proofs the most important part is to make the audience understand the big picture, see intuition behind the symbols, and believe the proofs. For better or worse, mathematical syntax is just a means to that end, and the more abstract the mathematics becomes, the more flexiblility mathematicians need to keep themselves afloat in a tumultuous sea of notation.

## You’re on your own, unless you’re around mathematicians

That brings me to my last point: reading mathematics is much more difficult than conversing about mathematics in person. The reason for this is once again cultural.

Imagine you’re reading someone else’s program, and they’ve defined a number of functions like this (pardon the single-letter variable names, I’m just don’t like “foo” and “bar”).

def splice(L):
...

def join(*args):
...

def flip(x, y):
...

There are two parts to understanding how these functions work. The first part is that someone (or a code comment) explains to you in a high level what they do to an input. The second part is to weed out the finer details. These “finer details” are usually completely spelled out by the documentation, but it’s still a good practice to experiment with it yourself (there is always the possibility for bugs, of course).

In mathematics there is no unified documentation, just a collective understanding, scattered references, and spoken folk lore. You’re lucky if a textbook has a table of notation in the appendix. You are expected to derive the finer details and catch the errors yourself. Even if you are told the end result of a proposition, it is often followed by, “The proof is trivial.” This is the mathematician’s version of piping output to /dev/null, and literally translates to, “You’re expected to be able to write the proof yourself, and if you can’t then you’re not ready to continue.”

Indeed, the opposite problems are familiar to a beginning programmer when they aren’t in a group of active programmers. Why is it that people give up or don’t enjoy programming? Is it because they have a hard time getting honest help from rudely abrupt moderators on help websites like stackoverflow? Is it because often when one wants to learn the basics, they are overloaded with the entirety of the documentation and the overwhelming resources of the internet and all its inhabitants? Is it because compiler errors are nonsensically exact, but very rarely helpful? Is it because when you learn it alone, you are bombarded with contradicting messages about what you should be doing and why (and often for the wrong reasons)?

All of these issues definitely occur, and I see them contribute to my students’ confusion in my introductory Python class all the time. They try to look on the web for information about how to solve a very basic problem, and they come back to me saying they were told it’s more secure to do it this way, or more efficient to do it this way, or that they need to import something called the “heapq module.” When really the goal is not to solve the problem in the best way possible or in the shortest amount of code, but to show them how to use the tools they already know about to construct a program that works. Without a guiding mentor it’s extremely easy to get lost in the jungle of people who think they know what’s best.

As far as I know there is no solution to this problem faced by the solo programming student (or the solo anything student). And so it stands for mathematics: without others doing mathematics with you, its very hard to identify your issues and see how to fix them.

## Proofs, Syntax, and Community

For the programmer who is truly interested in improving their mathematical skills, the first line of attack should now be obvious. Become an expert at applying the basic methods of proof. Second, spend as much time as it takes to clear up what mathematical syntax means before you attempt to interpret the semantics. And finally, find others who are interested in seriously learning some mathematics, and work on exercises (perhaps a weekly set) with them. Start with something basic like set theory, and write your own proofs and discuss each others’ proofs. Treat the sessions like code review sessions, and be the compiler to your partner’s program. Test their arguments to the extreme, and question anything that isn’t obvious or trivial. It’s not uncommon for easy questions with simple answers and trivial proofs to create long and drawn out discussions before everyone agrees it’s obvious. Embrace this and use it to improve.

Short of returning to your childhood and spending more time doing recreational mathematics, that is the best advice I can give.

Until next time!

# P vs. NP, A Primer (And a Proof Written in Racket)

## Decidability Versus Efficiency

In the early days of computing theory, the important questions were primarily about decidability. What sorts of problems are beyond the power of a Turing machine to solve? As we saw in our last primer on Turing machines, the halting problem is such an example: it can never be solved a finite amount of time by a Turing machine. However, more recently (in the past half-century) the focus of computing theory has shifted away from possibility in favor of determining feasibility. In particular, we want to know how fast we can solve a particular problem on a Turing machine. The goal is to design efficient algorithms for important real-world problems, and know when it is impossible to design a more efficient algorithm than what we already have. From a mathematical perspective, that often means we would like to find the boundary problems: those which are “just barely” too hard to admit an efficient algorithm. Of course, this requires us to first define what it means for an algorithm to be “efficient.”

These questions are intimately intertwined with big-O analysis, which we presented in our very first primer on this blog, and the definitions we investigate in the sequel require fluency in such notation. That being said, we try to give colloquial analogues to the most important ideas.

## The Class P

The general goal here is to describe classes of problems that roughly correspond to how efficiently one can solve them. The first such class (and by all rights, the smallest class we will consider) is called P, which stands for “polynomial.”

Definition: P is the set of languages which can be decided by a Turing machine in $O(n^k)$ for some $k$, where $n$ is the input size of the problem. In the common notation:

$\displaystyle \textup{P} = \bigcup_{k \in \mathbb{N}} \textup{TIME}(n^k)$

Where $\textup{TIME}(n^k)$ is the set of problems decidable in $O(n^k)$ time. Recall that a language is any set of strings (any subset of $\Sigma^*$, as we have been investigating in our past computing theory primers and our post about metrics on words).

We don’t have too much to say about P itself, except that many common problems are in P. The real difficulty is in proving that some problem does not lie in P.

Here is an example this author has used to explain to elementary school children what’s it’s like to not be in P. Consider a deck of 52 playing cards, and your goal is to sort the cards by suit and then by rank (as it is when you open a fresh pack of cards). There is one obvious way to do this that children teach themselves quite readily: just look for the Ace of spades, put it on the bottom of a new stack. Then find the two of spades, and continue in this manner until the deck is sorted. This algorithm takes $O(n^2)$, since in the very worst case the deck is sorted in reverse order, and if you’re too dim to realize it, you look through the entire deck at every step.

Of course, there are much faster ways to do it than the approach we detailed above, and it is a well-known theorem that sorting is $\Theta(n \log(n))$. That is, there is both a lower bound and an upper bound of $n \log(n)$ for the general sorting problem.

Now let’s investigate a ridiculous algorithm for sorting cards. What if at each step, we shuffled the deck, and then checked to see if it magically became sorted due to our shuffling. Forgetting for a moment that randomness comes into play, we would find the correct sorting once in $52!$ trials, and it would take 52 steps to check if it was sorted. For a deck of $n$ cards, this would hence take time $O(n! \cdot n)$, which would take so long even for 52 cards, that the sun would go out before you could finish. But what if we didn’t know a better way?

For an example of a problem where we don’t know a better way, consider the problem of graph coloring. Is there a Turing machine which, when given as input a graph $G$ and a number $k$, can determine in polynomial time whether $G$ admits a $k$-coloring? There is an obvious algorithm to decide the problem: there are only finitely many choices of assignments of vertices to colors, and we can simply check them all. In fact, there are $k^n$ of them, where $n$ is the number of vertices of $G$.

Unfortunately, this algorithm is not polynomial in runtime: we would have to enumerate all of the different colorings, and check whether each is a valid coloring; this process is easily seen to be $o(nk^n)$, which is far from polynomial for arbitrary $k > 2$.

But the true challenge is this: how do we know there is no “faster” algorithm? Of all the crazy wild ideas one could have to solve a problem, how do we know none can be so clever that they reduce the running time to be a polynomial in $n$?

In fact, we don’t know for sure that there isn’t! This is the heart of the open problem which is succinctly called “P vs. NP”.

## The Class NP

While P is a class of “easy” problems from one perspective (problems that can be solved quickly, even in the worst case), being a member of NP is another measure of “easiness,” but from a different perspective.

Definition: NP is the class of problems which have a polynomial-time verifier. That is, given an input $w$ to a problem $A \in \textup{NP}$, there is a string $c$ called a certificate and a Turing machine $M$ for which $M$ verifies that $c$ proves $w \in A$ and runs in polynomial time.

This definition is a bit hard to swallow, but examples clarify the situation greatly. For the problem of graph coloring, we note that a certificate would simply be a list of pairs $(v_i, n_i)$ which give a coloring of the graph $G$. It is quite trivial to define a polynomial-time Turing machine that ensures the coloring of $G$ is valid. Hence, graph coloring is in NP. This is the case with most problems in NP: a proof that $w \in A$ is hard to find, but easy to verify once you have it.

There is another definition of NP which is often useful, and it gives a reason for prefixing the “polynomial” part of P with an N.

Definition: NP is the set of problems which are solvable by a nondeterministic Turing machine in polynomial time.

For the motivating picture behind “nondeterministic Turing machines,” we turn to an analogy. Imagine you have an infinite number of computers running in parallel, and they can communicate instantaneously. What sorts of problems could we solve in polynomial time with such a machine? We could certainly solve graph coloring: simply have each machine try one of the $k^n$ different colorings, and have the entire machine halt when one coloring is found (or when they all finish, we can safely reject that the graph is k-colorable).

So we can reformulate the definition in set notation as:

$\displaystyle \textup{NP} = \bigcup_{k \in \mathbb{N}} \textup{NTIME}(n^k)$

Here $\textup{NTIME}(f(n))$ is the set of all languages which can be solved in time $f(n)$ on a nondeterministic Turing machine.

In other words, “NP” stands for “nondeterministic polynomial-time” problems. And in fact this definition is equivalent to existence of a polynomial-time verifier, as in our first definition. To see this, note that we can construct a nondeterministic machine that enumerates all possible certificates (lending to our analogy, one on each of the infinitely numerous parallel computers), and then tests them using the polynomial-time verifier. Since each branch uses a polynomial-time algorithm, the whole Turing machine runs in (nondeterministic) polynomial time. On the other hand, if some branch of computation halts in deterministic time, then the sequence of configurations of the tape for that branch has polynomial length, and so a Turing machine can simply run that computation to ensure it follows the rules of Turing machines and ends in an accepting state. This clearly represents a certificate.

One might ask why we care about infinitely parallel computers. In reality, we can only have finitely many computations going at once, so why bother with this silly mathematical uselessness? As it often turns out in mathematics, it is useful to think about such structures simply because they capture the essence of what we wish to study in a concise and formal manner. For complexity theory, nondeterministic Turing machines capture the level of complexity we wish to understand: in a sense, it’s the ”next level up” from polynomial-time decidable problems.

## K-Clique and 3-Sat

We have two more examples of problems in NP which will be important later in this post: the problems of 3-Satisfiability and finding a k-Clique.

Definition: Let $\varphi(x_1, \dots, x_n)$ be a propositional formula in $n$ boolean variables. $\varphi$ is satisfiable if there exists an assignment of the variables $x_1, \dots, x_n \to \left \{ \textup{T}, \textup{F} \right \}$ that makes $\varphi$ true.

For example, we could have the formula

$\displaystyle (x \vee y \vee \overline{z}) \wedge (\overline{x} \vee \overline{y} \vee z)$

And this is satisfiable by the assignment of $x = \textup{T}, y = \textup{F}, z = \textup{T}$.

It should be obvious to the reader now that determining whether a formula is satisfiable is in NP: a certificate is just a list of variables mapping to true or false, and checking that the formula is satisfied by a given assignment can be done in polynomial time.

In 3-Satisfiability (usually shortened to 3-Sat), we simply restrict the form of $\varphi$ and ask the same question. The form is called conjunctive normal form, and colloquially it is a bunch of clauses joined with “and,” where each clause is a bunch of variables (and their negations) connected by “or”. Moreover, the “3″ in 3-Sat requires that each clause contain exactly three literals. For example, the equation given above would be a valid input to 3-Sat.

k-Clique, on the other hand, is a question about graphs. Given a graph $G$ and a positive integer $k$, determine whether $G$ contains a complete subgraph of $k$ vertices. (In a complete graph, there is an edge connecting each pair of vertices; the name “clique” is motivated by the party problem, in the sense that there is a “clique” of friends at the party who are all mutually friends with each other.)

As expected, a certificate that a graph has a $k$-Clique is just a list of the vertices in the clique, and checking that all pairs of vertices listed have a connecting edge is easily seen to take $O(|G|k^2) = O(n^3)$, which is polynomial in the size of the input.

## NP-Completeness

As it turns out, the problems of 3-Satisfiability and k-Clique are quite special (as is graph coloring). They belong to a special sub-class of NP called NP-complete. Before we can define what NP-complete means, we have to be able to compare problems in NP.

Definition: Given two languages $A, B$, we say $A \leq_p B$, or $A$ is polynomial-time reducible to $B$ if there exists a computable function $f: \Sigma^* \to \Sigma^*$ such that $w \in A$ if and only if $f(w) \in B$, and $f$ can be computed in polynomial time.

We have seen this same sort of idea with mapping reducibility in our last primer on Turing machines. Given a language $B$ that we wanted to show as undecidable, we could show that if we had a Turing machine which decided $B$, we could solve the halting problem. This is precisely the same idea: given a solution for $B$ and an input for $A$, we can construct in polynomial time an input for $B$, use a decider for $B$ to solve it, and then output accordingly. The only new thing is that the conversion of the input must happen in polynomial time.

Of course, this discussion was the proof of a clarifying lemma:

Lemma: If $B \in \textup{P}$ and $A \leq_p B$, then $A \in \textup{P}$.

The proof is immediate, as $B$ can be solved in polynomial time, and the conversion function runs in polynomial time. We leave the construction of an explicit Turing machine to decide $A$ as an exercise to the reader.

To phrase things more colloquially, $A \leq_p B$ is true if $A$ is an easier” problem than $B$, hence justifying the less-than notation.

And now for the amazing part: there are problems in NP which are harder than all other problems in NP.

Definition: A language $A \in \textup{NP}$ is called NP-complete if for all problems $B \in \textup{NP}$, $B \leq_p A$.

In other words, all problems in NP reduce to an NP-complete problem in polynomial time. In fact, we get another nice fact about NP-completeness that mirrors our observation about P above:

Lemma: If $A$ is NP-complete and $B \in \textup{NP}$ with $A \leq_p B$, then $B$ is NP-complete.

Obviously the composition of two polynomial-time reductions is a polynomial-time reduction, so we can conclude that all problems in NP which reduce to $A$ also reduce to $B$.

The cautious reader should be rather hesitant to believe that NP-complete problems should even exist. There is no reason we can’t come up with harder and harder problems, so why should there be a point after which we can’t quickly verify a solution?

Well, Stephen Cook proved in 1971 that there is an NP-complete problem, and shortly thereafter many more were found. Today, there are thousands of known NP-complete problems.

Perhaps unsurprisingly, Cook’s original NP-complete problem was N-Satisfiability (i.e., 3-Satisfiability without a constraint on the number of clauses or the form). Unfortunately the proof is quite difficult. We point the reader to the relevant Wikipedia page, and briefly mention the outline of a proof.

Given a nondeterministic polynomial-time Turing machine, we can bound the number of parallel computations and the length of each computation by $n^k$ for some fixed $k$. Then we create a $n^k$ by $n^k$ table of the configurations of the Turing machine (the i,j cell for the i-th branch of computation and the j-th step). From this, we can construct a monstrously enormous (yet polynomial in size) formula which has a satisfying assignment if and only if the Turing machine halts on some branch of computation in an accept state. Here is a table of the formulas needed to do this. In short, the formula traces the computation of the machine at each step, and ensures the transition function is honestly followed, the tape is reliably updated, and the head of each tape moves in the correct direction.

The reason we say it’s unsurprising that Satisfiability is NP-complete is because it’s commonly believed that every aspect of mathematics boils down to pure logic, although the process of doing so is entrenched in gory detail every step of the way. So it’s understandable that all problems in NP reduce to a problem about logic which is also in NP. We stipulate that other complexity classes likely have “complete” problems that are essentially questions about logical formulas.

## A New Way to Find NP-Complete Problems

Now that we have established the NP-completeness of Satisfiability, we can do the same for other problems by reduction from a known NP-complete problem. First, we claim that 3-Satisfiability is NP-complete, and we leave the proof as an exercise to the reader (hint: reduce from regular Satisfiability by putting the formula into the right form).

Now given that 3-Sat is NP-complete, we will prove that k-Clique is NP-complete, by reduction from 3-Sat (in fact our conversion function will work for any formulas in conjunctive normal form, but 3 is enough).

Theorem: k-Clique is NP-complete.

Proof. Given a formula $\varphi$ in conjunctive normal form, we construct an instance of k-Clique as follows. First, let $k$ be the number of clauses in $\varphi$. Construct a graph $G_{\varphi}$ by creating a vertex for each literal term in $\varphi$, and (to help visualization) organize them into columns by their originating clause, and label the vertex with its corresponding literal. Introduce an edge connecting two terms $a, b$ in different columns when the formula $a \wedge b$ is not a contradiction. In other words, it cannot be of the form $x \wedge \overline{x}$ for some variable $x$.

As a concrete example, the formula

$\displaystyle (x \vee \overline{y} \vee z) \wedge (\overline{x} \vee \overline{z} \vee w)$

converts to the graph

We claim that $\varphi$ has a satisfying assignment of variables if and only if $G_{\varphi}$ has a k-clique. Supposing there is a valid assignment of variables in $\varphi$, then there must be one variable in each clause which is true (and hence $k$ variables). This translates to $k$ vertices in $G_{\varphi}$, one vertex in each column which is true, and none of these vertices conflict with each other, so $G_{\varphi}$ has an edge connecting each pair of the $k$ vertices. Conversely, suppose $G_{\varphi}$ has a k-clique. By our construction, two edges in the same column cannot be connected by and edge, and hence this k-clique must have one vertex in every column. If the vertex is labeled with a negation, assign it to the value $F$ (so that the literal evaluates to true), and otherwise assign it the value $T$. This gives a satisfying assignment of the variables of $\varphi$, since each clause will evaluate to true under this assignment.

The final part of the proof is that the conversion function runs in polynomial time, and we claim this is obvious from the construction: if $\varphi$ has $n$ literals, then we create $O(n)$ vertices and $O(n^2)$ edges. The creation of each vertex and edge can be done in constant time, as can the verification that two literals do not conflict. $\square$

Of course, this is a question about the possibilities of computers. Instead of giving a theoretical proof, why not just write a program to compute the conversion? Well we did just that, and the main body of the code turned out to be quite tidy:

;; n-sat-to-clique: formula -> (listof edge)
;; transform an input to n-sat to an input for clique
;; assume the input expression is in CNF, and that
(define (n-sat-to-clique expr)
(let* ([conjuncts (∧-conjuncts expr)]
[columns (map (λ (x) (∨-disjuncts x)) conjuncts)]
[labeled-columns (label-graph columns 1)]
[possible-edges (all-pairs-in-distinct-lists labeled-columns)])
(list->set (filter no-conflict? possible-edges))))

We have a data structure for a general formula (and provide a function to compute the conjunctive normal form of any expression), and a data structure for a graph (essentially, a list of pairs of labelled vertices), and so the process of checking all possible edges, and filtering out those which have no conflict, clearly takes $O(n^2)$ time.

The rest of the code required to run the function above is available on this blog’s Google code page.

## Other NP-complete Problems, and P =? NP

In the real world NP-complete problems show up everywhere. Application domains include cryptography, financial portfolio and investment management, scheduling and operation dynamics, packing problems, chem- and bioinformatics, guarding art galleries, circuit design, compiler design, and even modelling social networks.

There are even many questions one can ask about games that turn out to be NP-complete in complexity. For instance, many questions about the classic game of Tetris are NP-complete, along with Minesweeper, FreeCell, Lemmings, Battleship, and Mastermind.

Now the big unsolved question is does P = NP? In other words, can any of these seemingly hard problems be solved quickly? The simple fact is, if the answer is yes to one such problem, then the answer is yes not only to all NP-complete problems, but to all problems in NP (as we saw in our lemma earlier). This is the heart of the million-dollar question that has been crowned the most difficult open problem in computer science to date. Almost everyone agrees that P and NP should not be equal, but nobody can prove it.

Of course, common people love to talk about P and NP because of all of the wild things that would happen if we suddenly discovered that P = NP. All widely used security systems would fail, internet transactions would no longer be safe, efficiency in industry would increase by orders of magnitude, we’d unlock the secrets of the human genome, we’d quickly solve open mathematical problems and find staggeringly ginormicon primes (ginormicon = gigantic + enormous + decepticon), governments will topple, rioting in the streets, the artificial intelligence singularity will occur, etc., etc.

But all of this is just media hype. The likely reality is that some problems are simply too hard to be solved in polynomial time in general, just as there are probably problems which have no polynomial-time verifiers (i.e., problems outside of NP), excluding the trivial problems which are undecidable. In the end, it’s just a matter of time until mathematics sorts everything out, or proves that it’s impossible to do so. Indeed, just two years ago this author remembers waking up to the news that there was a 100 page proof that P is not equal to NP, and moreover that Stephen Cook himself considered it a serious attempt. Unfortunately it turned out to have irreparable flaws, but failure made it no less exciting: this is how great mathematics is made.

On the other hand, there are cranks out there who have, for the last few years, been convinced that they have proved P = NP, but are ignorant of their own flaws and the rest of the world’s criticism. May Gauss have mercy on their mathematical souls.

Until next time!

# Busy Beavers, and the Quest for Big Numbers

## Finding Bigger Numbers, a Measure of Human Intellectual Progress

Before we get into the nitty gritty mathematics, I’d like to mirror the philosophical and historical insights that one can draw from the study of large numbers.

That may seem odd at first. What does one even mean by “studying” a large number? Of course, I don’t mean we stare at the number 1,000,000,000,000, which is quite large, and wonder how mankind can benefit from its elusive properties. What I really mean is that scientific and mathematical discoveries are very closely tied in our collective ability to describe large numbers.

That may seem even odder, but let’s enjoy a short historical digression to illustrate the point. Note that much of this historical information is borrowed from an essay by Scott Aaronson. He gives a wonderfully entertaining essay on the topic of Busy Beavers, but where he assumes the reader has no background knowledge of computing theory, we will assume the reader is familiar with the relevant computational topics in our primers.

In the age of the Greeks, it was generally believed that some quantities were beyond counting. Things like the number of grains of sand in the desert were forfeit to the realm of “infinity.” But in the third century B.C.E., Archimedes recognized that they weren’t beyond counting. And in his treatise The Sand Reckoner, he developed a rudimentary notion of exponentials, and was able to provide an upper bound on such mysterious numbers:

There are some [...] who think that the number of the sand is infinite in multitude [...] again there are some who, without regarding it as infinite, yet think that no number has been named which is great enough to exceed its multitude [...] But I will try to show you [numbers that] exceed not only the number of the mass of sand equal in magnitude to the earth [...] but also that of a mass equal in magnitude to the universe.

He proceeded to give an upper bound on the number of grains of sand needed to fill the universe: $10^{63}$. Now this was a quite large number, and certainly beyond most people’s ability to visualize in quantity. But by the time Arabic numerals and algebra became common worldly knowledge in the Middle Ages, exponentiation became a paradigm trivially expressed, allowing people to write such large numbers as $10^{10^{10}}$ with the same ease we do today. These sorts of counting exercises became the topic of mathematical folklore (think of the tale of rice filling a chessboard), and they run amok in contemporary discussions of finance, statistics, physics, economics, and computer science.

Let’s take a moment to investigate exactly how ridiculously huge exponentials are. And in spite of the awe we show for such gargantuan quantities, we foreshadow that exponents are far from the biggest numbers we can grapple.

Let’s take a large digit, say, 9, and repeat it a thousand times to make a number. By all rights, this is a big number! It has a whopping one thousand digits, and each decimal place is as large as it can be. Now, consider the meager $9^9$. Given a few annoying minutes, we could actually perform nine multiplications and get this number exactly; we’ll save you the trouble, it’s 387,420,489, a number with nine digits. Now, let’s inspect $9^{9^9}$. This number, requiring us to write only one more digit on the page, is $9^{387,420,489}$, a number with a staggering 369,693,100 decimal digits! Raising 9 to this power once more, to get $9^{9^{9^9}}$ is already too large for Wolfram Alpha to give much information about.

Now, come the 20th century we see two big jumps in the expression of large numbers. The first is the Ackermann Function. For the purposes of this post, we can think of it as simply a sequence of numbers, which is constructed as follows. The first number in the sequence, $A(1)$, is simply $1+1$. The next is $A(2) = 2*2$, then $A(3) = 3^3$, and so on. Of course, defining “and so on” is tough, because we don’t have commonly known names for such operations. The fourth number in this sequence $A(4)$, is 4 raised to the power of itself four times. I.e., $A(4) = 4^{4^{4^4}}$. We call this operation a tetration of the number 4 to the height of 4, and don it with the notation $4 \uparrow \uparrow 4$. This isn’t so hard to visualize, but while it’s a huge number in itself (it has $10^{154}$ decimal digits), the best is yet to come. The fifth element of this sequence, $A(5)$, is the nested tetration of 5 with itself 5 times. In other words, it is

$A(5) = 5 \uparrow \uparrow (5 \uparrow \uparrow (5 \uparrow \uparrow (5 \uparrow \uparrow 5)))$

For convenience, we give this the notation $5 \uparrow \uparrow \uparrow 5$, and call it pentation. Of course, $A(6)$ would then be 6 hexated to the height of 6, and so on forever. The brave reader will take this to its extreme, and imagine how huge $A(A(5))$ is… Mind blowing.

Of course, one might ask: where the hell did this beast come from? What are Ackermann numbers good for anyway? The Ackermann Function and it’s associated sequence, it turns out, serve as important counterexamples in the theory of computable functions. The familiar reader will recognize the claim that the Ackermann Function is computable, but not primitive recursive. For the rest of the world, here’s a historical explanation of its origin.

Alan Turing, the inventor of the Turing machine.

Back in the days before computers (the early 20th century), mathematicians were fiercely interested in the theory of computing things. The common visitor to this blog will recall our primers on Finite Automata, Turing Machines, and such things as the Game of Life. It might not be a surprise to hear that there are many many different models of computation, and before real computers actually came along it was the job of mathematicians to compare them, sorting out which were “more useful” or “more expressive” than others. In the end, the Turing machine model was actually used to create rudimentary computers, which have since evolved into the supercomputers and iPads we have today. The mathematicians exploring computation during this era were at the forefront of a technological and intellectual revolution. On their shoulders, humanity entered the so-called Age of Information.

Perhaps the original such study of computation was the idea of a computable function of the natural numbers. Given a set of elementary functions which are axiomatically defined to be computable, and a way of combining these elementary functions to make more complicated functions, mathematicians constructed a huge class of functions that were computable. For instance, the function which adds two numbers together is computable; we all learned a nice algorithm to perform addition in grade school. The definitions and axioms were chosen in order to affirm such easy tasks, and then mathematicians explored the axioms to see how far they could push them. This line of reasoning would bring one to the ultimate quest: for a given model of computation, figure out what sorts of things are not computable.

Of course, even without a firm foundation in the theory of computable functions, we would guess that the sequence of Ackermann numbers is computable. In fact, using the model of Turing machines (which happens to be equivalent to the theory of computable functions, insofar as the tasks it deems computable) we can prove it so. We have unbounded space and unbounded time, and we know for a fact that all of these mathematical operations boil down to repeated multiplication. We can even present pseudocode, and such sites as RosettaCode have hundreds of implementations, ranging from a single line of K to a tail-recursive implementation in OCaml. There is no doubt about it, the Ackermann function is computable. In a somewhat naive sense, the Ackermann sequence is no deeper or more insightful than Archimedes’s first attempt at exponentiation. It just carries the idea of exponentiation as far as it can go.

The next natural question is: are there any sequences of numbers which grow so ridiculously fast that they simply cannot be computed? Of course, we know already of one undecidable (non-computable) task: the halting problem. Recall that the halting problem asks if it is possible to design a computer program $T$ which, when given the source code of any other computer program $S$ and an input to that program $w$, can determine in a finite amount of time whether $S$ will loop infinitely on the input $w$, or whether it will halt. The blatant fact is that this is impossible. A hint at the proof is the question: how would this program analyze itself? What about it’s evil twin that does exactly the opposite? Here’s a comical version of the proof in Dr. Seuss form, and a more formal proof can be found in our primers on the subject.

As we have seen before, one way to prove that something is not computable is to first assume that it is computable, and then use that to construct a program that solves the halting problem. This results in a contradiction, and we conclude that the original problem could not be computable. In other words, we are looking for a sequence of numbers which has the following property: if we were able to compute an arbitrarily large entry of the sequence in a finite amount of time, we could use that ability to determine whether any program will halt or loop infinitely in a finite amount of time.

Of course, the idea that such a sequence should even exist is dubious. But amazingly enough, we will construct one, and find the next level of big numbers.

## Busy Beaver Numbers

The idea behind Busy Beavers is quite simple. It was discovered by Tibor Radó, a Hungarian mathematician, in May of 1962, and does not rely at all on the arithmetic expressions of the past. The idea requires a bit of knowledge about Turing machines, so after promoting our primer once more, we will reiterate the ideas.

A Turing machine is essentially an infinite strip of paper called the tape on which one scribbles symbols in a deterministic way. In order to know what to do at each step, a Turing machine has an internal state record, with a finite set of rules for how to transition from state to state and manipulate the tape (write new symbols, erase old symbols, and move around to different places on the tape). In full rigor, the Turing machine is only allowed to look at one of the symbols on the tape at a time. So one should truly imagine a little beaver running back and forth across his dam, making tick marks in the wood and never knowing what he’s going to do next until he sees the tick mark next to the one he just made. The Turing machine has no limit on the amount of time it can run, and the tape is infinite, so it can write as many symbols as it needs to do its work. In the end, it will either stop, report a  ”yes” or a “no,” or it will keep on going until the end of time, never to halt.

Perhaps understandably, the Busy Beaver numbers have to do with counting steps of computation in Turing machines. In particular, for a fixed number $n$, it is possible to look at all Turing machines which have exactly $n$ states in their internal record (indeed, there can only be finitely many, and it is not too difficult to come up with an exact number). Among these Turing machines, some will run forever and some will eventually stop. We only care about the ones that stop, and the largest possible number of steps that those machines take before stopping.

Definition: The $n$th Busy Beaver number, denoted $BB(n)$, is the largest number of steps taken by a Turing machine with exactly $n$ states, which uses only two symbols on the tape (say 0 and 1), and which begins with an initially blank tape (all of the entries start as 0).

In other words, $BB(n)$ is the number of steps taken by the buiest beaver of all. We call a Turing Machine which achieves the required number of steps a busy beaver.

Look at that cute little beaver run along.

Unsurprisingly, there are very few one-state busy beavers. Recalling the formal definition of a Turing machine, it must have either an accept state or a reject state, and if there is only one state and the machine halts, then it takes only one step before halting. Hence, $BB(1) = 1$. A bit of number crunching will convince the reader that the second Busy Beaver number is in fact $BB(2) = 6$. Not impressed? Well it took a bit of research to get there, but Radó (the original discoverer of the Busy Beaver numbers) proved $BB(3) = 21$. This required a mountain of work and many different partial results to acquire. Why is it so hard, you ask? The fact is, there is no algorithm to list the Busy Beaver numbers. It is an uncomputable function. Let’s give a proof of this before we get to the other known values of $BB(n)$.

Theorem: $BB(n)$ is uncomputable. That is, there is no Turing machine that can takes as input $n$ and computes $BB(n)$.

Proof. Suppose to the contrary that such a machine existed, and call it $T$. We will use $T$ to construct a machine which solves the halting problem as follows:

On input $\left \langle T, w \right \rangle$, a description of a Turing machine and its associated input, we can determine in finite time the number of states that $T$ has (it is encoded in the description of $T$). Now simply compute $BB(n)$, where $n$ is the number of states in $T$, and simulate $T$ on input $w$, counting its steps as it proceeds. Eventually the simulation of $T$ either halts, or makes more steps than $BB(n)$. In the former case, we may stop the computation and declare that $T$ halts. In the latter, we know that $BB(n)$ is the maximal number of steps that can occur for any Turing machine with $n$ states that eventually halts. Therefore, if $T$ had not halted on the input $w$, it must never halt. We may stop our computation there, proclaim an infinite loop, and thereby solve the halting problem.

This is a contradiction, so the sequence $BB(n)$ cannot be computable. $\square$

This gives some insight into how ridiculously fast $BB(n)$ grows. In fact, it not only grows faster than the Ackermann function, but it grows faster than any sequence of numbers that could ever be computed by a machine! Even with infinite time to do so! Indeed, if $BB(n)$ were bounded by some computable function, then we could compute the upper bounds for $BB(n)$, use that to eliminate Turing machines that run too long until we narrow down the exact answer, and this would all happen in a finite amount of time.

In a sense, this is completely understandable. Computers are extremely complex! Applications like web browsers only need millions or billions of states, and in the proof above we’re pretending we could determine what all programs would do, even if they have trillions or quadrillions or $A(1000)$ states! Imagine all of the crazy things that could happen with that amount of computing power. The colloquial thesis is that a single program is simply not powerful enough to know all of that. It’s like asking a human to read the minds of every human on the planet, all of the conscious thoughts of simpler animals, and all of the minds of aliens that planet Earth has never dreamed of, but which perhaps reside in faraway galaxies. For indeed, a Turing machine that could count Busy Beaver numbers must itself have finitely many states, and halt on any well-formed input, but be able to reason about such mysterious behaviors of Turing machines that have never been created.

The first three values of $BB(n)$ were indeed wimpy, but now that we know it grows unfathomably quickly, we can mention the remaining known results. First, $BB(4) = 107$ was proved in the 1980′s. But for $BB(5)$, the best known lower bound is 47,176,870, and this was found as recently as 1990. On the other hand, in July of 2010 $BB(6)$ was discovered to be no smaller than  $7.412 \cdot 10^{36,534}$, and it is generally believed that this value will never be known. Nobody is brave enough to give a stab at such higher numbers in the sequence as $BB(10)$, and understandably so.

Plato: “One day, mankind will bask in enlightenment, having computed the Busy Beaver function!” Aristotle: “Word.”

In this light, the Ackermann function is a wimp. But there are some important philosophical issues to consider once more. In Archimedes’s day, people generally believed the sands were beyond counting, and they were later enlightened to know better. Even Archimedes would have been baffled by such modern constructions as pentation and hexation, but nowadays a high school mathematics student can handle the idea (at least, without attempting to visualize it). The day may come to pass when we invent a new way of defining large numbers, some which make the Busy Beaver numbers seem like plain old exponentiation. In those days, perhaps we will define a new model of computation that defies the halting problem and ushers in a new era of mathematical insight and technological prowess.

Indeed, such a model would make long-standing mathematical problems moot. With the power to compute Busy Beaver numbers, it would certainly be trivial to come up with a way to definitively answer the Goldbach Conjecture, or any other mathematical question which can be verified for a single number. In fact, we know already of a simple way to prove the Goldbach Conjecture using Busy Beavers. We leave this as an exercise to the aspiring reader.

Heiner Marxen’s page provides an up-to-date account of the current state of the Busy Beaver Problem, and a list of possible candidates for BB(5), the estimates on BB(6), and links to relevant papers. Scott Aaronson’s essay on Busy Beavers delves deeper into the philosophical implications of solving the Busy Beaver problem (and gives rich analogies), but these considerations, which are essentially fanciful speculation, are beyond the scope of this blog.

Speaking of the scope of this blog, we have yet to write any programs! This will not do, and so next time we will write a program which simulates the currently known busy beavers (or busy beaver candidates) for $n \leq 6$, and displays the changing tape as the beaver runs his route. In a sense, we will be writing a universal Turing machine, which is a commonly used tool in the theory, and not hard to implement.

Until next time!