An Un-PAC-learnable Problem

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

But PAC learning wouldn’t be an interesting model if every concept class was PAC-learnable. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $\wedge$ denotes a logical AND and the vee $\vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $x_i$ or the negation $\overline{x_i}$, and connectives $\wedge$ and $\vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $\vee$‘s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $G$, we’ll construct a set $S_G$ of labeled truth assignments, and we’ll define the distribution $D$ to be the uniform distribution over those truth assignments used in $S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $S_G$ exactly how we labeled them, and we set the allowed error $\varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $G$ with $n$ nodes $v_1, \dots, v_n$ and a set of $m$ undirected edges $E$, we construct a set of examples with positive labels $S^+$ and one with negative examples $S^-$. The examples are truth assignments to $n$ variables, which we label $x_1, \dots, x_n$, and we identify a truth assignment to the $\left \{ 0,1 \right \}$-valued vector $(x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $S^+$ are simple: for each $v_i$ add a truth assignment $x_i = T, x_j = F$ for $j \neq i$. I.e., the binary vector is $(1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $i$-th position.

The negative examples $S^-$ come from the edges. For each edge $(v_i, v_j) \in E$, we add the example with a zero in the $i$-th and $j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

Claim: $G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $\varphi$.

Again, consistent just means that $\varphi$ is satisfied by every truth assignment in $S^+$ and unsatisfied by every example in $S^-$. Since we chose our distribution to be uniform over $S^+ \cup S^-$, we don’t care what $\varphi$ does elsewhere.

Indeed, if $G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $T_R$ be the AND of all the literals $x_i$ for which vertex $v_i$ is not red. For each such $i$, the corresponding example in $S^+$ will satisfy $T_R$, because we put a zero in the $i$-th position and ones everywhere else. Similarly, no example in $S^-$ will make $T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $(v_1,v_2)$. Then the corresponding negative example is $(0,0,1)$. Unless both $v_1$ and $v_2$ are colored red, one of $x_1, x_2$ will have to be ANDed as part of $T_R$. But the example has a zero for both $x_1$ and $x_2$, so $T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $\varphi$. We need to construct a three coloring of $G$. It goes in largely the same way: label the clauses $\varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $v_i$ the color of the clause that is satisfied by the corresponding example in $S^+$. There must be some clause that does this because $\varphi$ is consistent with $S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $(v_i, v_j)$, and both $v_i$ and $v_j$ are colored, say, blue. Look at the clause $T_B$: since $v_i$ and $v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1′s everywhere else) both make $T_B$ true. Since those two positive examples differ in both their $i$-th and $j$-th positions, $T_B$ can’t have any of the literals $x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $T_B$ because it has 1′s everywhere except $i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $\delta < 1/2$ and $\varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $1 / |S^+ \cup S^-|$, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $\mathsf{C}$ over a set $X$ is efficiently PAC-learnable using the hypothesis class $\mathsf{H}$ if there exists an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all $0 < \delta , \varepsilon < 1/2$, the probability that $A$ produces a hypothesis $h \in \mathsf{H}$ with error at most $\varepsilon$ is at least $1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

Methods of Proof — Contrapositive

In this post we’ll cover the second of the “basic four” methods of proof: the contrapositive implication. We will build off our material from last time and start by defining functions on sets.

Functions as Sets

So far we have become comfortable with the definition of a set, but the most common way to use sets is to construct functions between them. As programmers we readily understand the nature of a function, but how can we define one mathematically? It turns out we can do it in terms of sets, but let us recall the desired properties of a function:

• Every input must have an output.
• Every input can only correspond to one output (the functions must be deterministic).

One might try at first to define a function in terms of subsets of size two. That is, if $A, B$ are sets then a function $f: A \to B$ would be completely specified by

$\displaystyle \left \{ \left \{ x, y \right \} : x \in A, y \in B \right \}$

where to enforce those two bullets, we must impose the condition that every $x \in A$ occurs in one and only one of those subsets. Notationally, we would say that $y = f(x)$ means $\left \{ x, y \right \}$ is a member of the function. Unfortunately, this definition fails miserably when $A = B$, because we have no way to distinguish the input from the output.

To compensate for this, we introduce a new type of object called a tuple. A tuple is just an ordered list of elements, which we write using round brackets, e.g. $(a,b,c,d,e)$.

As a quick aside, one can define ordered tuples in terms of sets. We will leave the reader to puzzle why this works, and generalize the example provided:

$\displaystyle (a,b) = \left \{ a, \left \{ a, b \right \} \right \}$

And so a function $f: A \to B$ is defined to be a list of ordered pairs where the first thing in the pair is an input and the second is an output:

$\displaystyle f = \left \{ (x, y) : x \in A, y \in B \right \}$

Subject to the same conditions, that each $x$ value from $A$ must occur in one and only one pair. And again by way of notation we say $y = f(x)$ if the pair $(x,y)$ is a member of $f$ as a set. Note that the concept of a function having “input and output” is just an interpretation. A function can be viewed independent of any computational ideas as just a set of pairs. Often enough we might not even know how to compute a function (or it might be provably uncomputable!), but we can still work with it abstractly.

It is also common to call functions “maps,” and to define “map” to mean a special kind of function (that is, with extra conditions) depending on the mathematical field one is working in. Even in other places on this blog, “map” might stand for a continuous function, or a homomorphism. Don’t worry if you don’t know these terms off hand; they are just special cases of functions as we’ve defined them here. For the purposes of this series on methods of proof, “function” and “map” and “mapping” mean the same thing: regular old functions on sets.

Injections

One of the most important and natural properties of a function is that of injectivity.

Definition: A function $f: A \to B$ is an injection if whenever $a \neq a'$ are distinct members of $A$, then $f(a) \neq f(a')$. The adjectival version of the word injection is injective.

As a quick side note, it is often the convention for mathematicians to use a capital letter to denote a set, and a lower-case letter to denote a generic element of that set. Moreover, the apostrophe on the $a'$ is called a prime (so $a'$ is spoken, “a prime”), and it’s meant to denote a variation on the non-prime’d variable $a$ in some way. In this case, the variation is that $a' \neq a$.

So even if we had not explicitly mentioned where the $a, a'$ objects came from, the knowledgeable mathematician (which the reader is obviously becoming) would be reasonably certain that they come from $A$. Similarly, if I were to lackadaisically present $b$ out of nowhere, the reader would infer it must come from $B$.

One simple and commonly used example of an injection is the so-called inclusion function. If $A \subset B$ are sets, then there is a canonical function representing this subset relationship, namely the function $i: A \to B$ defined by $i(a) = a$. It should be clear that non-equal things get mapped to non-equal things, because the function doesn’t actually do anything except change perspective on where the elements are sitting: two nonequal things sitting in $A$ are still nonequal in $B$.

Another example is that of multiplication by two as a map on natural numbers. More rigorously, define $f: \mathbb{N} \to \mathbb{N}$ by $f(x) = 2x$. It is clear that whenever $x \neq y$ as natural numbers then $2x \neq 2y$. For one, $x, y$ must have differing prime factorizations, and so must $2x, 2y$ because we added the same prime factor of 2 to both numbers. Did you catch the quick proof by direct implication there? It was sneaky, but present.

Now the property of being an injection can be summed up by a very nice picture:

A picture example of an injective function.

The arrows above represent the pairs $(x,f(x))$, and the fact that no two arrows end in the same place makes this function an injection. Indeed, drawing pictures like this can give us clues about the true nature of a proposed fact. If the fact is false, it’s usually easy to draw a picture like this showing so. If it’s true, then the pictures will support it and hopefully make the proof obvious. We will see this in action in a bit (and perhaps we should expand upon it later with a post titled, “Methods of Proof — Proof by Picture”).

There is another, more subtle concept associated with injectivity, and this is where its name comes from. The word “inject” gives one the mental picture that we’re literally placing one set $A$ inside another set $B$ without changing the nature of $A$. We are simply realizing it as being inside of $B$, perhaps with different names for its elements. This interpretation becomes much clearer when one investigates sets with additional structure, such as groups, rings, or topological spaces. Here the word “injective mapping” much more literally means placing one thing inside another without changing the former’s structure in any way except for relabeling.

In any case, mathematicians have the bad (but time-saving) habit of implicitly identifying a set with its image under an injective mapping. That is, if $f :A \to B$ is an injective function, then one can view $A$ as the same thing as $f(A) \subset B$. That is, they have the same elements except that $f$ renames the elements of $A$ as elements of $B$. The abuse comes in when they start saying $A \subset B$ even when this is not strictly the case.

Here is an example of this abuse that many programmers commit without perhaps noticing it. Suppose $X$ is the set of all colors that can be displayed on a computer (as an abstract set; the elements are “this particular green,” “that particular pinkish mauve”). Now let $Y$ be the set of all finite hexadecimal numbers. Then there is an obvious injective map from $X \to Y$ sending each color to its 6-digit hex representation. The lazy mathematician would say “Well, then, we might as well say $X \subset Y$, for this is the obvious way to view $X$ as a set of hexadecimal numbers.” Of course there are other ways (try to think of one, and then try to find an infinite family of them!), but the point is that this is the only way that anyone really uses, and that the other ways are all just “natural relabelings” of this way.

The precise way to formulate this claim is as follows, and it holds for arbitrary sets and arbitrary injective functions. If $g, g': X \to Y$ are two such ways to inject $X$ inside of $Y$, then there is a function $h: Y \to Y$ such that the composition $hg$ is precisely the map $g'$. If this is mysterious, we have some methods the reader can use to understand it more fully: give examples for simplified versions (what if there were only three colors?), draw pictures of “generic looking” set maps, and attempt a proof by direct implication.

Proof by Contrapositive

Often times in mathematics we will come across a statement we want to prove that looks like this:

If X does not have property A, then Y does not have property B.

Indeed, we already have: to prove a function $f: X \to Y$ is injective we must prove:

If x is not equal to y, then f(x) is not equal to f(y).

A proof by direct implication can be quite difficult because the statement gives us very little to work with. If we assume that $X$ does not have property $A$, then we have nothing to grasp and jump-start our proof. The main (and in this author’s opinion, the only) benefit of a proof by contrapositive is that one can turn such a statement into a constructive one. That is, we can write “p implies q” as “not q implies not p” to get the equivalent claim:

If Y has property B then X has property A.

This rewriting is called the “contrapositive form” of the original statement. It’s not only easier to parse, but also probably easier to prove because we have something to grasp at from the beginning.

To the beginning mathematician, it may not be obvious that “if p then q” is equivalent to “if not q then not p” as logical statements. To show that they are requires a small detour into the idea of a “truth table.”

In particular, we have to specify what it means for “if p then q” to be true or false as a whole. There are four possibilities: p can be true or false, and q can be true or false. We can write all of these possibilities in a table.

p  q
T  T
T  F
F  T
F  F

If we were to complete this table for “if p then q,” we’d have to specify exactly which of the four cases correspond to the statement being true. Of course, if the p part is true and the q part is true, then “p implies q” should also be true. We have seen this already in proof by direct implication. Next, if p is true and q is false, then it certainly cannot be the case that truth of p implies the truth of q. So this would be a false statement. Our truth table so far looks like

p  q  p->q
T  T   T
T  F   F
F  T   ?
F  F   ?

The next question is what to do if the premise p of “if p then q” is false. Should the statement as a whole be true or false? Rather then enter a belated philosophical discussion, we will zealously define an implication to be true if its hypothesis is false. This is a well-accepted idea in mathematics called vacuous truth. And although it seems to make awkward statements true (like “if 2 is odd then 1 = 0″), it is rarely a confounding issue (and more often forms the punchline of a few good math jokes). So we can complete our truth table as follows

p q  p->q
T T   T
T F   F
F T   T
F F   T

Now here’s where contraposition comes into play. If we’re interested in determining when “not q implies not p” is true, we can add these to the truth table as extra columns:

p  q  p->q  not q   not p   not q -> not p
T  T   T      F       F           T
T  F   F      T       F           F
F  T   T      F       T           T
F  F   T      T       T           T

As we can see, the two columns corresponding to “p implies q” and “not q implies not p” assume precisely the same truth values in all possible scenarios. In other words, the two statements are logically equivalent.

And so our proof technique for contrapositive becomes: rewrite the statement in its contrapositive form, and proceed to prove it by direct implication.

Examples and Exercises

Our first example will be completely straightforward and require nothing but algebra. Let’s show that the function $f(x) = 7x - 4$ is injective. Contrapositively, we want to prove that if $f(x) = f(x')$ then $x = x'$. Assuming the hypothesis, we start by supposing $7x - 4 = 7x' - 4$. Applying algebra, we get $7x = 7x'$, and dividing by 7 shows that $x = x’$ as desired. So $f$ is injective.

This example is important because if we tried to prove it directly, we might make the mistake of assuming algebra works with $\neq$ the same way it does with equality. In fact, many of the things we take for granted about equality fail with inequality (for instance, if $a \neq b$ and $b \neq c$ it need not be the case that $a \neq c$). The contrapositive method allows us to use our algebraic skills in a straightforward way.

Next let’s prove that the composition of two injective functions is injective. That is, if $f: X \to Y$ and $g: Y \to Z$ are injective functions, then the composition $gf : X \to Z$  defined by $gf(x) = g(f(x))$ is injective.

In particular, we want to prove that if $x \neq x'$ then $g(f(x)) \neq g(f(x'))$. Contrapositively, this is the same as proving that if $g(f(x)) = g(f(x'))$ then $x=x'$. Well by the fact that $g$ is injective, we know that (again contrapositively) whenever $g(y) = g(y')$ then $y = y'$, so it must be that $f(x) = f(x')$. But by the same reasoning $f$ is injective and hence $x = x'$. This proves the statement.

This was a nice symbolic proof, but we can see the same fact in a picturesque form as well:

A composition of two injections is an injection.

If we maintain that any two arrows in the diagram can’t have the same head, then following two paths starting at different points in $X$ will never land us at the same place in $Z$. Since $f$ is injective we have to travel to different places in $Y$, and since $g$ is injective we have to travel to different places in $Z$. Unfortunately, this proof cannot replace the formal one above, but it can help us understand it from a different perspective (which can often make or break a mathematical idea).

Expanding upon this idea we give the reader a challenge: Let $A, B, C$ be finite sets of the same size. Prove or disprove that if $f: A \to B$ and $g: B \to C$ are (arbitrary) functions, and if the composition $gf$ is injective, then both of $f, g$ must be injective.

Another exercise which has a nice contrapositive proof: prove that if $A,B$ are finite sets and $f:A \to B$ is an injection, then $A$ has at most as many elements as $B$. This one is particularly susceptible to a “picture proof” like the one above. Although the formal the formal name for the fact one uses to prove this is the pigeonhole principleit’s really just a simple observation.

Aside from inventing similar exercises with numbers (e.g., if $ab$ is odd then $a$ is odd or $b$ is odd), this is all there is to the contrapositive method. It’s just a direct proof disguised behind a fact about truth tables. Of course, as is usual in more advanced mathematical literature, authors will seldom announce the use of contraposition. The reader just has to be watchful enough to notice it.

Though we haven’t talked about either the real numbers $\mathbb{R}$ nor proofs of existence or impossibility, we can still pose this interesting question: is there an injective function from $\mathbb{R} \to \mathbb{N}$? In truth there is not, but as of yet we don’t have the proof technique required to show it. This will be our next topic in the series: the proof by contradiction.

Until then!

Why there is no Hitchhiker’s Guide to Mathematics for Programmers

For those who aren’t regular readers: as a followup to this post, there are four posts detailing the basic four methods of proof, with intentions to detail some more advanced proof techniques in the future. You can find them on this blog’s primers page.

Do you really want to get better at mathematics?

Remember when you first learned how to program? I do. I spent two years experimenting with Java programs on my own in high school. Those two years collectively contain the worst and most embarrassing code I have ever written. My programs absolutely reeked of programming no-nos. Hundred-line functions and even thousand-line classes, magic numbers, unreachable blocks of code, ridiculous code comments, a complete disregard for sensible object orientation, negligence of nearly all logic, and type-coercion that would make your skin crawl. I committed every naive mistake in the book, and for all my obvious shortcomings I considered myself a hot-shot programmer! At least I was learning a lot, and I was a hot-shot programmer in a crowd of high-school students interested in game programming.

Even after my first exposure and my commitment to get a programming degree in college, it was another year before I knew what a stack frame or a register was, two more before I was anywhere near competent with a terminal, three more before I fully appreciated functional programming, and to this day I still have an irrational fear of networking and systems programming (the first time I manually edited the call stack I couldn’t stop shivering with apprehension and disgust at what I was doing).

I just made this function call return to a *different* place than where it was called from.

In a class on C++ programming I was programming a Checkers game, and my task at the moment was to generate a list of all possible jump-moves that could be made on a given board. This naturally involved a depth-first search and a couple of recursive function calls, and once I had something I was pleased with, I compiled it and ran it on my first non-trivial example. Low and behold (even having followed test-driven development!), I was hit hard in the face by a segmentation fault. It took hundreds of test cases and more than twenty hours of confusion before I found the error: I was passing a reference when I should have been passing a pointer. This was not a bug in syntax or semantics (I understood pointers and references well enough) but a design error. And the aggravating part, as most programmers know, was that the fix required the change of about 4 characters. Twenty hours of work for four characters! Once I begrudgingly verified it worked (of course it worked, it was so obvious in hindsight), I promptly took the rest of the day off to play Starcraft.

Of course, as every code-savvy reader will agree, all of this drama is part of the process of becoming and strong programmer. One must study the topics incrementally, make plentiful mistakes and learn from them, and spend uncountably many hours in a state of stuporous befuddlement before one can be considered an experienced coder. This gives rise to all sorts of programmer culture, unix jokes, and reverence for the masters of C that make the programming community so lovely to be a part of. It’s like a secret club where you know all the handshakes. And should you forget one, a crafty use of awk and sed will suffice.

“Semicolons of Fury” was the name of my programming team in the ACM collegiate programming contest. We placed Cal Poly third in the Southern California Regionals, and in my opinion our success was due in large part to the dynamics of our team. I (center, in blue) have since gotten a more stylish haircut.

Now imagine someone comes along and says,

“I’m really interested in learning to code, but I don’t plan to write any programs and I absolutely abhor tracing program execution. I just want to use applications that others have written, like Chrome and iTunes.”

You would laugh at them! And the first thing that would pass through your mind is either, “This person would give up programming after the first twenty minutes,” or “I would be doing the world a favor by preventing this person from ever writing a program. This person belongs in some other profession.” This lies in stark opposition to the common chorus that everyone should learn programming. After all, it’s a constructive way to think about problem solving and a highly employable skill. In today’s increasingly technological world, it literally pays to know your computer better than a web browser. (Ironically, I’m writing this on my Chromebook, but in my defense it has a terminal with ssh. Perhaps more ironically, all of my real work is done with paper and pencil.)

Unfortunately this sentiment is mirrored among most programmers who claim to be interested in mathematics. Mathematics is fascinating and useful and doing it makes you smarter and better at problem solving. But a lot of programmers think they want to do mathematics, and they either don’t know what “doing mathematics” means, or they don’t really mean they want to do mathematics. The appropriate translation of the above quote for mathematics is:

“Mathematics is useful and I want to be better at it, but I won’t write any original proofs and I absolutely abhor reading other people’s proofs. I just want to use the theorems others have proved, like Fermat’s Last Theorem and the undecidability of the Halting Problem.”

Of course no non-mathematician is really going to understand the current proof of Fermat’s Last Theorem, just as no fledgling programmer is going to attempt to write a (quality) web browser. The point is that the sentiment is in the wrong place. Mathematics is cousin to programming in terms of the learning curve, obscure culture, and the amount of time one spends confused. And mathematics is as much about writing proofs as software development is about writing programs (it’s not everything, but without it you can’t do anything). Honestly, it sounds ridiculously obvious to say it directly like this, but the fact remains that people feel like they can understand the content of mathematics without being able to write or read proofs.

I want to devote the rest of this post to exploring some of the reasons why this misconception exists. My main argument is that the reasons have to do more with the culture of mathematics than the actual difficulty of the subject. Unfortunately as of the time of this writing I don’t have a proposed “solution.” And all I can claim is a problem is that programmers can have mistaken views of what mathematics involves. I don’t propose a way to make mathematics easier for programmers, although I do try to make the content on my blog as clear as possible (within reason). I honestly do believe that the struggle and confusion builds mathematical character, just as the arduous bug-hunt builds programming character. If you want to be good at mathematics, there is no other way.

All I want to do with this article is to detail why mathematics can be so hard for beginners, to explain a few of the secret handshakes, and hopefully to bring an outsider a step closer to becoming an insider. And I want to stress that this is not a call for all programmers to learn mathematics. Far from it! I just happen to notice that, for good reason, the proportion of programmers who are interested in mathematics is larger than in most professions. And as a member of both communities, I want to shed light on why mathematics can be difficult for an otherwise smart and motivated software engineer.

So read on, and welcome to the community.

Travelling far and wide

Perhaps one of the most prominent objections to devoting a lot of time to mathematics is that it can be years before you ever apply mathematics to writing programs. On one hand, this is an extremely valid concern. If you love writing programs and designing software, then mathematics is nothing more than a tool to help you write better programs.

But on the other hand, the very nature of mathematics is what makes it so applicable, and the only way to experience nature is to ditch the city entirely. Indeed, I provide an extended example of this in my journalesque post on introducing graph theory to high school students: the point of the whole exercise is to filter out the worldly details and distill the problem into a pristine mathematical form. Only then can we see its beauty and wide applicability.

Here is a more concrete example. Suppose you were trying to encrypt the contents of a message so that nobody could read it even if they intercepted the message in transit. Your first ideas would doubtlessly be the same as those of our civilization’s past: substitution ciphers, Vigenere ciphers, the Enigma machine, etc. Regardless of what method you come up with, your first thought would most certainly not be, “prime numbers so big they’ll make your pants fall down.” Of course, the majority of encryption methods today rely on very deep facts (or rather, conjectures) about prime numbers, elliptic curves, and other mathematical objects (“group presentations so complicated they’ll orient your Mobius band,” anyone?). But it took hundreds of years of number theory to get there, and countless deviations into other fields and dead-ends. It’s not that the methods themselves are particularly complicated, but the way they’re often presented (and this is unavoidable if you’re interested in new mathematical breakthroughs) is in the form of classical mathematical literature.

Of course there are other examples much closer to contemporary fashionable programming techniques. One such example is boosting. While we have yet to investigate boosting on this blog, the basic idea is that one can combine a bunch of algorithms which perform just barely better than 50% accuracy, and collectively they will be arbitrarily close to perfect. In a field dominated by practical applications, this result is purely the product of mathematical analysis.

And of course boosting in turn relies on the mathematics of probability theory, which in turn relies on set theory and measure theory, which in turn relies on real analysis, and so on. One could get lost for a lifetime in this mathematical landscape! And indeed, the best way to get a good view of it all is to start at the bottom. To learn mathematics from scratch. The working programmer simply doesn’t have time for that.

What is it really, that people have such a hard time learning?

Most of the complaints about mathematics come understandably from notation and abstraction. And while I’ll have more to say on that below, I’m fairly certain that the main obstacle is a familiarity with the basic methods of proof.

While methods of proof are semantical by nature, in practice they form a scaffolding for all of mathematics, and as such one could better characterize them as syntactical. I’m talking, of course, about the four basics: direct implication, proof by contradiction, contrapositive, and induction. These are the loops, if statements, pointers, and structs of rigorous argument, and there is simply no way to understand the mathematics without a native fluency in this language.

The “Math Major Sloth” is fluent. Why aren’t you?

So much of mathematics is built up by chaining together a multitude of absolutely trivial statements which are amendable to proof by the basic four. I’m not kidding when I say they are absolutely trivial. A professor of mine once said,

If it’s not completely trivial, then it’s probably not true.

I can’t agree more with this statement. Of course, there are many sophisticated proofs in mathematics, but an overwhelming majority of (very important) facts fall in the trivial category. That being said, trivial can be sometimes relative to one’s familiarity with a subject, but that doesn’t make the sentiment any less right. Drawing up a shopping list is trivial once you’re comfortable with a pencil and paper and you know how to write (and you know what the words mean). There are certainly works of writing that require a lot more than what it takes to write a shopping list. Likewise, when we say something is trivial in mathematics, it’s because there’s no content to the proof outside of using definitions and a typical application of the basic four methods of proof. This is the “holding a pencil” part of writing a shopping list.

And as you probably know, there are many many more methods of proof than just the basic four. Proof by construction, by exhaustion, case analysis, and even picture proofs have a place in all fields of mathematics. More relevantly for programmers, there are algorithm termination proofs, probabilistic proofs, loop invariants to design and monitor, and the ubiquitous NP-hardness proofs (I’m talking about you, Travelling Salesman Problem!). There are many books dedicated to showcasing such techniques, and rightly so. Clever proofs are what mathematicians strive for above all else, and once a clever proof is discovered, the immediate first step is to try to turn it into a general method for proving other facts. Fully flushing out such a process (over many years, showcasing many applications and extensions) is what makes one a world-class mathematician.

An entire book dedicated to the probabilistic method of proof, invented by Paul Erdős and sown into the soil of mathematics over the course of his lifetime.

Another difficulty faced by programmers new to mathematics is the inability to check your proof absolutely. With a program, you can always write test cases and run them to ensure they all pass. If your tests are solid and plentiful, the computer will catch your mistakes and you can go fix them.

There is no corresponding “proof checker” for mathematics. There is no compiler to tell you that it’s nonsensical to construct the set of all sets, or that it’s a type error to quotient a set by something that’s not an equivalence relation. The only way to get feedback is to seek out other people who do mathematics and ask their opinion. In solo, mathematics involves a lot of backtracking, revising mistaken assumptions, and stretching an idea to its breaking point to see that it didn’t even make sense to begin with. This is “bug hunting” in mathematics, and it can often completely destroy a proof and make one start over from scratch. It feels like writing a few hundred lines of code only to have the final program run “rm -rf *” on the directory containing it. It can be really. really. depressing.

It is an interesting pedagogical question in my mind whether there is a way to introduce proofs and the language of mature mathematics in a way that stays within a stone’s throw of computer programs. It seems like a worthwhile effort, but I can’t think of anyone who has sought to replace a classical mathematics education entirely with one based on computation.

Mathematical syntax

Another major reason programmers are unwilling to give mathematics an honest effort is the culture of mathematical syntax: it’s ambiguous, and there’s usually nobody around to explain it to you. Let me start with an example of why this is not a problem in programming. Let’s say we’re reading a Python program and we see an expression like this:

foo[2]

The nature of (most) programming languages dictates that there are a small number of ways to interpret what’s going on in here:

1. foo could be a list/tuple, and we’re accessing the third element in it.
2. foo could be a dictionary, and we’re looking up value associated to the key 2.
3. foo could be a string, and we’re extracting the third character.
4. foo could be a custom-defined object, whose __getitem__ method is defined somewhere else and we can look there to see exactly what it does.

There are probably other times this notation can occur (although I’d be surprised if number 4 didn’t by default capture all possible uses), but the point is that any programmer reading this program knows enough to intuit that square brackets mean “accessing an item inside foo with identifier 2.” Part of the reasons that programs can be very easy to read is precisely because someone had to write a parser for a programming language, and so they had to literally enumerate all possible uses of any expression form.

The other extreme is the syntax of mathematics. The daunting fact is that there is no bound to what mathematical notation can represent, and much of mathematical notation is inherently ad hoc. For instance, if you’re reading a math paper and you come across an expression that looks like this

$\delta_i^j$

The possibilities of what this could represent are literally endless. Just to give the unmathematical reader a taste: $\delta_i$ could be an entry of a sequence of numbers of which we’re taking arithmetic $j^\textup{th}$ powers. The use of the letter delta could signify a slightly nonstandard way to write the Kronecker delta function, for which $\delta_i^j$ is one precisely when $i=j$ and zero otherwise. The superscript $j$ could represent dimension. Indeed, I’m currently writing an article in which I use $\delta^k_n$ to represent $k$-dimensional simplex numbers, specifically because I’m relating the numbers to geometric objects called simplices, and the letter for those is  a capital $\Delta$. The fact is that using notation in a slightly non-standard way does not invalidate a proof in the way that it can easily invalidate a program’s correctness.

What’s worse is that once mathematicians get comfortable with a particular notation, they will often “naturally extend” or even silently drop things like subscripts and assume their reader understands and agrees with the convenience! For example, here is a common difficulty that beginners face in reading math that involves use of the summation operator. Say that I have a finite set of numbers whose sum I’m interested in. The most rigorous way to express this is not far off from programming:

Let $S = \left \{ x_1, \dots, x_n \right \}$ be a finite set of things. Then their sum is finite:

$\displaystyle \sum_{i=1}^n x_i$

The programmer would say “great!” Assuming I know what “+” means for these things, I can start by adding $x_1 + x_2$, add the result to $x_3$, and keep going until I have the whole sum. This is really just a left fold of the plus operator over the list $S$.

But for mathematicians, the notation is far more flexible. For instance, I could say

Let $S$ be finite. Then $\sum_{x \in S} x$ is finite.

Things are now more vague. We need to remember that the $\in$ symbol means “in.” We have to realize that the strict syntax of having an iteration variable $i$ is no longer in effect. Moreover, the order in which the things are summed (which for a left fold is strictly prescribed) is arbitrary. If you asked any mathematician, they’d say “well of course it’s arbitrary, in an abelian group addition is commutative so the order doesn’t matter.” But realize, this is yet another fact that the reader must be aware of to be comfortable with the expression.

But it still gets worse.

In the case of the capital Sigma, there is nothing syntactically stopping a mathematician from writing

$\displaystyle \sum_{\sigma \in \Sigma} f_{\Sigma}(\sigma)$

Though experienced readers may chuckle, they will have no trouble understanding what is meant here. That is, syntactically this expression is unambiguous enough to avoid an outcry: $\Sigma$ just happens to also be a set, and saying $f_{\Sigma}$ means that the function $f$ is constructed in a way that depends on the choice of the set $\Sigma$. This often shows up in computer science literature, as $\Sigma$ is a standard letter to denote an alphabet (such as the binary alphabet $\left \{ 0,1 \right \}$).

One can even take it a step further and leave out the set we’re iterating over, as in

$\displaystyle \sum_{\sigma} f_{\Sigma}(\sigma)$

since it’s understood that the lowercase letter ($\sigma$) is usually an element of the set denoted by the corresponding uppercase letter ($\Sigma$). If you don’t know greek and haven’t seen that coincidence enough times to recognize it, you would quickly get lost. But programmers must realize: this is just the mathematician’s secret handshake. A mathematician would be just as bewildered and confused upon seeing some of the pointer arithmetic hacks C programmers invent, or the always awkward infinite for loop, if they had not had enough experience dealing with the syntax of standard for loops.

for (;;) {
;
}

In fact, a mathematician would look at this in disgust! The fact that the C programmer has need for something as pointless as an “empty statement” should be viewed as a clumsy inelegance in the syntax of the programming language (says the mathematician). Since mathematicians have the power to change their syntax at will, they would argue there’s no good reason to change it, if it were a mathematical expression, to something simpler.

And once the paper you’re reading is over, and you start reading a new paper, chances are their conventions and notation will be ever-so-slightly different, and you have to keep straight what means what. It’s as if the syntax of a programming language changed depending on who was writing the program!

Perhaps understandably, the frustration that most mathematicians feel when dealing with varying syntax across different papers and books is collectively called “technicalities.” And the more advanced the mathematics becomes, the ability to fluidly transition between high-level intuition and technical details is all but assumed.

The upshot of this whole conversation is that the reader of a mathematical proof must hold in mind a vastly larger body of absorbed (and often frivolous) knowledge than the reader of a computer program.

At this point you might see all of this as my complaining, but in truth I’m saying this notational flexibility and ambiguity is a benefit. Once you get used to doing mathematics, you realize that technical syntax can make something which is essentially simple seem much more difficult than it is. In other words, we absolutely must have a way to make things completely rigorous, but in developing and presenting proofs the most important part is to make the audience understand the big picture, see intuition behind the symbols, and believe the proofs. For better or worse, mathematical syntax is just a means to that end, and the more abstract the mathematics becomes, the more flexiblility mathematicians need to keep themselves afloat in a tumultuous sea of notation.

You’re on your own, unless you’re around mathematicians

That brings me to my last point: reading mathematics is much more difficult than conversing about mathematics in person. The reason for this is once again cultural.

Imagine you’re reading someone else’s program, and they’ve defined a number of functions like this (pardon the single-letter variable names, I’m just don’t like “foo” and “bar”).

def splice(L):
...

def join(*args):
...

def flip(x, y):
...

There are two parts to understanding how these functions work. The first part is that someone (or a code comment) explains to you in a high level what they do to an input. The second part is to weed out the finer details. These “finer details” are usually completely spelled out by the documentation, but it’s still a good practice to experiment with it yourself (there is always the possibility for bugs or unexpected features, of course).

In mathematics there is no unified documentation, just a collective understanding, scattered references, and spoken folk lore. You’re lucky if a textbook has a table of notation in the appendix. You are expected to derive the finer details and catch the errors yourself. Even if you are told the end result of a proposition, it is often followed by, “The proof is trivial.” This is the mathematician’s version of piping output to /dev/null, and literally translates to, “You’re expected to be able to write the proof yourself, and if you can’t then maybe you’re not ready to continue.”

Indeed, the opposite problems are familiar to a beginning programmer when they aren’t in a group of active programmers. Why is it that people give up or don’t enjoy programming? Is it because they have a hard time getting honest help from rudely abrupt moderators on help websites like stackoverflow? Is it because often when one wants to learn the basics, they are overloaded with the entirety of the documentation and the overwhelming resources of the internet and all its inhabitants? Is it because compiler errors are nonsensically exact, but very rarely helpful? Is it because when you learn it alone, you are bombarded with contradicting messages about what you should be doing and why (and often for the wrong reasons)?

All of these issues definitely occur, and I see them contribute to my students’ confusion in my introductory Python class all the time. They try to look on the web for information about how to solve a very basic problem, and they come back to me saying they were told it’s more secure to do it this way, or more efficient to do it this way, or that they need to import something called the “heapq module.” When really the goal is not to solve the problem in the best way possible or in the shortest amount of code, but to show them how to use the tools they already know about to construct a program that works. Without a guiding mentor it’s extremely easy to get lost in the jungle of people who think they know what’s best.

As far as I know there is no solution to this problem faced by the solo programming student (or the solo anything student). And so it stands for mathematics: without others doing mathematics with you, its very hard to identify your issues and see how to fix them.

Proofs, Syntax, and Community

For the programmer who is truly interested in improving their mathematical skills, the first line of attack should now be obvious. Become an expert at applying the basic methods of proof. Second, spend as much time as it takes to clear up what mathematical syntax means before you attempt to interpret the semantics. And finally, find others who are interested in seriously learning some mathematics, and work on exercises (perhaps a weekly set) with them. Start with something basic like set theory, and write your own proofs and discuss each others’ proofs. Treat the sessions like code review sessions, and be the compiler to your partner’s program. Test their arguments to the extreme, and question anything that isn’t obvious or trivial. It’s not uncommon for easy questions with simple answers and trivial proofs to create long and drawn out discussions before everyone agrees it’s obvious. Embrace this and use it to improve.

Short of returning to your childhood and spending more time doing recreational mathematics, that is the best advice I can give.

Until next time!

P vs. NP, A Primer (And a Proof Written in Racket)

Decidability Versus Efficiency

In the early days of computing theory, the important questions were primarily about decidability. What sorts of problems are beyond the power of a Turing machine to solve? As we saw in our last primer on Turing machines, the halting problem is such an example: it can never be solved a finite amount of time by a Turing machine. However, more recently (in the past half-century) the focus of computing theory has shifted away from possibility in favor of determining feasibility. In particular, we want to know how fast we can solve a particular problem on a Turing machine. The goal is to design efficient algorithms for important real-world problems, and know when it is impossible to design a more efficient algorithm than what we already have. From a mathematical perspective, that often means we would like to find the boundary problems: those which are “just barely” too hard to admit an efficient algorithm. Of course, this requires us to first define what it means for an algorithm to be “efficient.”

These questions are intimately intertwined with big-O analysis, which we presented in our very first primer on this blog, and the definitions we investigate in the sequel require fluency in such notation. That being said, we try to give colloquial analogues to the most important ideas.

The Class P

The general goal here is to describe classes of problems that roughly correspond to how efficiently one can solve them. The first such class (and by all rights, the smallest class we will consider) is called P, which stands for “polynomial.”

Definition: P is the set of languages which can be decided by a Turing machine in $O(n^k)$ for some $k$, where $n$ is the input size of the problem. In the common notation:

$\displaystyle \textup{P} = \bigcup_{k \in \mathbb{N}} \textup{TIME}(n^k)$

Where $\textup{TIME}(n^k)$ is the set of problems decidable in $O(n^k)$ time. Recall that a language is any set of strings (any subset of $\Sigma^*$, as we have been investigating in our past computing theory primers and our post about metrics on words).

We don’t have too much to say about P itself, except that many common problems are in P. The real difficulty is in proving that some problem does not lie in P.

Here is an example this author has used to explain to elementary school children what’s it’s like to not be in P. Consider a deck of 52 playing cards, and your goal is to sort the cards by suit and then by rank (as it is when you open a fresh pack of cards). There is one obvious way to do this that children teach themselves quite readily: just look for the Ace of spades, put it on the bottom of a new stack. Then find the two of spades, and continue in this manner until the deck is sorted. This algorithm takes $O(n^2)$, since in the very worst case the deck is sorted in reverse order, and if you’re too dim to realize it, you look through the entire deck at every step.

Of course, there are much faster ways to do it than the approach we detailed above, and it is a well-known theorem that sorting is $\Theta(n \log(n))$. That is, there is both a lower bound and an upper bound of $n \log(n)$ for the general sorting problem.

Now let’s investigate a ridiculous algorithm for sorting cards. What if at each step, we shuffled the deck, and then checked to see if it magically became sorted due to our shuffling. Forgetting for a moment that randomness comes into play, we would find the correct sorting once in $52!$ trials, and it would take 52 steps to check if it was sorted. For a deck of $n$ cards, this would hence take time $O(n! \cdot n)$, which would take so long even for 52 cards, that the sun would go out before you could finish. But what if we didn’t know a better way?

For an example of a problem where we don’t know a better way, consider the problem of graph coloring. Is there a Turing machine which, when given as input a graph $G$ and a number $k$, can determine in polynomial time whether $G$ admits a $k$-coloring? There is an obvious algorithm to decide the problem: there are only finitely many choices of assignments of vertices to colors, and we can simply check them all. In fact, there are $k^n$ of them, where $n$ is the number of vertices of $G$.

Unfortunately, this algorithm is not polynomial in runtime: we would have to enumerate all of the different colorings, and check whether each is a valid coloring; this process is easily seen to be $o(nk^n)$, which is far from polynomial for arbitrary $k > 2$.

But the true challenge is this: how do we know there is no “faster” algorithm? Of all the crazy wild ideas one could have to solve a problem, how do we know none can be so clever that they reduce the running time to be a polynomial in $n$?

In fact, we don’t know for sure that there isn’t! This is the heart of the open problem which is succinctly called “P vs. NP”.

The Class NP

While P is a class of “easy” problems from one perspective (problems that can be solved quickly, even in the worst case), being a member of NP is another measure of “easiness,” but from a different perspective.

Definition: NP is the class of problems which have a polynomial-time verifier. That is, given an input $w$ to a problem $A \in \textup{NP}$, there is a string $c$ called a certificate and a Turing machine $M$ for which $M$ verifies that $c$ proves $w \in A$ and runs in polynomial time.

This definition is a bit hard to swallow, but examples clarify the situation greatly. For the problem of graph coloring, we note that a certificate would simply be a list of pairs $(v_i, n_i)$ which give a coloring of the graph $G$. It is quite trivial to define a polynomial-time Turing machine that ensures the coloring of $G$ is valid. Hence, graph coloring is in NP. This is the case with most problems in NP: a proof that $w \in A$ is hard to find, but easy to verify once you have it.

There is another definition of NP which is often useful, and it gives a reason for prefixing the “polynomial” part of P with an N.

Definition: NP is the set of problems which are solvable by a nondeterministic Turing machine in polynomial time.

For the motivating picture behind “nondeterministic Turing machines,” we turn to an analogy. Imagine you have an infinite number of computers running in parallel, and they can communicate instantaneously. What sorts of problems could we solve in polynomial time with such a machine? We could certainly solve graph coloring: simply have each machine try one of the $k^n$ different colorings, and have the entire machine halt when one coloring is found (or when they all finish, we can safely reject that the graph is k-colorable).

So we can reformulate the definition in set notation as:

$\displaystyle \textup{NP} = \bigcup_{k \in \mathbb{N}} \textup{NTIME}(n^k)$

Here $\textup{NTIME}(f(n))$ is the set of all languages which can be solved in time $f(n)$ on a nondeterministic Turing machine.

In other words, “NP” stands for “nondeterministic polynomial-time” problems. And in fact this definition is equivalent to existence of a polynomial-time verifier, as in our first definition. To see this, note that we can construct a nondeterministic machine that enumerates all possible certificates (lending to our analogy, one on each of the infinitely numerous parallel computers), and then tests them using the polynomial-time verifier. Since each branch uses a polynomial-time algorithm, the whole Turing machine runs in (nondeterministic) polynomial time. On the other hand, if some branch of computation halts in deterministic time, then the sequence of configurations of the tape for that branch has polynomial length, and so a Turing machine can simply run that computation to ensure it follows the rules of Turing machines and ends in an accepting state. This clearly represents a certificate.

One might ask why we care about infinitely parallel computers. In reality, we can only have finitely many computations going at once, so why bother with this silly mathematical uselessness? As it often turns out in mathematics, it is useful to think about such structures simply because they capture the essence of what we wish to study in a concise and formal manner. For complexity theory, nondeterministic Turing machines capture the level of complexity we wish to understand: in a sense, it’s the “next level up” from polynomial-time decidable problems.

K-Clique and 3-Sat

We have two more examples of problems in NP which will be important later in this post: the problems of 3-Satisfiability and finding a k-Clique.

Definition: Let $\varphi(x_1, \dots, x_n)$ be a propositional formula in $n$ boolean variables. $\varphi$ is satisfiable if there exists an assignment of the variables $x_1, \dots, x_n \to \left \{ \textup{T}, \textup{F} \right \}$ that makes $\varphi$ true.

For example, we could have the formula

$\displaystyle (x \vee y \vee \overline{z}) \wedge (\overline{x} \vee \overline{y} \vee z)$

And this is satisfiable by the assignment of $x = \textup{T}, y = \textup{F}, z = \textup{T}$.

It should be obvious to the reader now that determining whether a formula is satisfiable is in NP: a certificate is just a list of variables mapping to true or false, and checking that the formula is satisfied by a given assignment can be done in polynomial time.

In 3-Satisfiability (usually shortened to 3-Sat), we simply restrict the form of $\varphi$ and ask the same question. The form is called conjunctive normal form, and colloquially it is a bunch of clauses joined with “and,” where each clause is a bunch of variables (and their negations) connected by “or”. Moreover, the “3″ in 3-Sat requires that each clause contain exactly three literals. For example, the equation given above would be a valid input to 3-Sat.

k-Clique, on the other hand, is a question about graphs. Given a graph $G$ and a positive integer $k$, determine whether $G$ contains a complete subgraph of $k$ vertices. (In a complete graph, there is an edge connecting each pair of vertices; the name “clique” is motivated by the party problem, in the sense that there is a “clique” of friends at the party who are all mutually friends with each other.)

As expected, a certificate that a graph has a $k$-Clique is just a list of the vertices in the clique, and checking that all pairs of vertices listed have a connecting edge is easily seen to take $O(|G|k^2) = O(n^3)$, which is polynomial in the size of the input.

NP-Completeness

As it turns out, the problems of 3-Satisfiability and k-Clique are quite special (as is graph coloring). They belong to a special sub-class of NP called NP-complete. Before we can define what NP-complete means, we have to be able to compare problems in NP.

Definition: Given two languages $A, B$, we say $A \leq_p B$, or $A$ is polynomial-time reducible to $B$ if there exists a computable function $f: \Sigma^* \to \Sigma^*$ such that $w \in A$ if and only if $f(w) \in B$, and $f$ can be computed in polynomial time.

We have seen this same sort of idea with mapping reducibility in our last primer on Turing machines. Given a language $B$ that we wanted to show as undecidable, we could show that if we had a Turing machine which decided $B$, we could solve the halting problem. This is precisely the same idea: given a solution for $B$ and an input for $A$, we can construct in polynomial time an input for $B$, use a decider for $B$ to solve it, and then output accordingly. The only new thing is that the conversion of the input must happen in polynomial time.

Of course, this discussion was the proof of a clarifying lemma:

Lemma: If $B \in \textup{P}$ and $A \leq_p B$, then $A \in \textup{P}$.

The proof is immediate, as $B$ can be solved in polynomial time, and the conversion function runs in polynomial time. We leave the construction of an explicit Turing machine to decide $A$ as an exercise to the reader.

To phrase things more colloquially, $A \leq_p B$ is true if $A$ is an easier” problem than $B$, hence justifying the less-than notation.

And now for the amazing part: there are problems in NP which are harder than all other problems in NP.

Definition: A language $A \in \textup{NP}$ is called NP-complete if for all problems $B \in \textup{NP}$, $B \leq_p A$.

In other words, all problems in NP reduce to an NP-complete problem in polynomial time. In fact, we get another nice fact about NP-completeness that mirrors our observation about P above:

Lemma: If $A$ is NP-complete and $B \in \textup{NP}$ with $A \leq_p B$, then $B$ is NP-complete.

Obviously the composition of two polynomial-time reductions is a polynomial-time reduction, so we can conclude that all problems in NP which reduce to $A$ also reduce to $B$.

The cautious reader should be rather hesitant to believe that NP-complete problems should even exist. There is no reason we can’t come up with harder and harder problems, so why should there be a point after which we can’t quickly verify a solution?

Well, Stephen Cook proved in 1971 that there is an NP-complete problem, and shortly thereafter many more were found. Today, there are thousands of known NP-complete problems.

Perhaps unsurprisingly, Cook’s original NP-complete problem was N-Satisfiability (i.e., 3-Satisfiability without a constraint on the number of clauses or the form). Unfortunately the proof is quite difficult. We point the reader to the relevant Wikipedia page, and briefly mention the outline of a proof.

Given a nondeterministic polynomial-time Turing machine, we can bound the number of parallel computations and the length of each computation by $n^k$ for some fixed $k$. Then we create a $n^k$ by $n^k$ table of the configurations of the Turing machine (the i,j cell for the i-th branch of computation and the j-th step). From this, we can construct a monstrously enormous (yet polynomial in size) formula which has a satisfying assignment if and only if the Turing machine halts on some branch of computation in an accept state. Here is a table of the formulas needed to do this. In short, the formula traces the computation of the machine at each step, and ensures the transition function is honestly followed, the tape is reliably updated, and the head of each tape moves in the correct direction.

The reason we say it’s unsurprising that Satisfiability is NP-complete is because it’s commonly believed that every aspect of mathematics boils down to pure logic, although the process of doing so is entrenched in gory detail every step of the way. So it’s understandable that all problems in NP reduce to a problem about logic which is also in NP. We stipulate that other complexity classes likely have “complete” problems that are essentially questions about logical formulas.

A New Way to Find NP-Complete Problems

Now that we have established the NP-completeness of Satisfiability, we can do the same for other problems by reduction from a known NP-complete problem. First, we claim that 3-Satisfiability is NP-complete, and we leave the proof as an exercise to the reader (hint: reduce from regular Satisfiability by putting the formula into the right form).

Now given that 3-Sat is NP-complete, we will prove that k-Clique is NP-complete, by reduction from 3-Sat (in fact our conversion function will work for any formulas in conjunctive normal form, but 3 is enough).

Theorem: k-Clique is NP-complete.

Proof. Given a formula $\varphi$ in conjunctive normal form, we construct an instance of k-Clique as follows. First, let $k$ be the number of clauses in $\varphi$. Construct a graph $G_{\varphi}$ by creating a vertex for each literal term in $\varphi$, and (to help visualization) organize them into columns by their originating clause, and label the vertex with its corresponding literal. Introduce an edge connecting two terms $a, b$ in different columns when the formula $a \wedge b$ is not a contradiction. In other words, it cannot be of the form $x \wedge \overline{x}$ for some variable $x$.

As a concrete example, the formula

$\displaystyle (x \vee \overline{y} \vee z) \wedge (\overline{x} \vee \overline{z} \vee w)$

converts to the graph

We claim that $\varphi$ has a satisfying assignment of variables if and only if $G_{\varphi}$ has a k-clique. Supposing there is a valid assignment of variables in $\varphi$, then there must be one variable in each clause which is true (and hence $k$ variables). This translates to $k$ vertices in $G_{\varphi}$, one vertex in each column which is true, and none of these vertices conflict with each other, so $G_{\varphi}$ has an edge connecting each pair of the $k$ vertices. Conversely, suppose $G_{\varphi}$ has a k-clique. By our construction, two edges in the same column cannot be connected by and edge, and hence this k-clique must have one vertex in every column. If the vertex is labeled with a negation, assign it to the value $F$ (so that the literal evaluates to true), and otherwise assign it the value $T$. This gives a satisfying assignment of the variables of $\varphi$, since each clause will evaluate to true under this assignment.

The final part of the proof is that the conversion function runs in polynomial time, and we claim this is obvious from the construction: if $\varphi$ has $n$ literals, then we create $O(n)$ vertices and $O(n^2)$ edges. The creation of each vertex and edge can be done in constant time, as can the verification that two literals do not conflict. $\square$

Of course, this is a question about the possibilities of computers. Instead of giving a theoretical proof, why not just write a program to compute the conversion? Well we did just that, and the main body of the code turned out to be quite tidy:

;; n-sat-to-clique: formula -> (listof edge)
;; transform an input to n-sat to an input for clique
;; assume the input expression is in CNF, and that
(define (n-sat-to-clique expr)
(let* ([conjuncts (∧-conjuncts expr)]
[columns (map (λ (x) (∨-disjuncts x)) conjuncts)]
[labeled-columns (label-graph columns 1)]
[possible-edges (all-pairs-in-distinct-lists labeled-columns)])
(list->set (filter no-conflict? possible-edges))))

We have a data structure for a general formula (and provide a function to compute the conjunctive normal form of any expression), and a data structure for a graph (essentially, a list of pairs of labelled vertices), and so the process of checking all possible edges, and filtering out those which have no conflict, clearly takes $O(n^2)$ time.

The rest of the code required to run the function above is available on this blog’s Github page.

Other NP-complete Problems, and P =? NP

In the real world NP-complete problems show up everywhere. Application domains include cryptography, financial portfolio and investment management, scheduling and operation dynamics, packing problems, chem- and bioinformatics, guarding art galleries, circuit design, compiler design, and even modelling social networks.

There are even many questions one can ask about games that turn out to be NP-complete in complexity. For instance, many questions about the classic game of Tetris are NP-complete, along with Minesweeper, FreeCell, Lemmings, Battleship, and Mastermind.

Now the big unsolved question is does P = NP? In other words, can any of these seemingly hard problems be solved quickly? The simple fact is, if the answer is yes to one such problem, then the answer is yes not only to all NP-complete problems, but to all problems in NP (as we saw in our lemma earlier). This is the heart of the million-dollar question that has been crowned the most difficult open problem in computer science to date. Almost everyone agrees that P and NP should not be equal, but nobody can prove it.

Of course, common people love to talk about P and NP because of all of the wild things that would happen if we suddenly discovered that P = NP. All widely used security systems would fail, internet transactions would no longer be safe, efficiency in industry would increase by orders of magnitude, we’d unlock the secrets of the human genome, we’d quickly solve open mathematical problems and find staggeringly ginormicon primes (ginormicon = gigantic + enormous + decepticon), governments will topple, rioting in the streets, the artificial intelligence singularity will occur, etc., etc.

But all of this is just media hype. The likely reality is that some problems are simply too hard to be solved in polynomial time in general, just as there are probably problems which have no polynomial-time verifiers (i.e., problems outside of NP), excluding the trivial problems which are undecidable. In the end, it’s just a matter of time until mathematics sorts everything out, or proves that it’s impossible to do so. Indeed, just two years ago this author remembers waking up to the news that there was a 100 page proof that P is not equal to NP, and moreover that Stephen Cook himself considered it a serious attempt. Unfortunately it turned out to have irreparable flaws, but failure made it no less exciting: this is how great mathematics is made.

On the other hand, there are cranks out there who have, for the last few years, been convinced that they have proved P = NP, but are ignorant of their own flaws and the rest of the world’s criticism. May Gauss have mercy on their mathematical souls.

Until next time!