A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$ \displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $ \wedge$ denotes a logical AND and the vee $ \vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $ x_i$ or the negation $ \overline{x_i}$, and connectives $ \wedge$ and $ \vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $ C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $ \vee$’s in a 3-term DNF formula using the distributive law, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.
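To see the distribution concretely: if $ C_1, C_2, C_3$ are ANDs of literals, then picking one literal from each term in every possible way gives

$ \displaystyle C_1 \vee C_2 \vee C_3 = \bigwedge_{u \in C_1, v \in C_2, w \in C_3} (u \vee v \vee w),$

an AND of clauses with (at most) three literals each, of which there are at most $ |C_1| \cdot |C_2| \cdot |C_3|$.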

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $ G$, we’ll construct a set $ S_G$ of labeled truth assignments, and we’ll define the distribution $ D$ to be the uniform distribution over those truth assignments used in $ S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $ S_G$ exactly how we labeled them, and we set the allowed error $ \varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $ G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $ G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to a negative example that no term can satisfy.

For a graph $ G$ with $ n$ nodes $ v_1, \dots, v_n$ and a set of $ m$ undirected edges $ E$, we construct a set of examples with positive labels $ S^+$ and one with negative examples $ S^-$. The examples are truth assignments to $ n$ variables, which we label $ x_1, \dots, x_n$, and we identify a truth assignment with the $ \left \{ 0,1 \right \}$-valued vector $ (x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $ S^+$ are simple: for each $ v_i$ add the truth assignment $ x_i = F, x_j = T$ for $ j \neq i$. I.e., the binary vector is $ (1, \dots, 1, 0, 1, \dots, 1)$, and the zero is in the $ i$-th position.

The negative examples $ S^-$ come from the edges. For each edge $ (v_i, v_j) \in E$, we add the example with a zero in the $ i$-th and $ j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

[Figure: PAC-reduction, an example graph with its corresponding positive and negative examples]
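To make the construction concrete, here is a minimal Python sketch of the reduction (the function name and representation are my own, not from any library):

def reduction_examples(n, edges):
    """Build the labeled examples S+ and S- for a graph on n vertices.

    Positive examples: for each vertex i, the all-ones vector with a 0
    in position i. Negative examples: for each edge (i, j), the all-ones
    vector with 0s in positions i and j.
    """
    positives = []
    for i in range(n):
        example = [1] * n
        example[i] = 0
        positives.append(tuple(example))

    negatives = []
    for (i, j) in edges:
        example = [1] * n
        example[i] = example[j] = 0
        negatives.append(tuple(example))

    return positives, negatives

# For a triangle on vertices 0, 1, 2:
# S+ = [(0,1,1), (1,0,1), (1,1,0)] and S- = [(0,0,1), (0,1,0), (1,0,0)].
S_plus, S_minus = reduction_examples(3, [(0, 1), (0, 2), (1, 2)])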

Claim: $ G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $ \varphi$.

Again, consistent just means that $ \varphi$ is satisfied by every truth assignment in $ S^+$ and unsatisfied by every example in $ S^-$. Since we chose our distribution to be uniform over $ S^+ \cup S^-$, we don’t care what $ \varphi$ does elsewhere.

Indeed, if $ G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow, and construct a 3-term DNF that does what we need. Let $ T_R$ be the AND of all the literals $ x_i$ for which vertex $ v_i$ is not red. For each red vertex $ v_i$, the corresponding example in $ S^+$ will satisfy $ T_R$: that example has ones everywhere except the $ i$-th position, and $ T_R$ only involves the variables of non-red vertices. Similarly, no example in $ S^-$ will make $ T_R$ true, because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $ (v_1,v_2)$. Then the corresponding negative example is $ (0,0,1)$. Unless both $ v_1$ and $ v_2$ are colored red, one of $ x_1, x_2$ will have to be ANDed as part of $ T_R$. But the example has a zero for both $ x_1$ and $ x_2$, so $ T_R$ would not be satisfied.

Do the same thing for blue and yellow to get $ T_B$ and $ T_Y$, and OR the three terms together to get $ T_R \vee T_B \vee T_Y$. Since the argument is symmetric for the other colors, we have a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $ \varphi$. We need to construct a three-coloring of $ G$. It goes in largely the same way: label the terms $ \varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $ v_i$ the color of a term that is satisfied by the corresponding example in $ S^+$. There must be some term that does this because $ \varphi$ is consistent with $ S^+$, and if there are multiple you can pick one arbitrarily.

Now we argue why no edge can be monochromatic. Suppose there were such an edge $ (v_i, v_j)$, and both $ v_i$ and $ v_j$ are colored, say, blue. Look at the term $ T_B$: since $ v_i$ and $ v_j$ are both blue, the positive examples corresponding to those vertices (each with a 0 in a single position and 1’s everywhere else) both make $ T_B$ true. Since those two positive examples differ in their $ i$-th and $ j$-th positions, $ T_B$ can’t contain any of the literals $ x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $ T_B$, because it has 1’s everywhere except positions $ i$ and $ j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to check a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $ G$ (there are $ n + m$ of them); that the learning algorithm still runs in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $ \delta < 1/2$ and $ \varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $ 1 / |S^+ \cup S^-|$ under the uniform distribution, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals: introducing one new variable for each possible OR of three literals blows up the number of variables by only a polynomial factor, and turns a 3-CNF into a plain AND over the new variables. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far, as in the sketch below. We leave a formal proof as an (involved) exercise to the reader.
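Here is a minimal sketch of that elimination idea for conjunctions (the representation and names are mine):

def learn_conjunction(n, labeled_examples):
    """Learn an AND of literals over n boolean variables.

    labeled_examples is a list of (assignment, label) pairs, where an
    assignment is a tuple of n bits. Start with the most specific
    hypothesis, the AND of all 2n literals, and delete every literal
    falsified by a positive example. The target's own literals always
    survive, so the hypothesis implies the target and never errs on
    negative examples.
    """
    # (i, 1) stands for the literal x_i, and (i, 0) for its negation.
    literals = {(i, b) for i in range(n) for b in (0, 1)}
    for assignment, label in labeled_examples:
        if label:
            literals = {(i, b) for (i, b) in literals if assignment[i] == b}
    return literals

def evaluate(literals, assignment):
    """True exactly when the assignment satisfies every remaining literal."""
    return all(assignment[i] == b for (i, b) in literals)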

The important thing to note is that a concept class $ C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $ C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $ C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $ \mathsf{C}$ over a set $ X$ is efficiently PAC-learnable using the hypothesis class $ \mathsf{H}$ if there exists an algorithm $ A(\varepsilon, \delta)$ with access to a query function for $ \mathsf{C}$ and runtime $ O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $ c \in \mathsf{C}$, all distributions $ D$ over $ X$, and all $ 0 < \delta , \varepsilon < 1/2$, the probability that $ A$ produces a hypothesis $ h \in \mathsf{H}$ with error at most $ \varepsilon$ is at least $ 1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

On Coloring Resilient Graphs

I’m pleased to announce that another paper of mine is finished. This one just got accepted to MFCS 2014, which is being held in Budapest this year (this whole research thing is exciting!). This is joint work with my advisor, Lev Reyzin. As with my first paper, I’d like to explain things here on my blog a bit more informally than a scholarly article allows.

A Recent History of Graph Coloring

One of the first important things you learn when you study graphs is that coloring graphs is hard. Remember that coloring a graph with $ k$ colors means that you assign each vertex a color (a number in $ \left \{ 1, 2, \dots, k \right \}$) so that no vertex is adjacent to a vertex of the same color (no edge is monochromatic). In fact, even deciding whether a graph can be colored with just $ 3$ colors (not to mention finding such a coloring) has no known polynomial time algorithm. It’s what’s called NP-hard, which means that almost everyone believes it’s hopeless to solve efficiently in the worst case.

One might think that there’s some sort of gradient to this problem, that as the graphs get more “complicated” it becomes algorithmically harder to figure out how colorable they are. There are some notions of “simplicity” and “complexity” for graphs, but they hardly fall on a gradient. Just to give the reader an idea, here are some ways to make graph coloring easy:

  • Make sure your graph is planar. Then deciding 4-colorability is easy because the answer is always yes.
  • Make sure your graph is triangle-free and planar. Then finding a 3-coloring is easy.
  • Make sure your graph is perfect (which again requires knowledge about how colorable it is).
  • Make sure your graph has tree-width or clique-width bounded by a constant.
  • Make sure your graph doesn’t have a certain kind of induced subgraph (such as having no induced paths of length 4 or 5).

Let me emphasize that these results are very difficult and tricky to compare. The properties are inherently discrete (either perfect or imperfect, planar or not planar). The fact that the world has not yet agreed upon a universal measure of complexity for graphs (or at least one that makes graph coloring easy to understand) is not a criticism of the chef but a testament to the challenge and intrigue of the dish.

Coloring general graphs is much bleaker, where the focus has turned to approximations. You can’t “approximate” the answer to whether a graph is colorable, so now the key here is that we are actually trying to find an approximate coloring. In particular, if you’re given some graph $ G$ and you don’t know the minimum number of colors needed to color it (say it’s $ \chi(G)$, this is called the chromatic number), can you easily color it with what turns out to be, say, $ 2 \chi(G)$ colors?

Garey and Johnson (the gods of NP-hardness) proved this problem is again hard. In fact, they proved that you can’t do better than twice the number of colors. This might not seem so bad in practice, but the story gets worse. This lower bound was improved by Zuckerman, building on the work of Håstad, to depend on the size of the graph! That is, unless $ P=NP$, all efficient algorithms will use asymptotically more than $ \chi(G) n^{1-\varepsilon}$ colors for any $ \varepsilon > 0$ in the worst case, where $ n$ is the number of vertices of $ G$. So the best you can hope for is being off by something like a multiplicative factor of $ n / \log n$. You can actually achieve this (it’s nontrivial and takes a lot of work), but it carries that aura of pity for the hopeful graph colorer.

The next avenue is to assume you know the chromatic number of your graph, and see how well you can do then. For example: if you are given the promise that a graph $ G$ is 3-colorable, can you efficiently find a coloring with 8 colors? The best would be if you could find a coloring with 4 colors, but this is already known to be NP-hard.

The best upper bounds, algorithms to find approximate colorings of 3-colorable graphs, also pitifully depend on the size of the graph. Remember I say pitiful not to insult the researchers! This decades-long line of work was extremely difficult and deserves the highest praise. It’s just frustrating that the best known algorithm to color a 3-colorable graph requires as many as $ n^{0.2}$ colors. At least it bypasses the barrier of $ n^{1 – \varepsilon}$ mentioned above, so we know that knowing the chromatic number actually does help.

The lower bounds are a bit more hopeful; it’s known to be NP-hard to color a $ k$-colorable graph using $ 2^{\sqrt[3]{k}}$ colors if $ k$ is sufficiently large. There are a handful of other linear lower bounds that work for all $ k \geq 3$, but to my knowledge this is the best asymptotic result. The big open problem (which I doubt many people have their eye on considering how hard it seems) is to find an upper bound depending only on $ k$. I wonder offhand whether a ridiculous bound like $ k^{k^k}$ colors would be considered progress, and I bet it would.

Our Idea: Resilience

So without big breakthroughs on the front of approximate graph coloring, we propose a new front for investigation. The idea is that we consider graphs which are not only colorable, but remain colorable under the adversarial operation of adding a few new edges. More formally,

Definition: A graph $ G = (V,E)$ is called $ r$-resiliently $ k$-colorable if two properties hold:

  1. $ G$ is $ k$-colorable.
  2. For any set $ E'$ of $ r$ edges disjoint from $ E$, the graph $ G' = (V, E \cup E')$ is $ k$-colorable.

The simplest nontrivial example of this is 1-resiliently 3-colorable graphs. That is a graph that is 3-colorable and remains 3-colorable no matter which new edge you add. And the question we ask of this example: is there a polynomial time algorithm to 3-color a 1-resiliently 3-colorable graph? We prove in our paper that this is actually NP-hard, but it’s not a trivial thing to see.

The chief benefit of thinking about resiliently colorable graphs is that it provides a clear gradient of complexity from general graphs (zero-resilient) to the empty graph (which is $ \left(\binom{k+1}{2} - 1\right)$-resiliently $ k$-colorable). We know that the most complex case is NP-hard, and maximally resilient graphs are trivially colorable. So finding the boundary where resilience makes things easy can shed new light on graph coloring.

Indeed, we argue in the paper that lots of important graphs have stronger resilience properties than one might expect. For example, here are the resilience properties of some famous graphs.

From left to right: the Petersen graph, 2-resiliently 3-colorable; the Dürer graph, 4-resiliently 4-colorable; the Grötzsch graph, 4-resiliently 4-colorable; and the Chvátal graph, 3-resiliently 4-colorable. These are all maximally resilient (no graph is more resilient than stated) and chromatic (no graph is colorable with fewer colors).

If I were of a mind to do applied graph theory, I would love to know about the resilience properties of graphs that occur in the wild. For example, the reader probably knows the problem of register allocation is a natural graph coloring problem. I would love to know the resilience properties of such graphs, with the dream that they might be resilient enough on average to admit efficient coloring algorithms.

Unfortunately the only way that I know how to compute resilience properties is via brute-force search, and of course this only works for small graphs and small $ k$. If readers are interested I could post such a program (I wrote it in vanilla python; a condensed sketch of the idea appears after the tables below), but for now I’ll just post a table I computed on the proportion of small graphs that have various levels of resilience (note this includes graphs that vacuously satisfy the definition).

Percentage of k-colorable graphs on 6 vertices which are r-resilient
k\r       1       2       3       4
  ----------------------------------------
3       58.0    22.7     5.9     1.7
4       93.3    79.3    58.0    35.3
5       99.4    98.1    94.8    89.0
6      100.0   100.0   100.0   100.0

Percentage of k-colorable graphs on 7 vertices which are r-resilient
k\r       1       2       3       4
  ----------------------------------------
3       38.1     8.2     1.2     0.3
4       86.7    62.6    35.0    14.9
5       98.7    95.6    88.5    76.2
6       99.9    99.7    99.2    98.3

Percentage of k-colorable graphs on 8 vertices which are r-resilient
k\r       1       2       3       4
  ----------------------------------------
3       21.3     2.1     0.2     0.0
4       77.6    44.2    17.0     4.5
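For the curious, here is a condensed sketch of the kind of brute-force check involved (my own reconstruction, not the program used for the paper):

from itertools import combinations, product

def is_colorable(n, edges, k):
    """Brute force: True if some assignment of k colors to the n vertices
    leaves no edge monochromatic."""
    return any(all(c[u] != c[v] for (u, v) in edges)
               for c in product(range(k), repeat=n))

def is_resilient(n, edges, k, r):
    """Check r-resilient k-colorability per the definition above.
    Vertices are 0, ..., n-1 and edges are pairs (u, v) with u < v."""
    if not is_colorable(n, edges, k):
        return False
    non_edges = [e for e in combinations(range(n), 2) if e not in set(edges)]
    # The graph must stay k-colorable for every choice of r added edges.
    return all(is_colorable(n, edges + list(added), k)
               for added in combinations(non_edges, r))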

The idea is this: if this trend continues, that only some small fraction of all 3-colorable graphs are, say, 2-resiliently 3-colorable graphs, then it should be easy to color them. Why? Because resilience imposes structure on the graphs, and that structure can hopefully be realized in a way that allows us to color easily. We don’t know how to characterize that structure yet, but we can give some structural implications for sufficiently resilient graphs.

For example, a 7-resiliently 5-colorable graph can’t have any subgraph on 6 vertices with $ \binom{6}{2} - 7 = 8$ or more edges, or else we could add at most 7 edges to get a 6-clique, which isn’t 5-colorable. This gives an obvious general property about the sizes of subgraphs in resilient graphs, but as a more concrete instance let’s think about 2-resiliently 3-colorable graphs $ G$. This property says that no set of 4 vertices may have more than $ 4 = \binom{4}{2} - 2$ edges in $ G$. This rules out 4-cycles and non-isolated triangles, so $ G$ is a triangle-free graph together with a bunch of disjoint triangles. But is that enough to make 3-coloring easy? It’s known that 3-coloring triangle-free (non-planar) graphs is NP-hard in general, so this structure alone doesn’t obviously help. Moreover, 2-resilience isn’t enough to make $ G$ planar. It’s not hard to construct a non-planar counterexample, but proving it’s 2-resilient is a tedious task I relegated to my computer.

Speaking of which, the problem of how to determine whether a $ k$-colorable graph is $ r$-resiliently $ k$-colorable is open. Is this problem even in NP? It certainly seems not to be, but if it had a nice characterization or even stronger necessary conditions than above, we might be able to use them to find efficient coloring algorithms.

In our paper we begin to fill in a table whose completion would characterize the NP-hardness of coloring resilient graphs:

[Table: the known complexity of $ k$-coloring $ r$-resiliently $ k$-colorable graphs]

Ignoring the technical notion of 2-to-1 hardness, the paper accomplishes this as follows. First, we prove some relationships between cells. In particular, if a cell is NP-hard then so are all the cells to the left and below it. So our Theorem 1, that 3-coloring 1-resiliently 3-colorable graphs is NP-hard, gives us the entire black region, though more trivial arguments give all except the (3,1) cell. Also, if a cell is in P (it’s easy to $ k$-color graphs with that resilience), then so are all cells above and to its right. We prove that $ k$-coloring $ \binom{k}{2}$-resiliently $ k$-colorable graphs is easy. This is trivial: no vertex may have degree greater than $ k-1$, and the greedy algorithm can color such graphs with $ k$ colors. So that gives us the entire light gray region.
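As a quick illustration of that last step, here is a minimal greedy-coloring sketch (my own; it assumes the max-degree bound just mentioned):

def greedy_color(adjacency, k):
    """Greedily k-color a graph given as {vertex: set of neighbors}.
    This succeeds whenever every vertex has degree at most k - 1, which
    is exactly the situation for binom(k,2)-resiliently k-colorable graphs.
    """
    color = {}
    for v in adjacency:
        used = {color[u] for u in adjacency[v] if u in color}
        color[v] = next(c for c in range(k) if c not in used)
    return color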

There is one additional lower bound that comes from the fact that it’s NP-hard to $ 2^{\sqrt[3]{k}}$-color a $ k$-colorable graph. In particular, we prove that if you have any function $ f(k)$ that makes it NP-hard to $ f(k)$-color a $ k$-colorable graph, then it is NP-hard to $ f(k)$-color an $ (f(k) - k)$-resiliently $ f(k)$-colorable graph. The exponential lower bound hence gives us a nice linear lower bound, and so we have the following “sufficiently zoomed out” picture of the table:

[Figure: the zoomed-out version of the classification table above]

The paper contains the details of how these observations are proved, in addition to the NP-hardness proof for 1-resiliently 3-colorable graphs. This leaves the following open problems:

  • Get an unconditional, concrete linear resilience lower bound for hardness.
  • Find an algorithm that colors graphs that are less resilient than $ O(k^2)$. Even determining specific cells like (4,5) or (5,9) would likely give enough insight for this.
  • Classify the tantalizing (3,2) cell (determine if it’s hard or easy to 3-color a 2-resiliently 3-colorable graph) or even better the (4,2) cell.
  • Find a way to relate resilient coloring back to general coloring. For example, if such and such cell is hard, then you can’t approximate k-coloring to within so many colors.

But Wait, There’s More!

Though this paper focuses on graph coloring, our idea of resilience doesn’t stop there (and this is one reason I like it so much!). One can imagine a notion of resilience for almost any combinatorial problem. If you’re trying to satisfy boolean formulas, you can define resilience to mean that you fix the truth value of some variable (we do this in the paper to build up to our main NP-hardness result of 3-coloring 1-resiliently 3-colorable graphs). You can define resilient set cover to allow the removal of some sets. And any other sort of graph-based problem (Traveling salesman, max cut, etc) can be resiliencified by adding or removing edges, whichever makes the problem more constrained.

So this resilience notion is quite general, though it’s hard to define precisely in a general fashion. There is a general framework called Constraint Satisfaction Problems (CSPs), but resilience here seems too general. A CSP is literally just a bunch of objects which can be assigned some set of values, and a set of constraints (k-ary 0-1-valued functions) that need to all be true for the problem to succeed. If we were to define resilience by “adding any constraint” to a given CSP, then there’s nothing to stop us from adding the negation of an existing constraint (or even the tautologically unsatisfiable constraint!). This kind of resilience would be a vacuous definition, and even if we try to rule out these edge cases, I can imagine plenty of weird things that might happen in their stead. That doesn’t mean there isn’t a nice way to generalize resilience to CSPs, but it would probably involve some sort of “constraint class” of acceptable constraints, and I don’t know a reasonable property to impose on the constraint class to make things work.

So there’s lots of room for future work here. It’s exciting to think where it will take me.

Until then!

Miller-Rabin Primality Test

Problem: Determine if a number is prime, with an acceptably small error rate.

Solution: (in Python)

import random

def decompose(n):
    # Write n as 2^exponentOfTwo times an odd remainder.
    exponentOfTwo = 0

    while n % 2 == 0:
        n = n // 2  # integer division; n / 2 would give a float in Python 3
        exponentOfTwo += 1

    return exponentOfTwo, n

def isWitness(possibleWitness, p, exponent, remainder):
    # Returns True if possibleWitness certifies that p is composite.
    possibleWitness = pow(possibleWitness, remainder, p)

    if possibleWitness == 1 or possibleWitness == p - 1:
        return False

    # Square repeatedly, looking for -1 (that is, p - 1) along the way.
    for _ in range(exponent):
        possibleWitness = pow(possibleWitness, 2, p)

        if possibleWitness == p - 1:
            return False

    return True

def probablyPrime(p, accuracy=100):
    if p == 2 or p == 3: return True
    if p < 2: return False

    exponent, remainder = decompose(p - 1)

    for _ in range(accuracy):
        possibleWitness = random.randint(2, p - 2)
        if isWitness(possibleWitness, p, exponent, remainder):
            return False

    return True

Discussion: This algorithm is known as the Miller-Rabin primality test, and it was a very important breakthrough in the study of probabilistic algorithms.

Efficiently testing whether a number is prime is a crucial problem in cryptography, because the security of many cryptosystems depends on the use of large randomly chosen primes. Indeed, we’ve seen one on this blog already which is in widespread use: RSA. Randomized algorithms also have quite useful applications in general, because it’s often the case that a solution which is incorrect with probability, say, $ 2^{-100}$ is good enough for practice.

But from a theoretical and historical perspective, primality testing lay at the center of a huge problem in complexity theory. In particular, it is unknown whether algorithms which have access to randomness and can output probably correct answers are more powerful than those that don’t. The use of randomness in algorithms comes in a number of formalizations, the most prominent of which is called BPP (Bounded-error Probabilistic Polynomial time). The Miller-Rabin algorithm shows that primality testing is in BPP. On the other hand, problems solvable in polynomial time without randomness are in a class called P.

For a long time (from 1975 to 2002), it was unknown whether primality testing was in P or not. There are very few remaining important problems that have BPP algorithms but are not known to be in P. Polynomial identity testing is the main example, and until 2002 primality testing shared its title. Now primality has a known polynomial-time algorithm. One might argue that (in theory) the Miller-Rabin test is now useless, but it’s still a nice example of a nontrivial BPP algorithm.

The algorithm relies on the following theorem:

Theorem: if $ p$ is a prime, let $ 2^s$ be the maximal power of 2 dividing $ p-1$, so that $ p-1 = 2^{s}d$ and $ d$ is odd. Then for any $ 1 \leq n \leq p-1$, one of two things happens:

  • $ n^d = 1 \mod p$ or
  • $ n^{2^j d} = -1 \mod p$ for some $ 0 \leq j < s$.
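For example, take $ p = 13$: then $ p - 1 = 12 = 2^2 \cdot 3$, so $ s = 2$ and $ d = 3$. For $ n = 5$ we get $ 5^3 = 125 \equiv 8 \mod 13$, which is neither $ 1$ nor $ -1$, but squaring once gives $ 5^{2 \cdot 3} \equiv 8^2 = 64 \equiv -1 \mod 13$, so the second condition holds with $ j = 1$.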

The algorithm then simply operates as follows: pick a nonzero $ n$ at random; if both of the above conditions fail, then $ n$ is called a witness for the fact that $ p$ is composite, and we can stop and report “composite.” If $ p$ is not a prime, then there is at least a 3/4 chance that a randomly chosen $ n$ will be a witness, so after many independent trials with no witness found we can confidently report “probably prime.”

We leave the proof of the theorem as an exercise. Start with the fact that $ n^{p-1} = 1 \mod p$ (this is Fermat’s Little Theorem). Then repeatedly take square roots: since $ p$ is prime, each square root of 1 must be $ \pm 1 \mod p$. Continue until you either hit $ -1$ or get all the way down to $ n^{d} = 1 \mod p$.
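In symbols, the step being repeated is

$ \displaystyle n^{2^j d} \equiv 1 \mod p \implies n^{2^{j-1} d} \equiv \pm 1 \mod p,$

which holds because $ x^2 \equiv 1 \mod p$ means $ p$ divides $ (x-1)(x+1)$, and a prime must divide one of the two factors.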

The Python code above uses Python’s built-in modular exponentiation function pow to do fast modular exponents. The isWitness function first checks $ n^d = 1 \mod p$ and then all powers $ n^{2^j d} = -1 \mod p$. The probablyPrime function then simply generates random potential witnesses and checks them via the previous function. The output of the function is True if and only if none of the inspected candidates turns out to be a witness. The endpoints of the random range are 2 and $ p-2$ because $ n = 1$ and $ n = p-1$ can never be witnesses: $ 1^d \equiv 1$ and $ (p-1)^d \equiv p-1 \mod p$, since $ d$ is odd.
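As a quick sanity check (this snippet is mine, not part of the original code), one can run the test on a known prime and a nearby composite:

# 2^31 - 1 is a Mersenne prime; 2^31 - 3 ends in 5, so it is composite.
print(probablyPrime(2**31 - 1))   # True
print(probablyPrime(2**31 - 3))   # False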

Ramsey Number Lower Bound

Define the Ramsey number $ R(k,m)$ to be the minimum number $ n$ of vertices required of the complete graph $ K_n$ so that for any two-coloring (red, blue) of the edges of $ K_n$ one of two things will happen:

  • There is a red $ k$-clique; that is, a complete subgraph of $ k$ vertices for which all edges are red.
  • There is a blue $ m$-clique.

It is known that these numbers are always finite, but it is very difficult to compute them exactly.

Problem: Prove that the Ramsey number $ R(m,m) > n$ whenever $ n,m$ satisfy

$ \displaystyle \binom{n}{m}2^{1-\binom{m}{2}} < 1$

Solution: Color the edges of $ K_n$ uniformly and independently at random (that is, each edge has probability 1/2 of being colored red). For any complete subgraph $ G = K_m$, let $ A_G$ be the event that $ G$ is monochromatic (its edges are either all red or all blue).

Now the probability that $ A_G$ occurs (where $ G$ is fixed ahead of time) is easy to compute:

$ \displaystyle \textup{Pr}(A_G) = \left (\frac{1}{2} \right)^{\binom{m}{2} - 1} = 2^{1-\binom{m}{2}}$

Since there are $ \binom{n}{m}$ possible subgraphs with $ m$ vertices, the probability that the event $ A_G$ occurs for some $ G$ is, by the union bound, at most

$ \displaystyle \binom{n}{m}2^{1-\binom{m}{2}}$

Whenever this quantity is strictly less than 1 (by assumption) then there is a positive probability that no event $ A_G$ will occur. That is, there is a positive probability that a random coloring will have no monochromatic subgraph $ K_m$. So there must exist such a coloring, and the Ramsey number $ R(m,m)$ must be larger than $ n$. $ \square$
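To get concrete numbers out of the bound, here is a tiny computation (my own sketch; the function name is made up) of the largest $ n$ it certifies for a given $ m$:

from math import comb

def ramsey_lower_bound(m):
    """Largest n for which C(n, m) * 2^(1 - C(m, 2)) < 1, equivalently
    C(n, m) < 2^(C(m, 2) - 1), so the argument above gives R(m, m) > n.
    Falls back to the trivial n = m - 1 if the inequality never holds."""
    n = m - 1
    while comb(n + 1, m) < 2 ** (comb(m, 2) - 1):
        n += 1
    return n

# For instance, ramsey_lower_bound(4) == 6, certifying R(4, 4) > 6.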

Discussion: This proof (originally due to Erdős) is a classic example of the so-called probabilistic method. In particular, we create a probability space from the object we wish to study, and then we make claims about the probability of joint events.

While it seems quite simple in nature, the probabilistic method has been successfully applied to a wide variety of problems in mathematics. For instance, there is an elegant proof in complexity theory that $ \textup{BPP} \subset \textup{P/poly}$ which uses this same method. The probabilistic method has been applied to loads of problems in combinatorics, number theory, and graph theory, and it forms the foundation of the area of random graph theory (which is the setting in which one studies social networks). Perhaps unsurprisingly, there is also a proof of the fundamental theorem of algebra that uses the probabilistic method.