Occam’s Razor and PAC-learning

So far our discussion of learning theory has consisted of seeing the definition of PAC-learning, tinkering with it, and seeing simple examples of learnable concept classes. We’ve said that our real interest is in proving big theorems about what big classes of problems can and can’t be learned. One major tool for doing this with PAC is the concept of VC-dimension, but to set the stage we’re going to prove a simpler theorem that gives a nice picture of PAC-learning when your hypothesis class is small. In short, the theorem we’ll prove says that if you have a finite set of hypotheses to work with, and you can always find a hypothesis that’s consistent with the data you’ve seen, then you can learn efficiently. It’s intuitively obvious, but we want to quantify exactly how much data you need to ensure low error. This will also give us some concrete mathematical justification for philosophical claims about simplicity, and the theorems won’t change much when we generalize to VC-dimension in a future post.

The Chernoff bound

One tool we will need in this post, which shows up all across learning theory, is the Chernoff-Hoeffding bound. We covered this famous inequality in detail previously on this blog, but the part of that post we need is the following theorem that says, informally, that if you average a bunch of bounded random variables, then the probability this average random variable deviates from its expectation is exponentially small in the amount of deviation. Here’s the slightly simplified version we’ll use:

Theorem: Let X_1, \dots, X_m be independent random variables whose values are in the range [0,1]. Call \mu_i = \mathbf{E}[X_i], X = \sum_i X_i, and \mu = \mathbf{E}[X] = \sum_i \mu_i. Then for all t > 0,

\displaystyle \Pr(|X-\mu| > t) \leq 2e^{-2t^2 / m}

One nice thing about the Chernoff bound is that it doesn’t matter how the variables are distributed. This is important because in PAC we need guarantees that hold for any distribution generating data. Indeed, in our case the random variables above will be individual examples drawn from the distribution generating the data. We’ll be estimating the probability that our hypothesis has error deviating more than \varepsilon, and we’ll want to bound this by \delta, as in the definition of PAC-learning. Since the amount of deviation (error) and the number of samples (m) both occur in the exponent, the trick is in balancing the two values to get what we want.
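
To make the bound feel concrete, here’s a quick simulation (a throwaway sketch in Python; the bias p and the particular numbers are arbitrary choices for illustration) that estimates the deviation probability for a sum of Bernoulli variables and compares it to 2e^{-2t^2/m}.

    import math
    import random

    def estimate_deviation_probability(m, t, p=0.3, trials=100000):
        # Each trial draws X = X_1 + ... + X_m with X_i ~ Bernoulli(p), so mu = m*p,
        # and we count how often |X - mu| exceeds t.
        mu = m * p
        bad = 0
        for _ in range(trials):
            X = sum(1 for _ in range(m) if random.random() < p)
            if abs(X - mu) > t:
                bad += 1
        return bad / trials

    m, t = 100, 15
    empirical = estimate_deviation_probability(m, t)
    hoeffding = 2 * math.exp(-2 * t ** 2 / m)
    print(empirical, "<=", hoeffding)  # the empirical frequency should fall below the bound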

Realizability and finite hypothesis classes

Let’s recall the PAC model once more. We have a distribution D generating labeled examples (x, c(x)), where c is an unknown function coming from some concept class C. Our algorithm can draw a polynomial number of these examples, and it must produce a hypothesis h from some hypothesis class H (which may or may not contain c). The guarantee we need is that, for any \delta, \varepsilon > 0, the algorithm produces a hypothesis whose error on D is at most \varepsilon, and this event happens with probability at least 1-\delta. All of these probabilities are taken over the randomness in the algorithm’s choices and the distribution D, and it has to work no matter what the distribution D is.

Let’s introduce some simplifications. First, we’ll assume that the hypothesis and concept classes H and C are finite. Second, we’ll assume that C \subset H, so that you can actually hope to find a hypothesis of zero error. This is called realizability. Later we’ll relax these first two assumptions, but they make the analysis a bit cleaner. Finally, we’ll assume that we have an algorithm which, when given labeled examples, can find in polynomial time a hypothesis h \in H that is consistent with every example.

These assumptions give a trivial learning algorithm: draw a bunch of examples and output any consistent hypothesis. The question is, how many examples do we need to guarantee that the hypothesis we find has the prescribed generalization error? It will certainly grow with 1 / \varepsilon, but we need to ensure it will only grow polynomially fast in this parameter. Indeed, realizability is such a strong assumption that we can prove a polynomial bound using even more basic probability theory than the Chernoff bound.
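
Here’s what that trivial algorithm looks like in code, as a minimal Python sketch; the hypothesis class is just a finite list of functions, and the toy threshold class below is my own example rather than anything from the formal setup.

    import random

    def find_consistent_hypothesis(hypotheses, examples):
        # Return any hypothesis agreeing with every labeled example, or None if none does.
        for h in hypotheses:
            if all(h(x) == label for (x, label) in examples):
                return h
        return None

    # Toy example: thresholds on [0,1] at multiples of 0.1, target is "x >= 0.5",
    # so realizability holds and a consistent hypothesis always exists.
    hypotheses = [lambda x, t=t: x >= t for t in [i / 10 for i in range(11)]]
    target = lambda x: x >= 0.5
    examples = [(x, target(x)) for x in (random.random() for _ in range(50))]
    h = find_consistent_hypothesis(hypotheses, examples)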

Theorem: An algorithm that efficiently finds a consistent hypothesis will PAC-learn any finite concept class provided it has at least m samples, where

\displaystyle m \geq \frac{1}{\varepsilon} \left ( \log |H| + \log \left ( \frac{1}{\delta} \right ) \right )

Proof. All we need to do is bound the probability that a bad hypothesis (one with error more than \varepsilon) is consistent with the given data. Now fix D, c, \delta, \varepsilon, and draw m examples and let h be any hypothesis that is consistent with the drawn examples. Suppose that the bad thing happens, that \Pr_D(h(x) \neq c(x)) > \varepsilon.

Because the examples are all drawn independently from D, the chance that all m examples are consistent with h is

\displaystyle (1 - \Pr_{x \sim D}(h(x) \neq c(x)))^m < (1 - \varepsilon)^m

What we’re saying here is, the probability that a specific bad hypothesis is consistent with your drawn examples is exponentially small in the number of samples. So if we apply the union bound, the probability that some bad hypothesis the algorithm could produce is consistent with the data is at most (1 - \varepsilon)^m S, where S is the number of hypotheses the algorithm might produce.

A crude upper bound on the number of hypotheses you could produce is just the total number of hypotheses, |H|. Even cruder, let’s use the inequality (1 - x) < e^{-x} to give the bound

\displaystyle (1 - \varepsilon)^m |H| < e^{-\varepsilon m} |H|

Now we want to make sure that this probability, the probability of choosing a high-error (yet consistent) hypothesis, is at most \delta. So we can set the above quantity less than \delta and solve for m:

\displaystyle e^{-\varepsilon m} |H| \leq \delta

Taking logs and solving for m gives the desired bound.
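
Spelled out, that last step is:

\displaystyle -\varepsilon m + \log |H| \leq \log \delta \iff \varepsilon m \geq \log |H| + \log \frac{1}{\delta} \iff m \geq \frac{1}{\varepsilon} \left ( \log |H| + \log \left ( \frac{1}{\delta} \right ) \right )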

\square

An obvious objection is: what if you aren’t working with a hypothesis class where you can guarantee that you’ll find a consistent hypothesis? Well, in that case we’ll need to inspect the definition of PAC again and reevaluate our measures of error. It turns out we’ll get a similar theorem as above, but with the stipulation that we’re only achieving error within epsilon of the error of the best available hypothesis.

But before we go on, this theorem has some deep philosophical interpretations. In particular, suppose that, before drawing your data, you could choose to work with one of two finite hypothesis classes H_1, H_2, with |H_1| > |H_2|. If you can find a consistent hypothesis no matter which hypothesis class you use, then this theorem says that your generalization guarantees are much stronger if you start with the smaller hypothesis class.

In other words, all else being equal, the smaller set of hypotheses is better. For this reason, the theorem is sometimes called the “Occam’s Razor” theorem. We’ll see a generalization of this theorem in the next section.

Unrealizability and an extra epsilon

Now suppose that H doesn’t contain any hypotheses with zero error (or even error less than \varepsilon). What can we hope to do in this case? One thing is that we can hope to find a hypothesis whose error is within \varepsilon of the minimal error of any hypothesis in H. Moreover, we might not have any consistent hypotheses for some data samples! So rather than require an algorithm to produce an h \in H that is perfectly consistent with the data, we just need it to produce a hypothesis that has minimal empirical error, in the sense that it is as close to consistent as the best hypothesis in H on the data you happened to draw. It seems like such a strategy would find you a hypothesis that’s close to the best one in H, but we need to prove it and determine how many samples we need to draw to succeed.

So let’s make some definitions to codify this. For a given hypothesis h, call \textup{err}(h) the true error of h on the distribution D. Our assumption is that there may be no hypotheses in H with \textup{err}(h) = 0. Next we’ll call \hat{\textup{err}}(h) the empirical error of h, i.e., the fraction of the drawn examples on which h disagrees with the given label.

Definition: We say a concept class C is agnostically learnable using the hypothesis class H if for all c \in C and all distributions D (and all \varepsilon, \delta > 0), there is a learning algorithm A which produces a hypothesis h that with probability at least 1 - \delta satisfies

\displaystyle \text{err}(h) \leq \min_{h' \in H} \text{err}(h') + \varepsilon

and everything runs in the same sort of polynomial time as for vanilla PAC-learning. This is called the agnostic setting or the unrealizable setting, in the sense that we may not be able to find a hypothesis with zero empirical error.

We seek to prove that all concept classes are agnostically learnable with a finite hypothesis class, provided you have an algorithm that can minimize empirical error. But actually we’ll prove something stronger.

Theorem: Let H be a finite hypothesis class and m the number of samples drawn. Then for any \delta > 0, with probability 1-\delta the following holds:

\displaystyle \forall h \in H, \quad |\hat{\text{err}}(h) - \text{err}(h)| \leq \sqrt{\frac{\log |H| + \log(2 / \delta)}{2m}}

In other words, we can precisely quantify how the empirical error converges to the true error as the number of samples grows. But this holds for all hypotheses in H, so this provides a uniform bound of the difference between true and empirical error for the entire hypothesis class.

Proving this requires the Chernoff bound. Fix a single hypothesis h \in H. If you draw an example x, call Z the random variable which is 1 when h(x) \neq c(x), and 0 otherwise. So if you draw m samples and call the i-th variable Z_i, the empirical error of the hypothesis is \hat{\text{err}}(h) = \frac{1}{m}\sum_i Z_i. Moreover, the true error is the expectation of this random variable, since \mathbf{E}[\frac{1}{m} \sum_i Z_i] = \mathbf{E}[Z] = \text{err}(h).

So what we’re asking is the probability that the empirical error deviates from the true error by a lot. Let’s call “a lot” some parameter \varepsilon/2 > 0 (the reason for dividing by two will become clear in the corollary to the theorem). Then plugging things into the Chernoff-Hoeffding bound gives a bound on the probability of the “bad event,” that the empirical error deviates too much.

\displaystyle \Pr[|\hat{\text{err}}(h) - \text{err}(h)| > \varepsilon / 2] < 2e^{-\frac{\varepsilon^2m}{2}}

Now to get a bound on the probability that some hypothesis is bad, we apply the union bound and use the fact that |H| is finite to get

\displaystyle \Pr[\exists h \in H : |\hat{\text{err}}(h) - \text{err}(h)| > \varepsilon / 2] < 2|H|e^{-\frac{\varepsilon^2m}{2}}

Now say we want to bound this probability by \delta. We set 2|H|e^{-\varepsilon^2m/2} \leq \delta, solve for m, and get

\displaystyle m \geq \frac{2}{\varepsilon^2}\left ( \log |H| + \log \frac{2}{\delta} \right )

This gives us a concrete quantification of the tradeoff between m, \varepsilon, \delta, and |H|. Indeed, if we pick m to be this large, then solving for \varepsilon / 2 gives the exact inequality from the theorem.

\square

Now we know that if we pick enough samples (polynomially many in all the parameters), and our algorithm can find a hypothesis h of minimal empirical error, then we get the following corollary:

Corollary: For any \varepsilon, \delta > 0, the algorithm that draws m \geq \frac{2}{\varepsilon^2}(\log |H| + \log(2/ \delta)) examples and finds any hypothesis of minimal empirical error will, with probability at least 1-\delta, produce a hypothesis whose true error is within \varepsilon of that of the best hypothesis in H.

Proof. By the previous theorem, with the desired probability, for all h \in H we have |\hat{\text{err}}(h) - \text{err}(h)| < \varepsilon/2. Let g be a hypothesis achieving the minimal true error \min_{h' \in H} \text{err}(h'). Because h has minimal empirical error, \hat{\text{err}}(h) \leq \hat{\text{err}}(g), and so \text{err}(h) < \hat{\text{err}}(h) + \varepsilon/2 \leq \hat{\text{err}}(g) + \varepsilon/2. Using the previous theorem once more on g, \hat{\text{err}}(g) < \text{err}(g) + \varepsilon/2, and so \text{err}(h) < \text{err}(g) + \varepsilon. In words, the true error of the algorithm’s hypothesis is within \varepsilon of the error of the best hypothesis in H, as desired.

\square
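
To get a feel for the numbers, here’s a small sketch (the particular values of |H|, \varepsilon, \delta are arbitrary) that evaluates both sample bounds from this post: the realizable bound \frac{1}{\varepsilon}(\log |H| + \log(1/\delta)) and the agnostic bound \frac{2}{\varepsilon^2}(\log |H| + \log(2/\delta)). The \varepsilon^2 in the denominator is the price of dropping realizability.

    import math

    def realizable_sample_bound(H_size, eps, delta):
        # m >= (1/eps) * (log|H| + log(1/delta)), from the Occam's Razor theorem.
        return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

    def agnostic_sample_bound(H_size, eps, delta):
        # m >= (2/eps^2) * (log|H| + log(2/delta)), from the corollary above.
        return math.ceil(2 * (math.log(H_size) + math.log(2 / delta)) / eps ** 2)

    H_size, eps, delta = 10 ** 6, 0.05, 0.01
    print(realizable_sample_bound(H_size, eps, delta))  # a few hundred samples
    print(agnostic_sample_bound(H_size, eps, delta))    # over ten thousand samples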

Next time

Both of these theorems tell us something about the generalization guarantees for learning with hypothesis classes of a certain size. But this isn’t exactly the most reasonable measure of the “complexity” of a family of hypotheses. For example, one could have a hypothesis class with a billion intervals on \mathbb{R} (say you’re trying to learn intervals, or thresholds, or something easy), and the guarantees we proved in this post are nowhere near optimal.

So the question is: say you have a potentially infinite class of hypotheses, but the hypotheses are all “simple” in some way. First, what is the right notion of simplicity? And second, how can you get guarantees based on that analogous to these? We’ll discuss this next time when we define the VC-dimension.

Until then!


Parameterizing the Vertex Cover Problem

I’m presenting a paper later this week at the Mathematical Foundations of Computer Science 2014 conference in Budapest, Hungary. This conference is an interesting mix of logic and algorithms that aims to bring together researchers from these areas to discuss their work. And right away the first session on the first day focused on an area I know is important but have little experience with: fixed parameter complexity. From what I understand it’s not that popular of a topic at major theory conferences in the US (there appears to be only one paper on it at this year’s FOCS conference), but the basic ideas are worth knowing.

The basic idea is pretty simple: some hard computational problems become easier (read, polynomial-time solvable) if you fix some parameters involved to constants. Preferably small constants. For example, finding cliques of size k in a graph is NP-hard if k is a parameter, but if you fix k to a constant then you can check all possible subsets of size k in O(n^k) time. This is kind of a silly example because there are much faster ways to find triangles than checking all O(n^3) subsets of vertices, but part of the point of fixed-parameter complexity is to find the fastest algorithms in these fixed-parameter settings. Since in practice parameters are often small [citation needed], this analysis can provide useful practical algorithmic alternatives to heuristics or approximate solutions.

One important tool in the theory of fixed-parameter tractability is the idea of a kernel. I think it’s an unfortunate term because it’s massively overloaded in mathematics, but the idea is to take a problem instance with the parameter k, and carve out “easy” regions of the instance (often reducing k as you go) until the runtime of the trivial brute force algorithm only depends on k and not on the size of the input. The point is that the solution you get on this “carved out” instance is either the same as the original, or can be extended back to the original with little extra work. There is a more formal definition we’ll state, but there is a canonical example that gives a great illustration.

Consider the vertex cover problem. That is, you give me a graph G = (V,E) and a number k and I have to determine if there is a subset of \leq k vertices of G that touch all of the edges in E. As with k-clique, one can solve this for any fixed k by checking all subsets of size k in O(n^k) time. The kernel approach we’ll show now is much smarter, and it is what actually establishes fixed-parameter tractability.

What you do is the following. As long as your graph has a vertex of degree > k, you remove it and reduce k by 1. This works because a vertex of degree > k must be in any vertex cover of size at most k. If it weren’t, then you’d need to include all of its neighbors to cover its edges, but there are > k neighbors and your vertex cover is constrained by size k. And so you can automatically put this high-degree vertex in your cover, and use induction on the smaller graph.

Once you can’t remove any more vertices there are two cases. In the case that there are more than k^2 edges, you output that there is no vertex cover of size at most k. Indeed, if you only get k vertices in your cover and you removed all vertices of degree > k, then each can cover at most k edges, giving a total of at most k^2. Otherwise, if there are at most k^2 edges, then you can remove all the isolated vertices and show that there are only \leq 2k^2 vertices left. This is because each edge touches only two vertices, so in the worst case they’re all distinct. This smaller subgraph is called a kernel of the vertex cover instance, and the fact that its size depends only on k is the key. So you can look at all 2^{2k^2} = O(1) subsets (a constant as far as n is concerned) to determine if there’s a cover of the size you want. If you find a cover of the kernel, you add back in all the high-degree vertices you deleted and you’re done.
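
Here is a rough Python sketch of the whole procedure, just to make the steps concrete. The representation (edges as frozensets of two vertices) and the function names are my own choices, and the final step is the naive brute force over the kernel rather than anything optimized.

    from itertools import combinations

    def vertex_cover_kernelized(vertices, edges, k):
        # Decide whether the graph has a vertex cover of size <= k; return one if so, else None.
        vertices, edges = set(vertices), set(edges)
        forced = set()  # high-degree vertices that must belong to any cover of size <= k

        # Rule 1: repeatedly remove any vertex of degree > k and decrement k.
        removed_something = True
        while removed_something and k > 0:
            removed_something = False
            degree = {v: 0 for v in vertices}
            for e in edges:
                for v in e:
                    degree[v] += 1
            for v in list(vertices):
                if degree[v] > k:
                    forced.add(v)
                    vertices.remove(v)
                    edges = {e for e in edges if v not in e}
                    k -= 1
                    removed_something = True
                    break

        # Rule 2: with maximum degree <= k, more than k^2 edges means "no".
        if len(edges) > k * k:
            return None

        # The kernel: only vertices that still touch an edge matter (at most 2k^2 of them).
        kernel = {v for e in edges for v in e}
        for size in range(k + 1):
            for subset in combinations(kernel, size):
                chosen = set(subset)
                if all(e & chosen for e in edges):
                    return forced | chosen
        return None

    # A star with center 'c' and five leaves has a vertex cover of size 1.
    edges = {frozenset(('c', i)) for i in range(5)}
    print(vertex_cover_kernelized({'c', 0, 1, 2, 3, 4}, edges, 1))  # {'c'}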

Now, even for small k this is a pretty bad algorithm (k=5 gives 2^{50} subsets to inspect), but with more detailed analysis you can do significantly better. In particular, the best known bound reduces vertex cover to a kernel of size 2k - c \log(k) vertices for any constant c you specify. Getting \log(k) vertices is known to imply P = NP, and with more detailed complexity assumptions it’s even hard to get a graph with fewer than O(k^{2-\varepsilon}) edges for any \varepsilon > 0. These are all relatively recent results whose associated papers I have not read.

Even with these hardness results, there are two reasons why this kind of analysis is useful. The first is that it gives us a clearer picture of the complexity of these problems. In particular, the reduction we showed for vertex cover gives an O(2^{2k^2} + n + m)-time algorithm, which you can then compare directly to the trivial O(n^k)-time brute force algorithm and measure the difference. Indeed, comparing exponents, the kernelized approach is faster roughly whenever 2k^2 < k \log n, i.e., whenever k < \frac{1}{2} \log n.

The second reason is that the kernel approach usually results in simple and quick checks for negative answers to a problem. In particular, if you want to check for vertex covers of size k in a graph in the real world, this analysis shows that the first thing you should do is check whether the kernel has more than k^2 edges. If so, you can immediately give a “no” answer. So useful kernels can provide insight into the structure of a problem that can be turned into heuristic tools even when it doesn’t help you solve the problem exactly.

So now let’s just see the prevailing definition of a “kernelization” of a problem. This comes from the text of Downey and Fellows.

Definition: A kernelization of a parameterized problem L (formally, a language where each string x is paired with a positive integer k) is a \textup{poly}(|x|, k)-time algorithm that converts instances (x,k) into instances (x', k') with the following three properties.

  • (x,k) is a yes instance of L if and only if (x', k') is.
  • |x'| \leq f(k) for some computable function f: \mathbb{N} \to \mathbb{N}.
  • k' \leq g(k) for some computable function g: \mathbb{N} \to \mathbb{N}.

The output (x', k') is called a kernel, and the problem is said to admit a polynomial kernel if f(k) = O(k^c) for some constant c.

So we showed that vertex cover admits a polynomial kernel (in fact, a quadratic one).

Now the nice theorem is that a (decidable) problem is fixed-parameter tractable if and only if it admits a kernelization, though the kernel it produces need not be of polynomial size. Finding a kernel is conceptually easier because, like in vertex cover, it allows you to introduce additional assumptions on the structure of the instances you’re working with. But more importantly from a theoretical standpoint, measuring the size and complexity of kernels for NP-hard problems gives us a way to discriminate among problems within NP. That and the chance to get some more practical tools for NP-hard problems makes parameterized complexity more interesting than it sounds at first.

Until next time!

An Update on “Coloring Resilient Graphs”

A while back I announced a preprint of a paper on coloring graphs with certain resilience properties. I’m pleased to announce that it’s been accepted to the Mathematical Foundations of Computer Science 2014, which is being held in Budapest this year. Since we first published the preprint we’ve actually proved some additional results about resilience, and so I’ll expand some of the details here. I think it makes for a nicer overall picture, and in my opinion it gives a little more justification that resilient coloring is interesting, at least in contrast to other resilience problems.

Resilient SAT

Recall that a “resilient” yes-instance of a combinatorial problem is one which remains a yes-instance when you add or remove some constraints. The way we formalized this for SAT was by fixing variables to arbitrary values. Then the question is how resilient does an instance need to be in order to actually find a certificate for it? In more detail,

Definition: r-resilient k-SAT formulas are satisfiable formulas in k-CNF form (conjunctions of clauses, where each clause is a disjunction of k literals) such that for all choices of r variables, every way to fix those variables yields a satisfiable formula.

For example, the following 3-CNF formula is 1-resilient:

\displaystyle (a \vee b \vee c) \wedge (a \vee \overline{b} \vee \overline{c}) \wedge (\overline{a} \vee \overline{b} \vee c)

The idea is that resilience may impose enough structure on a SAT formula that it becomes easy to tell if it’s satisfiable at all. Unfortunately for SAT (though this is definitely not the case for coloring), there are only two possibilities. Either the instances are so resilient that they never existed in the first place (they’re vacuously trivial), or the instances are NP-hard. The first case is easy: there are no k-resilient k-SAT formulas. Indeed, if you’re allowed to fix k variables to arbitrary values, then you can just pick a clause and fix its variables so that every literal in it is false. So no formula can ever remain satisfiable under that condition.

The second case is when the resilience is strictly less than the clause size, i.e. r-resilient k-SAT for 0 \leq r < k. In this case the problem of finding a satisfying assignment is NP-hard. We’ll show this via a sequence of reductions which start at 3-SAT, and they’ll involve two steps: increasing the clause size and resilience, and decreasing the clause size and resilience. The trick is in balancing which parts are increased and decreased. I call the first step the “blowing up” lemma, and the second part the “shrinking down” lemma.

Blowing Up and Shrinking Down

Here’s the intuition behind the blowing up lemma. If you give me a regular (unresilient) 3-SAT formula \varphi, what I can do is make a copy of \varphi with a new set of variables and OR the two things together. Call this \varphi^1 \vee \varphi^2. This is satisfiable exactly when the original formula is: if you give me a satisfying assignment for the ORed thing, I can just see which of the two copies is satisfied and use that sub-assignment for \varphi, and conversely if you can satisfy \varphi it doesn’t matter what truth values you choose for the new set of variables. And further you can transform the ORed formula into a 6-SAT formula in polynomial time, just by distributing the ORs across the ANDs.

Now the choice of a new set of variables allows us to get some resilience. If you fix one variable to the value of your choice, I can always just work with the other set of variables. Your manipulation doesn’t change the satisfiability of the ORed formula, because I’ve added all of this redundancy. So we took a 3-SAT formula and turned it into a 1-resilient 6-SAT formula.

The idea generalizes to the blowing up lemma, which says that you can measure the effects of a blowup no matter what you start with. More formally, if s is the number of copies of the variables you make, k is the clause size of the starting formula \varphi, and r is the resilience of \varphi, then blowing up gives you an [(r+1)s - 1]-resilient (sk)-SAT formula. The argument is almost identical to the example above; the resilience claim is just more general. Specifically, if you fix fewer than (r+1)s variables, then the pigeonhole principle guarantees that one of the s copies of the variables has at most r fixed values, and we can just work with that copy (i.e., this small part of the big ORed formula is satisfiable if \varphi was r-resilient).
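
Here’s a sketch of the blow-up construction, in case the index juggling is hard to picture. A formula is a list of clauses, each clause a list of nonzero integers (a positive integer for a variable, a negative one for its negation); each copy gets a fresh block of variables, and the OR of the copies is put back into CNF by the distributive law. This is my own illustration of the construction, not code from the paper.

    from itertools import product

    def blow_up(clauses, num_vars, s):
        # Build s copies of the formula on disjoint variable blocks, then distribute
        # the OR of the copies over their ANDs: the resulting CNF has one clause for
        # every way of choosing one clause from each copy.
        copies = []
        for j in range(s):
            offset = j * num_vars
            copies.append([[lit + offset if lit > 0 else lit - offset for lit in clause]
                           for clause in clauses])
        return [[lit for clause in choice for lit in clause] for choice in product(*copies)]

    # A 3-SAT formula with 3 variables and 2 clauses becomes, with s = 2,
    # a 6-SAT formula over 6 variables with 2^2 = 4 clauses.
    phi = [[1, 2, 3], [-1, -2, 3]]
    print(blow_up(phi, num_vars=3, s=2))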

The shrinking down lemma is another trick that is similar to the reduction from k-SAT to 3-SAT. There you take a clause like v \vee w \vee x \vee y \vee z and add new variables z_i to break up the clause into clauses of size 3 as follows:

\displaystyle (v \vee w \vee z_1) \wedge (\neg z_1 \vee x \vee z_2) \wedge (\neg z_2 \vee y \vee z)

These are equivalent because your choice of truth values for the z_i tells me in which of these sub-clauses I have to find a true literal among the old variables. E.g., if you choose z_1 = T, z_2 = F then the middle clause forces x to be true, and if z_1 = z_2 = T then you have to pick either y or z to be true. And it’s clear that if you’re willing to double the number of variables (a linear blowup) you can always get a k-clause down to an AND of 3-clauses.

So the shrinking down reduction does the same thing, except we only split clauses in half. For a clause C, call C[:k/2] the first half of a clause and C[k/2:] the second half (you can see how my Python training corrupts my notation preference). Then to shrink a clause C_i down from size k to size \lceil k/2 \rceil + 1 (1 for the new variable), add a variable z_i and break C_i into

\displaystyle (C_i[:k/2] \vee z_i) \wedge (\neg z_i \vee C_i[k/2:])

and just AND these together for all clauses. Call the original formula \varphi and the transformed one \psi. The formulas are logically equivalent for the same reason that the k-to-3-SAT reduction works, and it’s already in the right CNF form. So resilience is all we have to measure. The claim is that the resilience is q = \min(r, \lfloor k/2 \rfloor), where r is the resilience of \varphi.

The reason for this is that if all the fixed variables are old variables (not z_i), then nothing changes and the resilience of the original \varphi keeps us safe. And each z_i we fix has no effect except to force us to satisfy a variable in one of the two halves. So there is this implication that if you fix a z_i you have to also fix a regular variable. Because we can’t guarantee anything if we fix more than r regular variables, we’d have to stop before fixing more than r of the z_i. And because these new clauses have size \lceil k/2 \rceil + 1, we can’t do this more than k/2 times or else we risk ruining an entire clause. This gives the definition of q, and it proves the shrinking down lemma.
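
And here’s the matching sketch of the shrinking-down step, in the same clause-list representation as the blow-up sketch above (again just an illustration of the construction, assuming every clause has the same size k):

    def shrink_down(clauses, num_vars):
        # Split each clause C into (first half OR z) AND (NOT z OR second half),
        # introducing one fresh variable z per clause.
        next_var = num_vars + 1
        shrunk = []
        for clause in clauses:
            half = len(clause) // 2
            z = next_var
            next_var += 1
            shrunk.append(clause[:half] + [z])
            shrunk.append([-z] + clause[half:])
        return shrunk

    # A single 6-SAT clause becomes two clauses of size 4 = 6/2 + 1.
    print(shrink_down([[1, 2, 3, 4, 5, 6]], num_vars=6))  # [[1, 2, 3, 7], [-7, 4, 5, 6]]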

Resilient SAT is always hard

The blowing up and shrinking down lemmas can be used to show that r-resilient k-SAT is NP-hard for all r < k. What we do is reduce from 3-SAT to an r-resilient k-SAT instance in such a way that the 3-SAT formula is satisfiable if and only if the transformed formula is resiliently satisfiable.

What makes these two lemmas work together is that shrinking down shrinks the clause size just barely less than the resilience, and blowing up increases resilience just barely more than it increases clause size. So we can combine these together to climb from 3-SAT up to some high resilience and clause size, and then iteratively shrink down until we hit our target.

One might worry that it will take an exponential number of reductions (or a few reductions of exponential size) to get from 3-SAT to the (r,k) of our choice, but we have a construction that does it in at most four steps, with only a linear initial blowup from 3-SAT to r-resilient 3(r+1)-SAT. Then, to deal with the odd ceilings and floors in the shrinking down lemma, you have to find a suitable larger k to reduce to (by padding with useless variables, which cannot make the problem easier). And you choose this k so that you only need at most two applications of shrinking down to get to (k-1)-resilient k-SAT. Our preprint has the gory details (including an inelegant part that is not worth writing here), but in the end you show that (k-1)-resilient k-SAT is hard, and since that’s the maximal amount of resilience before the problem becomes vacuously trivial, all smaller resilience values are also hard.

So how does this relate to coloring?

I’m happy about this result not just because it answers an open question I’m honestly curious about, but also because it shows that resilient coloring is more interesting. Basically this proves that satisfiability is so hard that no amount of resilience can make it easier in the worst case. But coloring has a gradient of difficulty. Once you get to order k^2 resilience for k-colorable graphs, the coloring problem can be solved efficiently by a greedy algorithm (and it’s not a vacuously empty class of graphs). Another thing on the side is that we use the hardness of resilient SAT to get the hardness results we have for coloring.

If you really want to stretch the implications, you might argue that this says something like “coloring is somewhat easier than SAT,” because we found a quantifiable axis along which SAT remains difficult while coloring crumbles. The caveat is that fixing colors of vertices is not exactly comparable to fixing values of truth assignments (since we are fixing lots of instances by fixing a variable), but at least it’s something concrete.

Coloring is still mostly open, and recently I’ve been going to talks where people are discussing startlingly similar ideas for things like Hamiltonian cycles. So that makes me happy.

Until next time!

A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)

The wedge \wedge denotes a logical AND and the vee \vee denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an x_i or the negation \overline{x_i}, and connectives \wedge and \vee, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form C_1 \vee C_2 \vee C_3 where each C_i is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the \vee‘s in a 3-term DNF formula using the distributive law, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph G, we’ll construct a set S_G of labeled truth assignments, and we’ll define the distribution D to be the uniform distribution over those truth assignments used in S_G. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in S_G exactly how we labeled them, and we set the allowed error \varepsilon to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of G). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of G).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph G with n nodes v_1, \dots, v_n and a set of m undirected edges E, we construct a set of examples with positive labels S^+ and one with negative labels S^-. The examples are truth assignments to n variables, which we label x_1, \dots, x_n, and we identify a truth assignment with the \left \{ 0,1 \right \}-valued vector (x_1, x_2, \dots, x_n) in the usual way (true is 1, false is 0).

The positive examples S^+ are simple: for each v_i add the truth assignment with x_i = F and x_j = T for all j \neq i. I.e., the binary vector is (1, \dots, 1, 0, 1, \dots, 1), where the zero is in the i-th position.

The negative examples S^- come from the edges. For each edge (v_i, v_j) \in E, we add the example with a zero in the i-th and j-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

[Figure: an example graph alongside its corresponding positive and negative examples.]
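
In lieu of the figure, here’s a short sketch of the construction in Python (vertices numbered from 0, so the positions are shifted down by one relative to the notation above):

    def graph_to_examples(n, edges):
        # n vertices v_0, ..., v_{n-1}; edges is a list of pairs (i, j).
        # Positive examples: for each vertex, a 0 in its position and 1s elsewhere.
        # Negative examples: for each edge, 0s at both endpoints and 1s elsewhere.
        positives = []
        for i in range(n):
            x = [1] * n
            x[i] = 0
            positives.append(tuple(x))
        negatives = []
        for (i, j) in edges:
            x = [1] * n
            x[i] = 0
            x[j] = 0
            negatives.append(tuple(x))
        return positives, negatives

    # The triangle on vertices 0, 1, 2:
    S_plus, S_minus = graph_to_examples(3, [(0, 1), (1, 2), (0, 2)])
    # S_plus  = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
    # S_minus = [(0, 0, 1), (1, 0, 0), (0, 1, 0)]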

Claim: G is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula \varphi.

Again, consistent just means that \varphi is satisfied by every truth assignment in S^+ and unsatisfied by every example in S^-. Since we chose our distribution to be uniform over S^+ \cup S^-, we don’t care what \varphi does elsewhere.

Indeed, if G is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let T_R be the AND of all the literals x_i for which vertex v_i is not red. For each such i, the corresponding example in S^+ will satisfy T_R, because we put a zero in the i-th position and ones everywhere else. Similarly, no example in S^- will make T_R true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is (v_1,v_2). Then the corresponding negative example is (0,0,1). Unless both v_1 and v_2 are colored red, one of x_1, x_2 will have to be ANDed as part of T_R. But the example has a zero for both x_1 and x_2, so T_R would not be satisfied.

Do the same thing for blue and yellow, and OR the three terms together to get T_R \vee T_B \vee T_Y. Since the argument is symmetric for the other colors, this gives a consistent 3-term DNF.
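
To see this direction concretely, here’s a sketch that builds the three terms from a coloring and checks consistency against the example sets (same representation as the sketch above; a term is just the set of indices whose variables it ANDs, since no negations are needed in this direction):

    def terms_from_coloring(coloring):
        # coloring[i] is the color of vertex i; the term for a color ANDs together
        # x_i for every vertex i NOT of that color.
        return {c: [i for i, ci in enumerate(coloring) if ci != c] for c in set(coloring)}

    def dnf_value(terms, example):
        # The 3-term DNF is the OR of the terms; a term is true when all its variables are 1.
        return any(all(example[i] == 1 for i in term) for term in terms.values())

    # The triangle again, colored red, blue, yellow (a valid 3-coloring).
    S_plus = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
    S_minus = [(0, 0, 1), (1, 0, 0), (0, 1, 0)]
    terms = terms_from_coloring(['red', 'blue', 'yellow'])
    print(all(dnf_value(terms, x) for x in S_plus))       # True
    print(not any(dnf_value(terms, x) for x in S_minus))  # True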

On the other hand, say there is a consistent 3-term DNF \varphi. We need to construct a three coloring of G. It goes in largely the same way: label the terms \varphi = T_R \vee T_B \vee T_Y for Red, Blue, and Yellow, and then color a vertex v_i the color of a term that is satisfied by the corresponding example in S^+. There must be some term that does this because \varphi is consistent with S^+, and if there are multiple you can pick one arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge (v_i, v_j), and both v_i and v_j are colored, say, blue. Look at the term T_B: since v_i and v_j are both blue, the positive examples corresponding to those vertices (with a 0 in the corresponding position and 1’s everywhere else) both make T_B true. Since those two positive examples differ in both their i-th and j-th positions, T_B can’t contain any of the literals x_i, \overline{x_i}, x_j, \overline{x_j} (and it can’t contain \overline{x_l} for any other l, since both of these positive examples set x_l = 1). But then the negative example for the edge would satisfy T_B, because it has 1’s everywhere except positions i and j! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph G; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick \delta < 1/2 and \varepsilon \leq 1/(2|S^+ \cup S^-|) in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least 1 / |S^+ \cup S^-|, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals: introduce one new variable for each of the polynomially many possible clauses of three literals, and a 3-CNF over the old variables becomes an AND of literals over the new ones. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.
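
For the curious, here’s a sketch of that conjunction learner (the standard “start with every literal and cross off whatever the positive examples contradict” routine; the PAC analysis is the exercise mentioned above):

    def learn_conjunction(examples, n):
        # examples: list of (x, label) where x is a tuple of n bits.
        # A literal is a pair (i, sign): sign True means x_i, sign False means NOT x_i.
        # Start with all 2n literals; every positive example deletes the literals it falsifies.
        literals = {(i, s) for i in range(n) for s in (True, False)}
        for x, label in examples:
            if label:
                literals = {(i, s) for (i, s) in literals if bool(x[i]) == s}
        return lambda x: all(bool(x[i]) == s for (i, s) in literals)

    # Toy target: x_0 AND NOT x_2 over three variables.
    data = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False), ((1, 1, 1), False)]
    h = learn_conjunction(data, 3)
    print(h((1, 0, 0)), h((0, 0, 0)))  # True False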

The important thing to note is that a concept class C (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within C. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning C suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class \mathsf{C} over a set X is efficiently PAC-learnable using the hypothesis class \mathsf{H} if there exists an algorithm A(\varepsilon, \delta) with access to a query function for \mathsf{C} and runtime O(\text{poly}(1/\varepsilon, 1/\delta)), such that for all c \in \mathsf{C}, all distributions D over X, and all 0 < \delta , \varepsilon < 1/2, the probability that A produces a hypothesis h \in \mathsf{H} with error at most \varepsilon is at least 1-\delta.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!