Occam’s Razor and PAC-learning

So far our discussion of learning theory has been seeing the definition of PAC-learningtinkering with it, and seeing simple examples of learnable concept classes. We’ve said that our real interest is in proving big theorems about what big classes of problems can and can’t be learned. One major tool for doing this with PAC is the concept of VC-dimension, but to set the stage we’re going to prove a simpler theorem that gives a nice picture of PAC-learning when your hypothesis class is small. In short, the theorem we’ll prove says that if you have a finite set of hypotheses to work with, and you can always find a hypothesis that’s consistent with the data you’ve seen, then you can learn efficiently. It’s obvious, but we want to quantify exactly how much data you need to ensure low error. This will also give us some concrete mathematical justification for philosophical claims about simplicity, and the theorems won’t change much when we generalize to VC-dimension in a future post.

The Chernoff bound

One tool we will need in this post, which shows up all across learning theory, is the Chernoff-Hoeffding bound. We covered this famous inequality in detail previously on this blog, but the part of that post we need is the following theorem that says, informally, that if you average a bunch of bounded random variables, then the probability this average random variable deviates from its expectation is exponentially small in the amount of deviation. Here’s the slightly simplified version we’ll use:

Theorem: Let $X_1, \dots, X_m$ be independent random variables whose values are in the range $[0,1]$. Call $\mu_i = \mathbf{E}[X_i]$, $X = \sum_i X_i$, and $\mu = \mathbf{E}[X] = \sum_i \mu_i$. Then for all $t > 0$,

$\displaystyle \Pr(|X-\mu| > t) \leq 2e^{-2t^2 / m}$

One nice thing about the Chernoff bound is that it doesn’t matter how the variables are distributed. This is important because in PAC we need guarantees that hold for any distribution generating data. Indeed, in our case the random variables above will be individual examples drawn from the distribution generating the data. We’ll be estimating the probability that our hypothesis has error deviating more than $\varepsilon$, and we’ll want to bound this by $\delta$, as in the definition of PAC-learning. Since the amount of deviation (error) and the number of samples ($m$) both occur in the exponent, the trick is in balancing the two values to get what we want.

Realizability and finite hypothesis classes

Let’s recall the PAC model once more. We have a distribution $D$ generating labeled examples $(x, c(x))$, where $c$ is an unknown function coming from some concept class $C$. Our algorithm can draw a polynomial number of these examples, and it must produce a hypothesis $h$ from some hypothesis class $H$ (which may or may not contain $c$). The guarantee we need is that, for any $\delta, \varepsilon > 0$, the algorithm produces a hypothesis whose error on $D$ is at most $\varepsilon$, and this event happens with probability at least $1-\delta$. All of these probabilities are taken over the randomness in the algorithm’s choices and the distribution $D$, and it has to work no matter what the distribution $D$ is.

Let’s introduce some simplifications. First, we’ll assume that the hypothesis and concept classes $H$ and $C$ are finite. Second, we’ll assume that $C \subset H$, so that you can actually hope to find a hypothesis of zero error. This is called realizability. Later we’ll relax these first two assumptions, but they make the analysis a bit cleaner. Finally, we’ll assume that we have an algorithm which, when given labeled examples, can find in polynomial time a hypothesis $h \in H$ that is consistent with every example.

These assumptions give a trivial learning algorithm: draw a bunch of examples and output any consistent hypothesis. The question is, how many examples do we need to guarantee that the hypothesis we find has the prescribed generalization error? It will certainly grow with $1 / \varepsilon$, but we need to ensure it will only grow polynomially fast in this parameter. Indeed, realizability is such a strong assumption that we can prove a polynomial bound using even more basic probability theory than the Chernoff bound.

Theorem: A algorithm that efficiently finds a consistent hypothesis will PAC-learn any finite concept class provided it has at least $m$ samples, where

$\displaystyle m \geq \frac{1}{\varepsilon} \left ( \log |H| + \log \left ( \frac{1}{\delta} \right ) \right )$

Proof. All we need to do is bound the probability that a bad hypothesis (one with error more than $\varepsilon$) is consistent with the given data. Now fix $D, c, \delta, \varepsilon$, and draw $m$ examples and let $h$ be any hypothesis that is consistent with the drawn examples. Suppose that the bad thing happens, that $\Pr_D(h(x) \neq c(x)) > \varepsilon$.

Because the examples are all drawn independently from $D$, the chance that all $m$ examples are consistent with $h$ is

$\displaystyle (1 - \Pr_{x \sim D}(h(x) \neq c(x)))^m < (1 - \varepsilon)^m$

What we’re saying here is, the probability that a specific bad hypothesis is actually consistent with your drawn examples is exponentially small in the error tolerance. So if we apply the union bound, the probability that some hypothesis you could produce is bad is at most $(1 - \varepsilon)^m S$, where $S$ is the number of hypotheses the algorithm might produce.

A crude upper bound on the number of hypotheses you could produce is just the total number of hypotheses, $|H|$. Even cruder, let’s use the inequality $(1 - x) < e^{-x}$ to give the bound

$\displaystyle (1 - \varepsilon)^m |H| < e^{-\varepsilon m} |H|$

Now we want to make sure that this probability, the probability of choosing a high-error (yet consistent) hypothesis, is at most $\delta$. So we can set the above quantity less than $\delta$ and solve for $m$:

$\displaystyle e^{-\varepsilon m} |H| \leq \delta$

Taking logs and solving for $m$ gives the desired bound.

$\square$

An obvious objection is: what if you aren’t working with a hypothesis class where you can guarantee that you’ll find a consistent hypothesis? Well, in that case we’ll need to inspect the definition of PAC again and reevaluate our measures of error. It turns out we’ll get a similar theorem as above, but with the stipulation that we’re only achieving error within epsilon of the error of the best available hypothesis.

But before we go on, this theorem has some deep philosophical interpretations. In particular, suppose that, before drawing your data, you could choose to work with one of two finite hypothesis classes $H_1, H_2$, with $|H_1| > |H_2|$. If you can find a consistent hypothesis no matter which hypothesis class you use, then this theorem says that your generalization guarantees are much stronger if you start with the smaller hypothesis class.

In other words, all else being equal, the smaller set of hypotheses is better. For this reason, the theorem is sometimes called the “Occam’s Razor” theorem. We’ll see a generalization of this theorem in the next section.

Unrealizability and an extra epsilon

Now suppose that $H$ doesn’t contain any hypotheses with error less than $\varepsilon$. What can we hope to do in this case? One thing is that we can hope to find a hypothesis whose error is within $\varepsilon$ of the minimal error of any hypothesis in $H$. Moreover, we might not have any consistent hypotheses for some data samples! So rather than require an algorithm to produce an $h \in H$ that is perfectly consistent with the data, we just need it to produce a hypothesis that has minimal empirical error, in the sense that it is as close to consistent as the best hypothesis of $h$ on the data you happened to draw. It seems like such a strategy would find you a hypothesis that’s close to the best one in $H$, but we need to prove it and determine how many samples we need to draw to succeed.

So let’s make some definitions to codify this. For a given hypothesis, call $\textup{err}(h)$ the true error of $h$ on the distribution $D$. Our assumption is that there may be no hypotheses in $H$ with $\textup{err}(h) = 0$. Next we’ll call the empirical error $\hat{\textup{err}}(h)$.

Definition: We say a concept class $C$ is agnostically learnable using the hypothesis class $H$ if for all $c \in C$ and all distributions $D$ (and all $\varepsilon, \delta > 0$), there is a learning algorithm $A$ which produces a hypothesis $h$ that with probability at least $1 - \delta$ satisfies

$\displaystyle \text{err}(h) \leq \min_{h' \in H} \text{err}(h') + \varepsilon$

and everything runs in the same sort of polynomial time as for vanilla PAC-learning. This is called the agnostic setting or the unrealizable setting, in the sense that we may not be able to find a hypothesis with perfect empirical error.

We seek to prove that all concept classes are agnostically learnable with a finite hypothesis class, provided you have an algorithm that can minimize empirical error. But actually we’ll prove something stronger.

Theorem: Let $H$ be a finite hypothesis class and $m$ the number of samples drawn. Then for any $\delta > 0$, with probability $1-\delta$ the following holds:

$\displaystyle \forall h \in H, \hat{\text{err}}(h) \leq \text{err}(h) + \sqrt{\frac{\log |H| + \log(2 / \delta)}{2m}}$

In other words, we can precisely quantify how the empirical error converges to the true error as the number of samples grows. But this holds for all hypotheses in $H$, so this provides a uniform bound of the difference between true and empirical error for the entire hypothesis class.

Proving this requires the Chernoff bound. Fix a single hypothesis $h \in H$. If you draw an example $x$, call $Z$ the random variable which is 1 when $h(x) \neq c(x)$, and 0 otherwise. So if you draw $m$ samples and call the $i$-th variable $Z_i$, the empirical error of the hypothesis is $\frac{1}{m}\sum_i Z_i$. Moreover, the actual error is the expectation of this random variable since $\mathbf{E}[1/m \sum_i Z_i] = Z$.

So what we’re asking is the probability that the empirical error deviates from the true error by a lot. Let’s call “a lot” some parameter $\varepsilon/2 > 0$ (the reason for dividing by two will become clear in the corollary to the theorem). Then plugging things into the Chernoff-Hoeffding bound gives a bound on the probability of the “bad event,” that the empirical error deviates too much.

$\displaystyle \Pr[|\hat{\text{err}}(h) - \text{err}(h)| > \varepsilon / 2] < 2e^{-\frac{\varepsilon^2m}{2}}$

Now to get a bound on the probability that some hypothesis is bad, we apply the union bound and use the fact that $|H|$ is finite to get

$\displaystyle \Pr[|\hat{\text{err}}(h) - \text{err}(h)| > \varepsilon / 2] < 2|H|e^{-\frac{\varepsilon^2m}{2}}$

Now say we want to bound this probability by $\delta$. We set $2|H|e^{-\varepsilon^2m/2} \leq \delta$, solve for $m$, and get

$\displaystyle m \geq \frac{2}{\varepsilon^2}\left ( \log |H| + \log \frac{2}{\delta} \right )$

This gives us a concrete quantification of the tradeoff between $m, \varepsilon, \delta,$ and $|H|$. Indeed, if we pick $m$ to be this large, then solving for $\varepsilon / 2$ gives the exact inequality from the theorem.

$\square$

Now we know that if we pick enough samples (polynomially many in all the parameters), and our algorithm can find a hypothesis $h$ of minimal empirical error, then we get the following corollary:

Corollary: For any $\varepsilon, \delta > 0$, the algorithm that draws $m \geq \frac{2}{\varepsilon^2}(\log |H| + \log(2/ \delta))$ examples and finds any hypothesis of minimal empirical error will, with probability at least $1-\delta$, produce a hypothesis that is within $\varepsilon$ of the best hypothesis in $H$.

Proof. By the previous theorem, with the desired probability, for all $h \in H$ we have $|\hat{\text{err}}(h) - \text{err}(h)| < \varepsilon/2$. Call $g = \min_{h' \in H} \text{err}(h')$. Then because the empirical error of $h$ is also minimal, we have $|\hat{\text{err}}(g) - \text{err}(h)| < \varepsilon / 2$. And using the previous theorem again and the triangle inequality, we get $|\text{err}(g) - \text{err}(h)| < 2 \varepsilon / 2 = \varepsilon$. In words, the true error of the algorithm’s hypothesis is close to the error of the best hypothesis, as desired.

$\square$

Next time

Both of these theorems tell us something about the generalization guarantees for learning with hypothesis classes of a certain size. But this isn’t exactly the most reasonable measure of the “complexity” of a family of hypotheses. For example, one could have a hypothesis class with a billion intervals on $\mathbb{R}$ (say you’re trying to learn intervals, or thresholds, or something easy), and the guarantees we proved in this post are nowhere near optimal.

So the question is: say you have a potentially infinite class of hypotheses, but the hypotheses are all “simple” in some way. First, what is the right notion of simplicity? And second, how can you get guarantees based on that analogous to these? We’ll discuss this next time when we define the VC-dimension.

Until then!

A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $\wedge$ denotes a logical AND and the vee $\vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $x_i$ or the negation $\overline{x_i}$, and connectives $\wedge$ and $\vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $\vee$‘s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $G$, we’ll construct a set $S_G$ of labeled truth assignments, and we’ll define the distribution $D$ to be the uniform distribution over those truth assignments used in $S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $S_G$ exactly how we labeled them, and we set the allowed error $\varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $G$ with $n$ nodes $v_1, \dots, v_n$ and a set of $m$ undirected edges $E$, we construct a set of examples with positive labels $S^+$ and one with negative examples $S^-$. The examples are truth assignments to $n$ variables, which we label $x_1, \dots, x_n$, and we identify a truth assignment to the $\left \{ 0,1 \right \}$-valued vector $(x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $S^+$ are simple: for each $v_i$ add a truth assignment $x_i = T, x_j = F$ for $j \neq i$. I.e., the binary vector is $(1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $i$-th position.

The negative examples $S^-$ come from the edges. For each edge $(v_i, v_j) \in E$, we add the example with a zero in the $i$-th and $j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

Claim: $G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $\varphi$.

Again, consistent just means that $\varphi$ is satisfied by every truth assignment in $S^+$ and unsatisfied by every example in $S^-$. Since we chose our distribution to be uniform over $S^+ \cup S^-$, we don’t care what $\varphi$ does elsewhere.

Indeed, if $G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $T_R$ be the AND of all the literals $x_i$ for which vertex $v_i$ is not red. For each such $i$, the corresponding example in $S^+$ will satisfy $T_R$, because we put a zero in the $i$-th position and ones everywhere else. Similarly, no example in $S^-$ will make $T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $(v_1,v_2)$. Then the corresponding negative example is $(0,0,1)$. Unless both $v_1$ and $v_2$ are colored red, one of $x_1, x_2$ will have to be ANDed as part of $T_R$. But the example has a zero for both $x_1$ and $x_2$, so $T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $\varphi$. We need to construct a three coloring of $G$. It goes in largely the same way: label the clauses $\varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $v_i$ the color of the clause that is satisfied by the corresponding example in $S^+$. There must be some clause that does this because $\varphi$ is consistent with $S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $(v_i, v_j)$, and both $v_i$ and $v_j$ are colored, say, blue. Look at the clause $T_B$: since $v_i$ and $v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $T_B$ true. Since those two positive examples differ in both their $i$-th and $j$-th positions, $T_B$ can’t have any of the literals $x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $T_B$ because it has 1’s everywhere except $i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $\delta < 1/2$ and $\varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $1 / |S^+ \cup S^-|$, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $\mathsf{C}$ over a set $X$ is efficiently PAC-learnable using the hypothesis class $\mathsf{H}$ if there exists an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all $0 < \delta , \varepsilon < 1/2$, the probability that $A$ produces a hypothesis $h \in \mathsf{H}$ with error at most $\varepsilon$ is at least $1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

Probably Approximately Correct — a Formal Theory of Learning

In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.

But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.

Leslie Valiant

Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called computational learning theory and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.

If you’re interested in following along with a book, the best introduction to the subject is the first few chapters of An Introduction to Computational Learning Theory.

So let’s jump right in and see what this award-winning definition is all about.

Learning Intervals

The core idea of PAC-learnability is easy to understand, and we’ll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers $x$ at random in some fixed way, and in Player 1’s mind he has an interval $[a,b]$. Whenever Player 1 gives out an $x$, he must also say whether it’s in the interval (that is, whether $a \leq x \leq b$). Let’s say that Player 1 gives reports a 1 if $x$ is in the interval, and a 0 otherwise. We’ll call this number the label of $x$, and call the pair of ($x$, label) a sample, or an example. We recognize that the zero and one correspond to “yes” and “no” answers to some question (Is this email spam? Does the user click on my ad? etc.), and so sometimes the labels are instead $\pm 1$, and referred to as “positive” or “negative” examples. We’ll use the positive/negative terminology here, so positive is a 1 and negative is a 0.

Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine $a$ and $b$. Of course Player 2 can’t guess the interval exactly if the endpoints are real numbers, because Player 1 only gives out finitely many samples. But whatever interval Player 2 does guess at the end can be tested against Player 1’s number-producing scheme. That is, we can compute the probability that Player 2’s interval will give an incorrect label if Player 1 were to continue giving out numbers indefinitely. If this error is small (taking into account how many samples were given), then Player 2 has “learned” the interval. And if Player 2 plays this game over and over and usually wins (no matter what strategy or interval Player 1 decides to use!), then we say this problem is PAC-learnable.

PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval with high probability

Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to prove it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.

Distributions and Hypotheses

First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set $X$. Could be finite, could be infinite, no restrictions. And we’re getting samples randomly generated from $X$ according to some fixed but arbitrary and unknown distribution $D$. To be completely rigorous, the samples are independent and identically distributed (they’re all drawn from the same $D$ and independently so). This is Player 1’s dastardly decision: how should he pick his method to generate random numbers so as to bring Player 2’s algorithm to the most devastating ruin?

So then Player 1 is reduced to a choice of distribution $D$ over $X$, and since we said that Player 2’s algorithm has to do well with high probability no matter what $D$ is, then the definition becomes something like this:

A problem is PAC-learnable if there is an algorithm $A$ which will likely win the game for all distributions $D$ over $X$.

Now we have to talk about how “intervals” fit in to the general picture. Because if we’re going to talk about learning in general, we won’t always be working with intervals to make decisions. So we’re really saying that Player 1 picks some function $c$ for classifying points in $X$ as a 0 or a 1. We’ll call this a concept, or a target, and it’s the thing Player 2 is trying to learn. That is, Player 2 is producing her own function $h$ that also labels points in $X$, and we’re comparing it to $c$. We call a function generated by Player 2 a hypothesis (hence the use of the letter h).

And how can we compare them? Well, we can compute the probability that they differ. We call this the error:

$\displaystyle \textup{err}_{c,D}(h) = \textup{P}_D(h(x) \neq c(x))$

One would say this aloud: “The error of the hypothesis $h$ with respect to the concept $c$ and the distribution $D$ is the probability over $x$ drawn via $D$ that $h(x)$ and $c(x)$ differ”. Some might write the “differ” part as the symmetric difference of the two functions as sets. And then it becomes a probability density, if that’s your cup of tea (it’s not mine).

So now for a problem to be PAC-learnable we can say something like,

A problem is PAC-learnable if there is an algorithm $A$ which for any distribution $D$ and any concept $c$ will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.

There are still a few untrimmed hedges in this definition (like “some,” “small,” and “high”), but there’s still a more important problem: there’s just too many possible concepts! Even for finite sets: there are $2^n$ $\left \{ 0,1 \right \}-$valued functions on a set of $n$ elements, and if we hope to run in polynomial time we can only possible express a miniscule fraction of those functions. Going back to the interval game, it’d be totally unreasonable to expect Player 2 to be able to get a reasonable hypothesis (using intervals or not!) if Player 1’s chosen concept is arbitrary. (The mathematician in me is imaging some crazy rule using non-measurable sets, but just suffice it to say: you might think you know everything about the real numbers, but you don’t.)

So we need to boil down what possibilities there are for the concepts $c$ and the allowed expressive power of the learner. This is what concept classes are for.

Concept Classes

concept class $\mathsf{C}$ over $X$ is a family of functions $X \to \left \{ 0,1 \right \}$. That’s all.

No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking $\mathsf{C}$ to be the set of all characteristic functions of intervals, $\chi_{[a,b]}(x) = 1$ if $a \leq x \leq b$ and 0 otherwise. Now the concept class becomes the sole focus of our algorithm. That is, the algorithm may use knowledge of the concept class to produce its hypotheses. So our working definition becomes:

A concept class $\mathsf{C}$ is PAC-learnable if there is an algorithm $A$ which, for any distribution $D$ of samples and any concept $c \in \mathsf{C}$, will with high probability produce a hypothesis $h \in \mathsf{C}$ whose error is small.

As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.

Probably Approximately Correct Learning

We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.

It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a query function which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.

And now we can iron out those words like “some” and “small” and “high” in our working definition. Since we’re going for small error, we’ll introduce a parameter $0 < \varepsilon < 1/2$ to represent our desired error bound. That is, our goal is to find a hypothesis $h$ such that $\textup{err}_{c,D}(h) \leq \varepsilon$ with high probability. And as $\varepsilon$ gets smaller and smaller (as we expect more and more of it), we want to allow our algorithm more time to run, so we limit our algorithm to run in time and space polynomial in $1/\varepsilon$.

We need another parameter to control the “high probability” part as well, so we’ll introduce $0 < \delta < 1/2$ to represent the small fraction of the time we allow our learning algorithm to have high error. And so our goal becomes to, with probability at least $1-\delta$, produce a hypothesis whose error is less than $\varepsilon$. In symbols, we want

$\textup{P}_D (\textup{err}_{c,D}(h) \leq \varepsilon) > 1 - \delta$

Note that the $\textup{P}_D$ refers to the probability over which samples you happen to get when you call the query function (and any other random choices made by the algorithm). The “high probability” hence refers to the unlikely event that you get data which is unrepresentative of the distribution generating it. Note that this is not the probability over which distribution is chosen; an algorithm which learns must still learn no matter what $D$ is.

And again as we restrict $\delta$ more and more, we want the algorithm to be allowed more time to run. So we require the algorithm runs in time polynomial in both $1/\varepsilon, 1/\delta$.

And now we have all the pieces to state the full definition.

Definition: Let $X$ be a set, and $\mathsf{C}$ be a concept class over $X$. We say that $\mathsf{C}$ is PAC-learnable if there is an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\textup{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta}))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all inputs $\varepsilon, \delta$ between 0 and $1/2$, the probability that $A$ produces a hypothesis $h$ with error at most $\varepsilon$ is at least $1- \delta$. In symbols,

$\displaystyle \textup{P}_{D}(\textup{P}_{x \sim D}(h(x) \neq c(x)) \leq \varepsilon) \geq 1-\delta$

where the first $\textup{P}_D$ is the probability over samples drawn from $D$ during the execution of the program to produce $h$. Equivalently, we can express this using the error function,

$\displaystyle \textup{P}_{D}(\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1-\delta$

Excellent.

Intervals are PAC-Learnable

Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.

Let’s now prove that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:

• Leave the number of samples you pick arbitrary, say $m$.
• Figure out the probability that the total error of our produced hypothesis is $> \varepsilon$ in terms of $m$.
• Pick $m$ to be sufficiently large that this event (failing to achieve low error) happens with small probability.

So fix any distribution $D$ over the real line and say we have our $m$ samples, we picked the max and min, and our interval is $I = [a_1,b_1]$ when the target concept is $J = [a_0, b_0]$. We can notice one thing, that our hypothesis is contained in the true interval, $I \subset J$. That’s because the sample never lie, so the largest sample we saw must be smaller than the largest possible positive example, and vice versa. In other words $a_0 < a_1 < b_1 < b_0$. And so the probability of our hypothesis producing an error is just the probability that $D$ produces a positive example in the two intervals $A = [a_0, a_1], B = [b_1, b_0]$.

This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of $A, B$ separately.

$\displaystyle \textup{err}_{J, D} \leq \textup{P}_{x \sim D}(x \in A) + \textup{P}_{x \sim D}(x \in B)$

Here’s a picture.

The two green intervals are the regions where error can occur.

If we can guarantee that each of the green pieces is smaller than $\varepsilon / 2$ with high probability, then we’ll be done. Let’s look at $A$, and the same argument will hold for $B$. Define $A'$ to be the interval $[a_0, y]$ which is so big that the probability that a positive example is drawn from $A'$ under $D$ is exactly $\varepsilon / 2$. Here’s another picture to clarify that.

The pink region A’ has total probability weight epsilon/2, and if the green region A is larger, we risk too much error to be PAC-learnable.

We’ll be in great shape if it’s already the case that $A \subset A'$, because that implies the probability we draw a positive example from $A$ is at most $\varepsilon / 2$. So we’re worried about the possibility that $A' \subset A$. But this can only happen if we never saw a point from $A'$ as a sample during the run of our algorithm. Since we had $m$ samples, we can compute in terms of $m$ the probability of never seeing a sample from $A'$.

The probability of a single sample not being in $A'$ is just $1 - \varepsilon/2$ (by definition!). Recalling our basic probability theory, two draws are independent events, and so the probability of missing $A'$ $m$ times is equal to the product of the probabilities of each individual miss. That is, the probability that our chosen $A$ contributes error greater than $\varepsilon / 2$ is at most

$\displaystyle \textup{P}_D(A' \subset A) \leq (1 - \varepsilon / 2)^m$

The same argument applies to $B$, so we know by the union bound that the probability of error $> \varepsilon / 2$ occurring in either $A$ or $B$ is at most the sum of the probabilities of large error in each piece, so that

$\displaystyle \textup{P}_D(\textup{err}_{J,D}(I) > \varepsilon) \leq 2(1 - \varepsilon / 2)^m$

Now for the third bullet. We want the chance that the error is big to be smaller than $\delta$, so that we’ll have low error with probability $> 1 - \delta$. So simply set

$\displaystyle 2(1 - \varepsilon / 2)^m \leq \delta$

And solve for $m$. Using the fact that $(1-x) \leq e^{-x}$ (which is proved by Taylor series), it’s enough to solve

$\displaystyle 2e^{-\varepsilon m/2} \leq \delta$,

And a fine solution is $m \geq (2 / \varepsilon \log (2 / \delta))$.

Now to cover all our bases: our algorithm simply computes $m$ for its inputs $\varepsilon, \delta$, queries that many samples, and computes the tightest-fitting interval containing the positive examples. Since the number of samples is polynomial in $1/\varepsilon, 1/\delta$ (and our algorithm doesn’t do anything complicated), we comply with the time and space bounds. And finally we just proved that the chance our algorithm will misclassify an $\varepsilon$ fraction of new points drawn from $D$ is at most $\delta$. So we have proved the theorem:

Theorem: Intervals on the real line are PAC-learnable.

$\square$

As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about to arbitrary axis-aligned boxes in $d$ dimensional space? Where does $d$ show up in the number of samples needed? Is this still efficient?

However PAC-learning is far from sacred. In particular, the choice that we require a single algorithm to succeed no matter what the distribution $D$ was a deliberate choice, and it’s quite a strong requirement for a learning algorithm. There are also versions of PAC that remove other assumptions from the definition, such that the oracle for the target concept is honest (noise-free) and that there is any available hypothesis that is actually true (realizability). These all give rise to interesting learning models and discovering the relationship between the models is the end goal.