In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.

But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.

Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called *computational learning theory* and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.

If you’re interested in following along with a book, the best introduction to the subject is the first few chapters of *An Introduction to Computational Learning Theory* by Kearns and Vazirani.

So let’s jump right in and see what this award-winning definition is all about.

## Learning Intervals

The core idea of PAC-learnability is easy to understand, and we’ll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers $ x$ at random in some fixed way, and in Player 1’s mind he has an interval $ [a,b]$. Whenever Player 1 gives out an $ x$, he must also say whether it’s in the interval (that is, whether $ a \leq x \leq b$). Let’s say that Player 1 reports a 1 if $ x$ is in the interval, and a 0 otherwise. We’ll call this number the *label* of $ x$, and call the pair ($ x$, label) a *sample*, or an *example*. We recognize that the one and zero correspond to “yes” and “no” answers to some question (Is this email spam? Does the user click on my ad? etc.), and so sometimes the labels are instead $ \pm 1$, and referred to as “positive” or “negative” examples. We’ll use the positive/negative terminology here, so positive is a 1 and negative is a 0.

Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine $ a$ and $ b$. Of course Player 2 can’t guess the interval exactly if the endpoints are real numbers, because Player 1 only gives out finitely many samples. But whatever interval Player 2 *does* guess at the end can be tested against Player 1’s number-producing scheme. That is, we can compute the probability that Player 2’s interval will give an incorrect label if Player 1 were to continue giving out numbers indefinitely. If this error is small (taking into account how many samples were given), then Player 2 has “learned” the interval. And if Player 2 plays this game over and over and usually wins (no matter what strategy or interval Player 1 decides to use!), then we say this problem is PAC-learnable.

PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval *with high probability*.

Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to *prove* it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.
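Here is a minimal sketch of that strategy in Python. The names `learn_interval` and `predict`, the representation of samples as `(x, label)` pairs, and the use of `None` for "no positive examples seen" are all choices made for this illustration, not part of the formal theory.

```python
def learn_interval(samples):
    """Return the tightest interval (lo, hi) containing every positive example.

    `samples` is an iterable of (x, label) pairs, with label 1 for a positive
    example and 0 for a negative one. Returns None (standing in for the empty
    interval) if no positive example was seen.
    """
    positives = [x for x, label in samples if label == 1]
    if not positives:
        return None
    return (min(positives), max(positives))


def predict(interval, x):
    """Label x with 1 if it falls inside the learned interval, else 0."""
    if interval is None:
        return 0
    lo, hi = interval
    return 1 if lo <= x <= hi else 0


samples = [(0.1, 0), (0.35, 1), (0.5, 1), (0.62, 1), (0.9, 0)]
hypothesis = learn_interval(samples)
print(hypothesis)  # (0.35, 0.62)
```

Note that the negative examples are ignored entirely; only the smallest and largest positive examples matter.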

## Distributions and Hypotheses

First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set $ X$. Could be finite, could be infinite, no restrictions. And we’re getting samples randomly generated from $ X$ according to some fixed but arbitrary and unknown distribution $ D$. To be completely rigorous, the samples are independent and identically distributed (they’re all drawn from the same $ D$ and independently so). This is Player 1’s dastardly decision: how should he pick his method to generate random numbers so as to bring Player 2’s algorithm to the most devastating ruin?

So then Player 1 is reduced to a choice of distribution $ D$ over $ X$, and since we said that Player 2’s algorithm has to do well with high probability no matter what $ D$ is, then the definition becomes something like this:

A problem is PAC-learnable if there is an algorithm $ A$ which will likely win the game for all distributions $ D$ over $ X$.

Now we have to talk about how “intervals” fit in to the general picture. Because if we’re going to talk about learning in general, we won’t always be working with intervals to make decisions. So we’re really saying that Player 1 picks some function $ c$ for classifying points in $ X$ as a 0 or a 1. We’ll call this a *concept*, or a *target*, and it’s the thing Player 2 is trying to learn. That is, Player 2 is producing her own function $ h$ that also labels points in $ X$, and we’re comparing it to $ c$. We call a function generated by Player 2 a *hypothesis* (hence the use of the letter h).

And how can we compare them? Well, we can compute the probability that they differ. We call this the error:

$ \displaystyle \textup{err}_{c,D}(h) = \textup{P}_D(h(x) \neq c(x))$

One would say this aloud: “The *error* of the hypothesis $ h$ with respect to the concept $ c$ and the distribution $ D$ is the probability over $ x$ drawn via $ D$ that $ h(x)$ and $ c(x)$ differ”. Some might write the “differ” part as the symmetric difference of the two functions viewed as *sets*, and then the error becomes the probability mass that $ D$ assigns to that set, if that’s your cup of tea (it’s not mine).
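When $ X$ is finite the error can be computed exactly by adding up the probability of every point where the two functions disagree. The following sketch does that; the function name `error`, the dictionary representation of $ D$, and the toy numbers are assumptions made purely for illustration.

```python
def error(h, c, dist):
    """Probability under `dist` that hypothesis h and concept c disagree.

    `dist` maps each point of a finite set X to its probability.
    """
    return sum(p for x, p in dist.items() if h(x) != c(x))


# A toy distribution on four points (hypothetical numbers).
dist = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
c = lambda x: 1 if 2 <= x <= 4 else 0  # target concept: the interval [2, 4]
h = lambda x: 1 if 2 <= x <= 3 else 0  # hypothesis: the interval [2, 3]

print(error(h, c, dist))  # 0.125, since h and c disagree only at x = 4
```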

So now for a problem to be PAC-learnable we can say something like,

A problem is PAC-learnable if there is an algorithm $ A$ which for any distribution $ D$ and any concept $ c$ will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.

There are still a few untrimmed hedges in this definition (like “some,” “small,” and “high”), but there’s still a more important problem: there are just too many possible concepts! Even for finite sets: there are $ 2^n$ $ \left \{ 0,1 \right \}-$valued functions on a set of $ n$ elements, and if we hope to run in polynomial time we can only possibly express a minuscule fraction of those functions. Going back to the interval game, it’d be totally unreasonable to expect Player 2 to be able to get a reasonable hypothesis (using intervals or not!) if Player 1’s chosen concept is arbitrary. (The mathematician in me is imagining some crazy rule using non-measurable sets, but just suffice it to say: you might think you know everything about the real numbers, but you don’t.)
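To get a feel for how quickly the number of possible concepts grows, here is a small illustrative sketch that enumerates every $ \left \{ 0,1 \right \}$-valued function on a tiny set. The helper `all_labelings` is a name made up for this example.

```python
from itertools import product


def all_labelings(points):
    """Yield every {0,1}-valued function on `points`, each as a dict."""
    for labels in product([0, 1], repeat=len(points)):
        yield dict(zip(points, labels))


points = ["a", "b", "c"]
functions = list(all_labelings(points))
print(len(functions))  # 8, which is 2**3
```

Already at $ n = 100$ the count $ 2^{100}$ is far beyond anything an algorithm could hope to enumerate.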

So we need to boil down what possibilities there are for the concepts $ c$ and the allowed expressive power of the learner. This is what concept classes are for.

## Concept Classes

A *concept class* $ \mathsf{C}$ over $ X$ is a family of functions $ X \to \left \{ 0,1 \right \}$. That’s all.

No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking $ \mathsf{C}$ to be the set of all characteristic functions of intervals, $ \chi_{[a,b]}(x) = 1$ if $ a \leq x \leq b$ and 0 otherwise. Now the *concept class* becomes the sole focus of our algorithm. That is, the algorithm may use knowledge of the concept class to produce its hypotheses. So our working definition becomes:
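As a concrete (and purely illustrative) rendering of this concept class, each interval can be turned into its characteristic function. The name `interval_concept` is an assumption of this sketch.

```python
def interval_concept(a, b):
    """Return the characteristic function of the closed interval [a, b]."""
    def chi(x):
        return 1 if a <= x <= b else 0
    return chi


c = interval_concept(0.25, 0.75)
print([c(x) for x in (0.0, 0.25, 0.5, 1.0)])  # [0, 1, 1, 0]
```

The concept class is then the (infinite) family of all functions that `interval_concept` can return as `a` and `b` range over the reals.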

A concept class $ \mathsf{C}$ is PAC-learnable if there is an algorithm $ A$ which, for any distribution $ D$ of samples and any concept $ c \in \mathsf{C}$, will with high probability produce a hypothesis $ h \in \mathsf{C}$ whose error is small.

As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.

## Probably Approximately Correct Learning

We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.

It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a *query function* which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.
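One way to picture the query function is as a closure that hides both the distribution and the target concept from the learner. This is only a sketch under assumptions: `make_query`, the call counter, and the choice of the uniform distribution on $ [0,1)$ are all invented for the example.

```python
import random


def make_query(concept, draw, counter):
    """Build a query function: each call draws x from D and labels it with c.

    `draw` is a zero-argument function that samples from the distribution D,
    and `counter` is a one-element list used to count how many calls are made.
    """
    def query():
        counter[0] += 1
        x = draw()
        return (x, concept(x))
    return query


rng = random.Random(0)  # fixed seed so the run is reproducible
concept = lambda x: 1 if 0.25 <= x <= 0.75 else 0
calls = [0]
query = make_query(concept, rng.random, calls)

samples = [query() for _ in range(5)]
print(calls[0])  # 5 calls to the query function so far
```

The learner only ever sees the labeled pairs that `query` returns, and the counter makes "a reasonable number of calls" something we can actually measure.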

And now we can iron out those words like “some” and “small” and “high” in our working definition. Since we’re going for small error, we’ll introduce a parameter $ 0 < \varepsilon < 1/2$ to represent our desired error bound. That is, our goal is to find a hypothesis $ h$ such that $ \textup{err}_{c,D}(h) \leq \varepsilon $ with high probability. And as $ \varepsilon$ gets smaller and smaller (as we expect more and more of it), we want to allow our algorithm more time to run, so we limit our algorithm to run in time and space polynomial in $ 1/\varepsilon$.

We need another parameter to control the “high probability” part as well, so we’ll introduce $ 0 < \delta < 1/2$ to represent the small fraction of the time we allow our learning algorithm to have high error. And so our goal becomes to, with probability at least $ 1-\delta$, produce a hypothesis whose error is at most $ \varepsilon$. In symbols, we want

$ \textup{P}_D (\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1 - \delta$

Note that the $ \textup{P}_D$ refers to the probability over which samples you happen to get when you call the query function (and any other random choices made by the algorithm). The “high probability” hence accounts for the unlikely event that you get data which is unrepresentative of the distribution generating it; we only ask the algorithm to succeed when that doesn’t happen. Note that this is *not* the probability over which distribution is chosen; an algorithm which learns must still learn no matter what $ D$ is.

And again as we restrict $ \delta$ more and more, we want the algorithm to be allowed more time to run. So we require the algorithm runs in time polynomial in both $ 1/\varepsilon, 1/\delta$.

And now we have all the pieces to state the full definition.

**Definition:** Let $ X$ be a set, and $ \mathsf{C}$ be a concept class over $ X$. We say that $ \mathsf{C}$ is *PAC-learnable* if there is an algorithm $ A(\varepsilon, \delta)$ with access to a query function for $ \mathsf{C}$ and runtime $ O(\textup{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta}))$, such that for all $ c \in \mathsf{C}$, all distributions $ D$ over $ X$, and all inputs $ \varepsilon, \delta$ between 0 and $ 1/2$, the probability that $ A$ produces a hypothesis $ h$ with error at most $ \varepsilon$ is at least $ 1- \delta$. In symbols,

$ \displaystyle \textup{P}_{D}(\textup{P}_{x \sim D}(h(x) \neq c(x)) \leq \varepsilon) \geq 1-\delta$

where the first $ \textup{P}_D$ is the probability over samples drawn from $ D$ during the execution of the program to produce $ h$. Equivalently, we can express this using the error function,

$ \displaystyle \textup{P}_{D}(\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1-\delta$

Excellent.

## Intervals are PAC-Learnable

Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.

Let’s now *prove* that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:

- Leave the number of samples you pick arbitrary, say $ m$.
- Figure out the probability that the total error of our produced hypothesis is $ > \varepsilon$ in terms of $ m$.
- Pick $ m$ to be sufficiently large that this event (failing to achieve low error) happens with small probability.

So fix any distribution $ D$ over the real line and say we have our $ m$ samples, we picked the max and min, and our interval is $ I = [a_1,b_1]$ when the target concept is $ J = [a_0, b_0]$. We can notice one thing, that our hypothesis is contained in the true interval, $ I \subseteq J$. That’s because the samples never lie, so the largest positive sample we saw can be no larger than the largest possible positive example, and likewise the smallest can be no smaller than the smallest possible one. In other words $ a_0 \leq a_1 \leq b_1 \leq b_0$. And so the probability of our hypothesis producing an error is just the probability that $ D$ produces a positive example in one of the two intervals $ A = [a_0, a_1), B = (b_1, b_0]$.

This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of $ A, B$ separately.

$ \displaystyle \textup{err}_{J, D}(I) \leq \textup{P}_{x \sim D}(x \in A) + \textup{P}_{x \sim D}(x \in B)$

Here’s a picture.

If we can guarantee that each of the green pieces ($ A$ and $ B$) has probability smaller than $ \varepsilon / 2$ with high probability, then we’ll be done. Let’s look at $ A$, and the same argument will hold for $ B$. Define $ A'$ to be the interval $ [a_0, y]$ which is so big that the probability that a positive example is drawn from $ A'$ under $ D$ is *exactly* $ \varepsilon / 2$. (If even all of $ J$ has probability less than $ \varepsilon / 2$, no such $ y$ exists, but then $ A$ contributes less than $ \varepsilon / 2$ automatically.) Here’s another picture to clarify that.

We’ll be in great shape if it’s already the case that $ A \subset A'$, because that implies the probability we draw a positive example from $ A$ is at most $ \varepsilon / 2$. So we’re worried about the possibility that $ A' \subset A$. But this can only happen if we never saw a point from $ A'$ as a sample during the run of our algorithm. Since we had $ m$ samples, we can compute in terms of $ m$ the probability of never seeing a sample from $ A'$.

The probability of a single sample not being in $ A'$ is just $ 1 - \varepsilon/2$ (by definition!). Recalling our basic probability theory, the draws are independent events, and so the probability of missing $ A'$ $ m$ times is equal to the product of the probabilities of each individual miss. That is, the probability that our chosen $ A$ contributes error greater than $ \varepsilon / 2$ is at most

$ \displaystyle \textup{P}_D(A' \subset A) \leq (1 - \varepsilon / 2)^m$

The same argument applies to $ B$, so we know by the union bound that the probability of error $ > \varepsilon / 2$ occurring in either $ A$ or $ B$ is at most the sum of the probabilities of large error in each piece, so that

$ \displaystyle \textup{P}_D(\textup{err}_{J,D}(I) > \varepsilon) \leq 2(1 - \varepsilon / 2)^m$

Now for the third bullet. We want the chance that the error is big to be smaller than $ \delta$, so that we’ll have low error with probability $ > 1 - \delta$. So simply set

$ \displaystyle 2(1 - \varepsilon / 2)^m \leq \delta$

And solve for $ m$. Using the fact that $ (1-x) \leq e^{-x}$ (which is proved by Taylor series), it’s enough to solve

$ \displaystyle 2e^{-\varepsilon m/2} \leq \delta$,

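And a fine solution is $ m \geq (2 / \varepsilon) \log (2 / \delta)$, where $ \log$ is the natural logarithm: taking logs of both sides of the last inequality gives $ -\varepsilon m / 2 \leq \log(\delta / 2)$, which rearranges to this bound.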
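To see what this bound looks like with actual numbers, here is a short sketch that computes the sample size and plugs it back into the failure bound. The function names `sample_size` and `failure_bound` are invented for this example, and the logarithm is taken to be the natural log, matching the $ e^{-x}$ inequality used in the derivation.

```python
import math


def sample_size(eps, delta):
    """Smallest integer m with m >= (2 / eps) * ln(2 / delta)."""
    return math.ceil((2 / eps) * math.log(2 / delta))


def failure_bound(eps, m):
    """The bound 2 * (1 - eps/2)^m on the probability that the error exceeds eps."""
    return 2 * (1 - eps / 2) ** m


m = sample_size(0.1, 0.1)
print(m)                      # 60
print(failure_bound(0.1, m))  # roughly 0.092, which is below delta = 0.1
```

So for 10% error with 90% confidence, 60 samples suffice no matter what the distribution or the target interval is.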

Now to cover all our bases: our algorithm simply computes $ m$ for its inputs $ \varepsilon, \delta$, queries that many samples, and computes the tightest-fitting interval containing the positive examples. Since the number of samples is polynomial in $ 1/\varepsilon, 1/\delta$ (and our algorithm doesn’t do anything complicated), we comply with the time and space bounds. And finally we just proved that the chance our algorithm will misclassify more than an $ \varepsilon$ fraction of new points drawn from $ D$ is at most $ \delta$. So we have proved the theorem:
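The whole algorithm fits in a few lines, and we can sanity check the guarantee empirically by playing the game many times. The sketch below is one possible simulation, not a proof: the names `pac_learn_interval` and `true_error`, the uniform distribution on $ [0,1)$, the target $ [0.3, 0.7]$, and the number of trials are all assumptions chosen for illustration.

```python
import math
import random


def pac_learn_interval(query, eps, delta):
    """Draw m samples via `query` and return the tightest-fit interval."""
    m = math.ceil((2 / eps) * math.log(2 / delta))
    positives = [x for x, label in (query() for _ in range(m)) if label == 1]
    return (min(positives), max(positives)) if positives else None


def true_error(hypothesis, target):
    """Exact error under the uniform distribution on [0, 1).

    The hypothesis always sits inside the target, so the error is the
    length of the target that the hypothesis fails to cover.
    """
    a0, b0 = target
    if hypothesis is None:  # empty hypothesis: every positive point is missed
        return b0 - a0
    a1, b1 = hypothesis
    return (a1 - a0) + (b0 - b1)


rng = random.Random(1)  # fixed seed so the run is reproducible
target = (0.3, 0.7)
eps, delta = 0.1, 0.1


def query():
    """Draw x uniformly from [0, 1) and label it with the target interval."""
    x = rng.random()
    return (x, 1 if target[0] <= x <= target[1] else 0)


trials = 200
failures = sum(
    true_error(pac_learn_interval(query, eps, delta), target) > eps
    for _ in range(trials)
)
print(failures / trials)  # empirical failure rate; the theorem promises at most delta
```

In practice the observed failure rate tends to be well below $ \delta$, since the bound we proved is not tight.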

**Theorem:** Intervals on the real line are PAC-learnable.

$ \square$

As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about to arbitrary axis-aligned boxes in $ d$ dimensional space? Where does $ d$ show up in the number of samples needed? Is this still efficient?

## Comments and Previews

There are a few more technical details we’ve ignored in the course of this post, but the important ideas are clear. We have a formal model of learning which allows for certain pre-specified levels of imperfection, and we proved that one can learn how to recognize intervals in this model. It’s a far cry from decision trees and neural networks, but it’s a solid foundation to build upon.

However, the definition we presented here for PAC-learning is not quite complete. It turns out, as we’ll see in the next post, that forcing the PAC-learning algorithm to produce hypotheses from the same concept class it’s trying to learn makes some problems that should be easy hard. This is just because it could require the algorithm to represent some simple hypothesis in a convoluted form, and in the next post we’ll see that this is not an idle threat, and we’ll make a slight modification to the PAC definition.

However, PAC-learning is far from sacred. In particular, requiring a single algorithm to succeed no matter what the distribution $ D$ is was a deliberate choice, and it’s quite a strong requirement for a learning algorithm. There are also versions of PAC that remove other assumptions from the definition, such as that the oracle for the target concept is honest (noise-free) and that some available hypothesis is actually true (realizability). These all give rise to interesting learning models, and discovering the relationships between the models is the end goal.

And so the kinds of questions we ask are: can we classify all PAC-learnable problems? Can we find a meta-algorithm that would work on any PAC-learnable concept class given some assumptions? How does PAC-learning relate to other definitions of learning? Say, one where we don’t require it to work for every distribution; would that *really* allow us to solve more problems?

It’s a question of finding out the deep truths of mathematics now, but we promise that this series will soon come back around to practical applications, for learning theory naturally entails the design and analysis of fascinating algorithms.

Until next time!

the set of F:X→{0,1} is in bijection with P(X) because you take something from the power set, give it all the 1 values, and give the complement the 0 values

That’s true, but we rarely think of it like that because we rarely allow for ALL functions F. Rather, we have to limit the kinds of functions we consider into one kind of representation class. In learning “intervals” the distinction isn’t so clear, but you can do the same thing with decision trees, for example, or neural networks. There appears to be no easy way to describe the corresponding subset.

But the argument of the OP shows that there are only 2^n functions on a set with n elements and values in {0,1}, and not 2^(2^n) as claimed in the article. Or am I missing something here?

I misinterpreted the original comment. And yes you’re right; I think I was mixing things up when I wrote that because I was thinking a lot about circuits at the time. There are 2^2^n boolean functions on n variables, and 2^n {0,1}-valued functions on a set. It’s fixed now.

Thanks for the elucidated article. I might sound dumb, but – is it in any way related to Hoeffding inequality?

Hoeffding is often used in learning theory. For example, it’s what enables boosting to work. I’m working on some follow-up articles to this one, and we will definitely be using Hoeffding.

K…thanks waiting for the post on Hoeffding !

To be sure, I already have a post on Hoeffding itself, just not on where it’s used in learning theory.

In the paragraph starting with “The probability of a single sample not being in A”, shouldn’t the first two occurrences of A actually be A’?

Quite right. Thanks for that correction!

I Googled “PAC learning”, because I am conducting research for an article on sensory evolution and genetic learning. Google returned a list of various academic papers and blogs on the subject. Your blog was fifth on the list, but it was by far the best in terms of readability.

Hello Jeremy,

My name is Carlos Acosta. I live in Nipomo California, not too far from San Luis Obispo and Cal-Poly. I am not a mathematician, but rather a process philosopher investigating the physiological evolution of sensations and perceptions and conscious thought. Please Google: (Acosta, 2012, pp. 75-113 and Acosta, 2006, pp.151-165). Each of these papers explores the idea that conscious thought results from an evolutionary process of genetic information accumulation and approximation, and that all thought is mostly composed of complex aggregates of very small genetically fixed assumptions, i.e., micro-predictions concerning the underlying nature of reality, similar to the inductive hypotheses of science (Gregory,1968 a & b, 1970, 1980, 1981, & 1997). Moreover, these micro-inferences all emerge within multiple models of reality that have a structure and function similar to that of a game.

I respectfully request permission to use the content of your PAC learning blog in the paper I mentioned previously.

Please contact me at your earliest convenience. I can be reached by email at:

carlosacosta.new@gmail.com

All the best,

Carlos

You’re welcome to use whatever you want from my blog with proper attribution. Perhaps you would be more interested in the primary source, Leslie Valiant, who invented all of this stuff. He wrote a book called “Probably Approximately Correct” (http://www.amazon.com/Probably-Approximately-Correct-Algorithms-Prospering/dp/0465032710) which deals more directly with the relationship between his theory of learning and evolution and genetics.

Thanks for nice explanation. I’m a newbie student, and I had some confusion while reading your blog. Could you answer to the following question?

In the “Intervals are PAC-Learnable” section, don’t we need to consider the case when our samples never lie in $J$? I think calculation of the error probability should take this “tragic case” into the consideration. As I understood, “the polynomial function” of $1/\epsilon$ and $1/\delta$ should not depend on particular choice of $J$ or $D$, because if there is a dependency, then essentially there is no way to determine appropriate number of samples; but if we consider the “tragic case”, this seems to be impossible to achieve. Is it right?

If all examples are outside of J, then you never see any positively labeled examples in your entire sample. I agree we have to consider this case, but if the probability weight of the region for J is smaller than $\epsilon$, then the hypothesis consisting of the empty interval is a correct answer. Otherwise, the probability of never drawing a positive example is at most $(1-\epsilon)^m$, and our choice of $m$ ensures that this quantity is less than $\delta$, which means it works for PAC.

Oh, you already wrote an answer! Thank you!

Ah.. the confusion was resolved.

If probability that a sample belongs to $J$ is smaller than $\epsilon$, then the “tragic case” is not an error case

If probability that a sample belongs to $J$ is larger than $\epsilon$, then the tragic case only occurs with probability less than $(1-\epsilon)^m$

Thanks anyway!

Thanks for explanation!

I am not a native English speaker, so there may be some mistakes in my writing. I hope you will forgive that. I am a newcomer to computational learning theory, and I am a little bit confused by the section “Intervals are PAC-Learnable”, so I am writing my question below. It would be very kind of you to answer it.

Firstly, you say in this section that “The total error is at most the sum of the probabilities that a positive sample shows up in each of A, B separately”, which I interpret as “the total error is equal to or less than the sum of the probabilities that a positive sample shows up in each of A, B separately”. But since the intersection of A and B is the empty set, is it right to replace “equal to or less than” with just “equal to”?

Secondly, I think the requirement that each of the green pieces be smaller than epsilon / 2 is an extra condition beyond what the intervals problem itself imposes. Though I agree that it works, is it reasonable to think that the sample size m could be smaller than 2 / ε log2(1 / δ) and still guarantee that a hypothesis with the same error is produced with the same probability?

> Firstly…

The rigorous question I was asking when I said that is: “what is the total probability weight outside my hypothesis interval, but within the true interval?” So yes, in this case they are disjoint regions and so the total is equal to the sum of the parts. If you were to upgrade this example to, say, rectangles in the plane, the same error bound analysis would work, but the “error regions” would be overlapping. This is why I used an inequality here.

> Secondly…

I think provided you cannot improve the sample size by an order of magnitude (in either epsilon or delta), then from a learning theory perspective it is not so interesting. Usually, optimal sample sizes are most important by order of magnitude. So unless you think a new technique could do better than improving sample size by a constant factor, putting in the extra work may not be worth it.

In a related note, if you are interested in exciting and new research in PAC learning, this paper from 2016 establishes the optimal (generic) PAC learning sample sizes http://www.jmlr.org/papers/volume17/15-389/15-389.pdf. It turns out, 1/epsilon (d + log(1/delta)) is optimal, where d is the VC-dimension of the hypothesis space. I should really write a blog post on that 🙂

It’s clear to me now. And thank you for informing about that paper, I will try to read it.

Hi Jeremy!

It’s a nice post. I followed some papers on PAC and decided to implement it on a sample dataset, but was facing issues in generating negative counterexamples. Guessed that there has to be some intelligent way to sample subsets from the attribute set. It would be great if you could throw some light on it.

You mean for intervals? What exactly are you trying to implement?

Hello Jeremy!

Thanks for the great article, I feel like I finally gained some intuition on this subject.

There is still one step that I cannot follow though. I tried to derive the minimum sample size m myself before reading the article, but I failed to come up with the formula

2*(1 – e/2)^m

for the probability. What is the reasoning for making the left and right difference between the intervals exactly e/2 each? Why is any other choice like say

(1 – 2/3 * e)^m + (1 – 1/3 * e)^m

not plausible?

The choices in these proofs are largely a balance between making the proof work and making the proof easy.

With $2(1 - \epsilon/2)^m$ you can compute a logarithm later when solving for $m$ in terms of $\epsilon$ and $\delta$ (whereas computing a log of a sum is not as trivial). We specifically chose the intervals as we did, so that they give you the same probability, with an eye toward making that inequality easy to solve.

My apologies for not making that clear in the article.

That makes sense, thanks for the explanation!

Dear Jeremy,

How come the probability in the PAC definition does not depend on the sample size? If it is true for all delta, it is zero after all, isn’t it?

Best regards,

Enno

It is only true for positive delta. The PAC definition does not depend on the sample size, but the sample size that shows up in theorems depends on 1/delta.

Thanks for explanation!

I am not a native speaker, either. I’ve read Garrett’s reply from February 21, 2017, but there’s still a little question bothering me.

Question:

The probability that a positive point lies in each green interval (A or B) is set equal to 0.5 ε. Why not 0.4 ε and 0.6 ε? Why don’t we just consider A and B as a whole? Then the probability of getting a point from A∪B is ε, and we have (1-ε)^n < δ. For the higher dimensional case, I still want to use this method.

I must be terribly wrong but I just couldn't figure it out. So would you help me on this one? Anyway, thanks for this post, I've learned much.