In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.
But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.
Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called computational learning theory and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.
If you’re interested in following along with a book, the best introduction to the subject is the first few chapters of An Introduction to Computational Learning Theory.
So let’s jump right in and see what this award-winning definition is all about.
Learning Intervals
The core idea of PAC-learnability is easy to understand, and we’ll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers
Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine
PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval with high probability.
Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to prove it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.
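The algorithm just described fits in a few lines. The following is a minimal sketch, assuming samples arrive as `(point, label)` pairs; the names `learn_interval` and `predict` are illustrative and are not taken from the post.

```python
from typing import Iterable, Optional, Tuple

# (point, label): label is True when the point lies in the target interval.
Sample = Tuple[float, bool]


def learn_interval(samples: Iterable[Sample]) -> Optional[Tuple[float, float]]:
    """Return the tightest closed interval containing every positive example.

    Returns None (the empty hypothesis) when no positive example was seen.
    """
    positives = [x for x, label in samples if label]
    if not positives:
        return None
    return min(positives), max(positives)


def predict(hypothesis: Optional[Tuple[float, float]], x: float) -> bool:
    """Label a new point using the learned interval."""
    if hypothesis is None:
        return False
    low, high = hypothesis
    return low <= x <= high
```

Because the endpoints are themselves positive examples, the learned interval always sits inside the true one, so the only mistakes it can make are on positive points that fall near the true endpoints.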
Distributions and Hypotheses
First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set
So then Player 1 is reduced to a choice of distribution
A problem is PAC-learnable if there is an algorithm
Now we have to talk about how “intervals” fit in to the general picture. Because if we’re going to talk about learning in general, we won’t always be working with intervals to make decisions. So we’re really saying that Player 1 picks some function
And how can we compare them? Well, we can compute the probability that they differ. We call this the error:
One would say this aloud: “The error of the hypothesis
So now for a problem to be PAC-learnable we can say something like,
A problem is PAC-learnable if there is an algorithm which, for any distribution and any concept, will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.
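The error used here, the probability that hypothesis and concept disagree on a freshly drawn point, is easy to estimate by sampling. This is only a sketch: the helper `estimate_error`, the uniform distribution on the unit interval, and the two intervals are all assumptions chosen for illustration.

```python
import random
from typing import Callable


def estimate_error(
    hypothesis: Callable[[float], bool],
    concept: Callable[[float], bool],
    draw: Callable[[], float],
    trials: int = 100_000,
) -> float:
    """Monte Carlo estimate of the probability that hypothesis and concept disagree."""
    disagreements = 0
    for _ in range(trials):
        x = draw()
        if hypothesis(x) != concept(x):
            disagreements += 1
    return disagreements / trials


rng = random.Random(0)                      # fixed seed for repeatability
concept = lambda x: 0.25 <= x <= 0.75       # hypothetical target interval
hypothesis = lambda x: 0.30 <= x <= 0.75    # hypothetical learned interval
estimated = estimate_error(hypothesis, concept, rng.random)
```

Under the uniform distribution the two functions disagree exactly on the points between 0.25 and 0.30, so the estimate should land close to 0.05.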
There are still a few untrimmed hedges in this definition (like “some,” “small,” and “high”), but there’s a more important problem: there are just too many possible concepts! Even for finite sets: there are
So we need to boil down what possibilities there are for the concepts
Concept Classes
A concept class
No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking
A concept class
As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.
Probably Approximately Correct Learning
We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.
It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a query function which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.
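To make the query function concrete, here is one way it might look in code. It is a sketch under stated assumptions: `make_query` and `learn_with_budget` are hypothetical names, points are real numbers, and the learner is the interval learner from earlier in the post.

```python
import random
from typing import Callable, Optional, Tuple


def make_query(
    draw: Callable[[], float],
    concept: Callable[[float], bool],
) -> Callable[[], Tuple[float, bool]]:
    """Build a query function: each call draws one point and labels it with the concept."""
    def query() -> Tuple[float, bool]:
        x = draw()
        return x, concept(x)
    return query


def learn_with_budget(
    query: Callable[[], Tuple[float, bool]],
    num_calls: int,
) -> Optional[Tuple[float, float]]:
    """Call the query function num_calls times and keep the extreme positive examples."""
    low = high = None
    for _ in range(num_calls):
        x, label = query()
        if label:
            low = x if low is None else min(low, x)
            high = x if high is None else max(high, x)
    return None if low is None else (low, high)


rng = random.Random(1)
query = make_query(rng.random, lambda x: 0.2 <= x <= 0.6)  # hypothetical target interval
hypothesis = learn_with_budget(query, num_calls=200)
```

The learner never sees the distribution or the concept directly; it only decides how many times to call `query`, which is the quantity the definition asks us to keep reasonable.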
And now we can iron out those words like “some” and “small” and “high” in our working definition. Since we’re going for small error, we’ll introduce a parameter
We need another parameter to control the “high probability” part as well, so we’ll introduce
Note that the
And again as we restrict
And now we have all the pieces to state the full definition.
Definition: Let
where the first
Intervals are PAC-Learnable
Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.
Let’s now prove that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:
- Leave the number of samples you pick arbitrary, say $m$.
- Figure out the probability that the total error of our produced hypothesis is greater than $\varepsilon$ in terms of $m$.
- Pick $m$ to be sufficiently large that this event (failing to achieve low error) happens with small probability.
So fix any distribution
This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of
Here’s a picture.
If we can guarantee that each of the green pieces is smaller than
We’ll be in great shape if it’s already the case that
The probability of a single sample not being in
The same argument applies to
Now for the third bullet. We want the chance that the error is big to be smaller than
And solve for
And a fine solution is
Now to cover all our bases: our algorithm simply computes
Theorem: Intervals on the real line are PAC-learnable.
As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about to arbitrary axis-aligned boxes in
Comments and Previews
There are a few more technical details we’ve ignored in the course of this post, but the important ideas are clear. We have a formal model of learning which allows for certain pre-specified levels of imperfection, and we proved that one can learn how to recognize intervals in this model. It’s a far cry from decision trees and neural networks, but it’s a solid foundation to build upon.
However, the definition we presented here for PAC-learning is not quite complete. It turns out, as we’ll see in the next post, that forcing the PAC-learning algorithm to produce hypotheses from the same concept class it’s trying to learn makes some problems hard that should be easy. This is because it could require the algorithm to represent a simple hypothesis in a convoluted form. In the next post we’ll see that this is not an idle threat, and we’ll make a slight modification to the PAC definition.
However, PAC-learning is far from sacred. In particular, the choice that we require a single algorithm to succeed no matter what the distribution
And so the kinds of questions we ask are: can we classify all PAC-learnable problems? Can we find a meta-algorithm that would work on any PAC-learnable concept class given some assumptions? How does PAC-learning relate to other definitions of learning? Say, one where we don’t require it to work for every distribution; would that really allow us to solve more problems?
It’s a question of finding out the deep truths of mathematics now, but we promise that this series will soon come back around to practical applications, for learning theory naturally entails the design and analysis of fascinating algorithms.
Until next time!