Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability p. If X denotes the number of trials needed to get the first success, then clearly \Pr(X = k) = (1-p)^{k-1} p (since first we need k-1 failures which occur independently with probability 1-p, then we need one success which happens with probability p). Thus the expected value of X is

\displaystyle E(X) = \sum_{k=1}^\infty k \Pr(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p

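by basic calculus (differentiate the geometric series \sum_{k=0}^\infty x^k = \frac{1}{1-x} term by term to get \sum_{k=1}^\infty k x^{k-1} = \frac{1}{(1-x)^2}, then set x = 1-p). In particular, if success is defined as getting a six, then p = 1/6, so the expected number of throws is 1/p = 6.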
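For readers who like to check such claims numerically, here is a minimal sketch that estimates the expected number of throws by simulation and also evaluates a partial sum of the series. The function names, the seed, and the number of trials are arbitrary choices made for illustration.

```python
import random

def throws_until_six(rng):
    """Roll a fair die until a six appears; return the number of rolls."""
    count = 0
    while True:
        count += 1
        if rng.randint(1, 6) == 6:
            return count

def partial_expectation(p, terms):
    """Partial sum of sum_{k>=1} k (1-p)^(k-1) p."""
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
trials = 100_000
average = sum(throws_until_six(rng) for _ in range(trials)) / trials
print(average)                          # should be close to 6
print(partial_expectation(1 / 6, 500))  # should be very close to 6
```

The simulated average fluctuates from seed to seed, while the partial sum approaches 1/p deterministically.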

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomers the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is \frac{1}{26}, thus the probability of a random eleven-letter word being ABRACADABRA is exactly \left(\frac{1}{26}\right)^{11}. So if typing 11 letters is one trial, the expected number of trials is

\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}

which means 11\cdot 26^{11} keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we counted a string as a success only if ABRACADABRA started exactly at the beginning of a block, that is, after a multiple of 11 keystrokes. For example, FRZUNWRQXKLABRACADABRA would be recognized as a success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that 11\cdot 26^{11} is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that if the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly \frac{1}{26} (thus the expected value of the change in the gambler’s fortune is \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0).

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as T. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly T dollars.

How much will we have to pay the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins \$26^4. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence; his prize will be \$26^{11}. Thus our casino will have to give out 26^{11}+26^4+26 dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is T dollars, the expected value of our expenses is 26^{11}+26^4+26 dollars, thus E(T)=26^{11}+26^4+26. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.
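The casino argument can be turned into a small computation: every prefix of the word that is also a suffix corresponds to one gambler who is still winning when the casino closes. The sketch below adds up those payouts and, since simulating a 26-key typewriter would take far too long, sanity-checks the same reasoning on a hypothetical two-key typewriter with the word ABA. The function names and the test word are illustrative assumptions, not part of the original argument.

```python
import random

def expected_wait(word, alphabet_size):
    """Casino payout at closing time: one term alphabet_size**k for every
    prefix of length k that is also a suffix of the word."""
    return sum(alphabet_size ** k
               for k in range(1, len(word) + 1)
               if word[:k] == word[-k:])

def simulate_wait(word, alphabet, rng):
    """Type uniformly random letters until `word` appears; return the keystroke count."""
    recent = ""
    keystrokes = 0
    while recent != word:
        # Keep only the last len(word) letters typed so far.
        recent = (recent + rng.choice(alphabet))[-len(word):]
        keystrokes += 1
    return keystrokes

print(expected_wait("ABRACADABRA", 26))  # equals 26**11 + 26**4 + 26

# Sanity check on a hypothetical two-letter typewriter, where simulation is cheap.
rng = random.Random(1)
trials = 20_000
average = sum(simulate_wait("ABA", "AB", rng) for _ in range(trials)) / trials
print(average, expected_wait("ABA", 2))  # the two numbers should be close
```

For ABA the winning gamblers at closing time are the one who bet on the whole word and the one who just bet on the final A, so the formula gives 2^3 + 2 = 10.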

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. X_0 will denote the gambler’s fortune before the game starts, X_1 the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of X_n, given X_0, X_1, \ldots, X_{n-1} is the same as X_{n-1}. This can be written as E(X_n | X_0, X_1, \ldots, X_{n-1}) = X_{n-1} or, equivalently, E(X_n - X_{n-1} | X_0, X_1, \ldots, X_{n-1}) = 0.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.
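To get a feel for the random walk example, here is a minimal simulation sketch. It only checks a consequence of the martingale property, namely that the average position stays near the starting point at every fixed time. The number of walks, the number of steps, and the seed are arbitrary.

```python
import random

def random_walk(steps, rng):
    """Return the positions X_0, ..., X_steps of a simple symmetric random walk."""
    position = 0
    path = [position]
    for _ in range(steps):
        position += rng.choice((-1, 1))  # fair coin toss: one step left or right
        path.append(position)
    return path

rng = random.Random(2)
walks = [random_walk(50, rng) for _ in range(20_000)]

# Average position over all simulated walks at two fixed times.
means = {n: sum(path[n] for path in walks) / len(walks) for n in (10, 50)}
print(means)  # both averages should be close to X_0 = 0
```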

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let X'_n = X_n if n < T and X'_n = X_T if n \ge T where T denotes the stopping time, i.e. the time at which the desired event occurs. Notice that T itself is a random variable.

We require our stopping time T to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let X_n be a martingale stopped at step T, and suppose one of the following three conditions holds:

  1. The stopping time T is almost surely bounded by some constant;
  2. The stopping time T is almost surely finite and every step of the stopped martingale X_n is almost surely bounded by some constant; or
  3. The expected stopping time E(T) is finite and the absolute values of the martingale increments |X_n-X_{n-1}| are almost surely bounded by a constant.

Then E(X_T) = E(X_0).

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than 26^{12}) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by 26^{11}+26^4+26. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has a dollars, the second player has b dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let X_n denote the change in the second player’s fortune, and set X_0 = 0. Let T_k denote the first time s when X_s = k. Then our first question can be formalized as trying to determine \Pr(T_{-b} < T_a). Let t = \min \{ T_{-b}, T_a\}. Clearly t is a stopping time. By the optional stopping theorem we have that

\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))

thus \Pr(T_{-b} < T_a)=\frac{a}{a+b}.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: X_n^2-n is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.
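Before attempting the exercise, it may help to see the game in action. The following sketch simulates the game for the hypothetical stakes a = 3 and b = 7 and compares the first player's empirical win rate with \frac{a}{a+b}. It also prints the average length of the game, which you can compare with your own answer to the second question. The stakes, seed, and number of games are arbitrary.

```python
import random

def play(a, b, rng):
    """Play until one player is broke. Returns (first_player_won, rounds)."""
    x = 0  # change in the second player's fortune, as in the text
    rounds = 0
    while -b < x < a:
        x += rng.choice((-1, 1))  # fair coin toss decides who pays one dollar
        rounds += 1
    # The first player wins exactly when the second player has lost all b dollars.
    return x == -b, rounds

rng = random.Random(3)
a, b, games = 3, 7, 20_000
results = [play(a, b, rng) for _ in range(games)]
win_rate = sum(won for won, _ in results) / games
mean_rounds = sum(rounds for _, rounds in results) / games
print(win_rate, a / (a + b))  # empirical and theoretical win probability
print(mean_rounds)            # empirical answer to the second question
```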

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, one of the first problems shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with k-SAT for some k \ne 3 instead? Clearly the hardness of the problem is monotone increasing in k since k-SAT is a special case of (k+1)-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as k-SAT for any k>3. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals x and y are connected by a directed edge if there is a clause (\bar x \lor y). Recall that \bar x \lor y is equivalent to x \implies y, so the edges show the implications between the literals. (Since \bar x \lor y is also equivalent to \bar y \implies \bar x, each clause contributes two edges.) Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths x \to \bar x and \bar x \to x (since x \Leftrightarrow \bar x is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if no variable x has both of these paths. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs; the most well-known algorithms are all based on depth-first search.
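Here is a minimal sketch of this deterministic approach, using Kosaraju's two-pass depth-first search to find the strongly connected components. The encoding of literals as signed integers and the function name are choices made for this illustration. The recursion is fine for small instances but would need an explicit stack for large ones.

```python
from collections import defaultdict

def is_satisfiable_2sat(num_vars, clauses):
    """Decide a 2-SAT instance via strongly connected components.

    Literals are nonzero integers: v means variable v, -v its negation
    (an encoding chosen here for illustration).
    """
    graph = defaultdict(list)
    reverse = defaultdict(list)
    for x, y in clauses:
        # (x or y) is equivalent to (not x => y) and to (not y => x).
        for src, dst in ((-x, y), (-y, x)):
            graph[src].append(dst)
            reverse[dst].append(src)

    literals = [lit for v in range(1, num_vars + 1) for lit in (v, -v)]

    # Pass 1: record vertices in order of depth-first search finishing time.
    visited, order = set(), []
    def visit(u):
        visited.add(u)
        for w in graph[u]:
            if w not in visited:
                visit(w)
        order.append(u)
    for lit in literals:
        if lit not in visited:
            visit(lit)

    # Pass 2: search the reversed graph in decreasing finishing time;
    # each search tree is one strongly connected component.
    component = {}
    def assign(u, label):
        component[u] = label
        for w in reverse[u]:
            if w not in component:
                assign(w, label)
    for lit in reversed(order):
        if lit not in component:
            assign(lit, lit)

    # Unsatisfiable exactly when some x and its negation share a component.
    return all(component[v] != component[-v] for v in range(1, num_vars + 1))

print(is_satisfiable_2sat(2, [(1, 2), (-1, 2)]))                    # True
print(is_satisfiable_2sat(2, [(1, 2), (-1, 2), (1, -2), (-1, -2)])) # False
```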

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after O(n^2) rounds where n denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong: we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in O(n^2) steps.
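The random walk is short enough to write down directly. What follows is a sketch under a few assumptions of my own: literals are encoded as signed integers, the unsatisfied clause is picked uniformly at random (the algorithm allows any choice), and the flip budget is a generous constant times n^2, where the constant 100 is arbitrary.

```python
import random

def satisfies(assignment, clauses):
    """True if every clause contains at least one true literal."""
    def holds(literal):
        value = assignment[abs(literal)]
        return value if literal > 0 else not value
    return all(holds(x) or holds(y) for x, y in clauses)

def random_walk_2sat(num_vars, clauses, max_flips, rng):
    """Return a satisfying assignment (dict variable -> bool), or None if
    none was found within max_flips flips."""
    assignment = {v: False for v in range(1, num_vars + 1)}  # arbitrary start
    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not satisfies(assignment, [c])]
        if not unsatisfied:
            return assignment
        # Pick an unsatisfied clause, then flip one of its two literals at random.
        literal = rng.choice(rng.choice(unsatisfied))
        assignment[abs(literal)] = not assignment[abs(literal)]
    return assignment if satisfies(assignment, clauses) else None

rng = random.Random(9)
satisfiable = [(1, 2), (-1, 3), (-2, -3), (2, 3)]
unsatisfiable = [(1, 2), (-1, 2), (1, -2), (-1, -2)]
budget = 100 * 3 ** 2  # constant times n^2; the constant is an assumption
print(random_walk_2sat(3, satisfiable, budget, rng))
print(random_walk_2sat(2, unsatisfiable, budget, rng))  # None
```

Note the one-sided error: an answer of None on a satisfiable formula is possible but unlikely, while a returned assignment is always correct.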

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by \pm 1 in every round. It is also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least \frac{1}{2}. Such a stochastic process is called a supermartingale, and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond n. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number n, on the other hand, is called a reflecting barrier: we cannot reach n+1, and whenever we reach n the next step always takes us back to n-1.

There is an analogous version of the optional stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in O(n^2) steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett.

Simulating a Biased Coin with a Fair Coin

This is a guest post by my friend and colleague Adam Lelkes. Adam’s interests are in algebra and theoretical computer science. This gem came up because Adam gave a talk on probabilistic computation in which he discussed this technique.

Problem: simulate a biased coin using a fair coin.

Solution: (in Python)

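def biasedCoin(binaryDigitStream, fairCoin):
   # Compare fair coin tosses with the binary digits of p, one digit at a time.
   for d in binaryDigitStream:
      # At the first mismatch, return the digit of p at that position.
      if fairCoin() != d:
         return d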

Discussion: This function takes two arguments, an iterator representing the binary expansion of the intended probability of getting 1 (let us denote it as p) and another function that returns 1 or 0 with equal probability. At first glance this might seem like an overcomplicated way of solving this problem: why can’t the probability be a floating point number?

The point is that p can have infinite precision! Assuming that fairCoin() gives us a perfectly random stream of 1’s and 0’s (independently and with probability 1/2) and we can read each bit of the binary expansion of p, this function returns 1 with probability exactly p even if p is irrational or a fraction with infinite binary expansion. If we used floating point arithmetic there would be a small chance we get unlucky and exhaust the precision available. We would only get an approximation of the true bias at best.

Now let us explain why this algorithm works. We keep tossing our fair coins to get a sequence of random bits, until one of our random bits is different from the corresponding bit in the binary expansion of p. If we stop after i steps, that means that the first i-1 bits in the two binary sequences were the same and the ith bits differed, which happens with probability \frac{1}{2^{i-1}} \cdot \frac{1}{2} = \frac{1}{2^i}. Given that this happens, in the ith step we will return the ith bit of p; let us denote this bit by p_i. Then the probability of returning 1 is \sum_{i=1}^\infty \frac{p_i}{2^i}, which is exactly the binary expansion of p.

This algorithm is also efficient. By efficient here we mean that the expected running time is constant. Of course, to show this we need to make some assumption about the computational complexity of calculating the bits of p. If we assume that the bits of p are efficiently computable in the sense that the time required to compute p_i is bounded by a polynomial in i, then this algorithm does run in constant expected time.

Indeed, if n denotes the degree of the polynomial bounding the work done up to step i, the expected running time is at most \sum_{i=0}^\infty \frac{i^n}{2^i}. Showing that this sum is finite (and hence a constant) is an easy calculus exercise: using the ratio test we get that

\displaystyle \limsup_{i \to \infty} \left | \frac{\frac{(i+1)^n}{2^{i+1}}}{\frac{i^n}{2^i}} \right | = \limsup_{i\to\infty} \frac{\left(\frac{i+1}{i}\right)^n}{2} = \frac{1}{2} < 1,

thus the series is convergent.
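The convergence is also easy to see numerically. The sketch below evaluates partial sums of the series for a few small exponents, reading n as the degree of the polynomial bound; the function name and the cutoff of 200 terms are arbitrary.

```python
def expected_cost(n, terms=200):
    """Partial sum of sum_{i>=0} i**n / 2**i."""
    return sum(i ** n / 2 ** i for i in range(terms))

for n in (1, 2, 3):
    print(n, expected_cost(n))  # approaches 2, 6 and 26 respectively
```

The limits grow quickly with n, but for any fixed n the sum is a constant, which is all the argument needs.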

Now that we proved that our algorithm works, it’s time to try it! Let’s say that we want to simulate a coin which gives “heads” with probability 1/3.
We need to construct our binary digit stream. Since 1/3 is 0.010101… in binary, we could use the following simple generator:

def oneThird():
   while True:
      yield 0
      yield 1

However, we might want to have a more general generator that gives us the binary representation of any number. The following function, which takes a number between 0 and 1 as its argument, does the job:

def binaryDigits(fraction):
   while True:
      fraction *= 2
      yield int(fraction)
      fraction = fraction % 1

We also need a fair coin simulator. For this simulation, let’s just use Python’s built-in pseudo-random number generator:

import random

def fairCoin():
   return random.choice([0,1])

Let us toss our biased coin 10000 times and take the sum. We expect the sum to be around 3333. Indeed, when I tried

>>> sum(biasedCoin(oneThird(), fairCoin) for i in range(10000))
3330

It might be worth noting oneThird() is approximately ten times faster than binaryDigits(fractions.Fraction(1,3)), so when a large number of biased coins is needed, you can hardwire the binary representation of p into the program.
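For completeness, here is a self-contained variant of the experiment that uses exact rational digits and also counts how many fair coin tosses each biased toss consumes. It is a sketch rather than a replacement for the functions above: the snake_case names, the bias 1/5, and the seed are illustrative choices.

```python
import random
from fractions import Fraction

def binary_digits(fraction):
    """Yield the binary digits of a number in [0, 1) after the binary point."""
    while True:
        fraction *= 2
        yield int(fraction)
        fraction %= 1

def biased_coin(digit_stream, fair_coin):
    """Return (bit, tosses): the simulated biased bit and the fair tosses used."""
    for tosses, d in enumerate(digit_stream, start=1):
        if fair_coin() != d:
            return d, tosses

rng = random.Random(4)
fair_coin = lambda: rng.randrange(2)
p = Fraction(1, 5)  # exact rational, so no digits are lost to rounding
trials = 50_000
results = [biased_coin(binary_digits(p), fair_coin) for _ in range(trials)]
frequency = sum(bit for bit, _ in results) / trials
mean_tosses = sum(tosses for _, tosses in results) / trials
print(frequency)    # should be close to 0.2
print(mean_tosses)  # should be close to 2
```

The average of about two fair tosses per biased toss reflects the fact that each fair toss disagrees with the corresponding digit of p with probability 1/2.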

Simulating a Fair Coin with a Biased Coin

This is a guest post by my friend and colleague Adam Lelkes. Adam’s interests are in algebra and theoretical computer science. This gem came up because Adam gave a talk on probabilistic computation in which he discussed this technique.

Problem: Simulate a fair coin given only access to a biased coin.

Solution: (in Python)

def fairCoin(biasedCoin):
   coin1, coin2 = 0,0
   while coin1 == coin2:
      coin1, coin2 = biasedCoin(), biasedCoin()
   return coin1

Discussion: This is originally von Neumann’s clever idea. If we have a biased coin (i.e. a coin that comes up heads with probability different from 1/2), we can simulate a fair coin by tossing pairs of coins until the two results are different. Given that we have different results, the probability that the first is “heads” and the second is “tails” is the same as the probability of “tails” then “heads”. So if we simply return the value of the first coin, we will get “heads” or “tails” with the same probability, i.e. 1/2.

Note that we did not have to know or assume anything about our biasedCoin function other than it returns 0 or 1 every time, and the results between function calls are independent and identically distributed. In particular, we do not need to know the probability of getting 1. (However, that probability should be strictly between 0 and 1.) Also, we do not use any randomness directly, only through the biasedCoin function.

Here is a simple simulation:

from random import random
def biasedCoin():
   return int(random() < 0.2)

This function will return 1 with probability 0.2. If we try

sum(biasedCoin() for i in range(10000))

with high probability we will get a number that is close to 2000. I got 2058.

On the other hand, if we try

sum(fairCoin(biasedCoin) for i in range(10000))

we should see a value that is approximately 5000. Indeed, when I tried it, I got 4982, which is evidence that fairCoin(biasedCoin) returns 1 with probability 1/2 (although I already gave a proof!).

One might wonder how many calls to biasedCoin we expect to make before the function returns. One can recognize the experiment as a geometric distribution and use the known expected value, but it is short so here is a proof. Let s be the probability of seeing two different outcomes in a pair of biased coin flips, and t the expected number of flips until that happens. If after two flips we see the same outcome (HH or TT), then by independence the expected number of additional flips we need is unchanged. Hence

t = 2s + (1-s)(2 + t)

Simplifying gives t = 2/s, and since we know s = 2p(1-p) we expect to flip the coin \frac{1}{p(1-p)} times.
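This count is easy to check empirically. The sketch below is a variant of fairCoin that also reports how many biased flips it used; the variant's name, the bias p = 0.2, and the seed are assumptions made for illustration.

```python
import random

def fair_coin_with_count(biased_coin):
    """von Neumann's trick; returns (bit, number of biased flips used)."""
    flips = 0
    while True:
        first, second = biased_coin(), biased_coin()
        flips += 2
        if first != second:
            return first, flips

rng = random.Random(5)
p = 0.2
biased_coin = lambda: int(rng.random() < p)
trials = 50_000
results = [fair_coin_with_count(biased_coin) for _ in range(trials)]
mean_bit = sum(bit for bit, _ in results) / trials
mean_flips = sum(flips for _, flips in results) / trials
print(mean_bit)                        # should be close to 0.5
print(mean_flips, 1 / (p * (1 - p)))   # empirical average vs. 1/(p(1-p)) = 6.25
```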

For a deeper dive into this topic, see these notes by Michael Mitzenmacher from Harvard University. They discuss strategies for simulating a fair coin from a biased coin that are optimal in the expected number of flips required to run the experiment once. He has also written a book on the subject of randomness in computing.

Probably Approximately Correct — a Formal Theory of Learning

In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.

But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.

Leslie Valiant

Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called computational learning theory and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.

So let’s jump right in and see what this award-winning definition is all about.

Learning Intervals

The core idea of PAC-learnability is easy to understand, and we’ll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers x at random in some fixed way, and in Player 1’s mind he has an interval [a,b]. Whenever Player 1 gives out an x, he must also say whether it’s in the interval (that is, whether a \leq x \leq b). Let’s say that Player 1 reports a 1 if x is in the interval, and a 0 otherwise. We’ll call this number the label of x, and call the pair of (x, label) a sample, or an example. We recognize that the one and zero correspond to “yes” and “no” answers to some question (Is this email spam? Does the user click on my ad? etc.), and so sometimes the labels are instead \pm 1, and referred to as “positive” or “negative” examples. We’ll use the positive/negative terminology here, so positive is a 1 and negative is a 0.

Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine a and b. Of course Player 2 can’t guess the interval exactly if the endpoints are real numbers, because Player 1 only gives out finitely many samples. But whatever interval Player 2 does guess at the end can be tested against Player 1’s number-producing scheme. That is, we can compute the probability that Player 2’s interval will give an incorrect label if Player 1 were to continue giving out numbers indefinitely. If this error is small (taking into account how many samples were given), then Player 2 has “learned” the interval. And if Player 2 plays this game over and over and usually wins (no matter what strategy or interval Player 1 decides to use!), then we say this problem is PAC-learnable.

PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval with high probability.

Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to prove it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.
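As a concrete sketch of this algorithm, the code below fits the tightest interval around the positive examples. The target interval [0.25, 0.75], the uniform distribution on [0, 1], and the function names are assumptions chosen purely for illustration; nothing in the game requires them.

```python
import random

def learn_interval(samples):
    """Tightest-fit learner: samples are (x, label) pairs with label 0 or 1."""
    positives = [x for x, label in samples if label == 1]
    if not positives:
        return None  # no positive examples seen; predict 0 everywhere
    return min(positives), max(positives)

def predict(interval, x):
    """Label x with the learned interval (None means 'always 0')."""
    if interval is None:
        return 0
    lo, hi = interval
    return int(lo <= x <= hi)

rng = random.Random(6)
a, b = 0.25, 0.75  # Player 1's hidden interval (hypothetical)
samples = [(x, int(a <= x <= b)) for x in (rng.random() for _ in range(1000))]
hypothesis = learn_interval(samples)
print(hypothesis)  # an interval slightly inside [0.25, 0.75]
```

The learned interval always sits inside the true one, so it can only err on the two thin strips between its endpoints and the true endpoints.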

Distributions and Hypotheses

First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set X. Could be finite, could be infinite, no restrictions. And we’re getting samples randomly generated from X according to some fixed but arbitrary and unknown distribution D. To be completely rigorous, the samples are independent and identically distributed (they’re all drawn from the same D and independently so). This is Player 1’s dastardly decision: how should he pick his method to generate random numbers so as to bring Player 2’s algorithm to the most devastating ruin?

So then Player 1 is reduced to a choice of distribution D over X, and since we said that Player 2’s algorithm has to do well with high probability no matter what D is, then the definition becomes something like this:

A problem is PAC-learnable if there is an algorithm A which will likely win the game for all distributions D over X.

Now we have to talk about how “intervals” fit in to the general picture. Because if we’re going to talk about learning in general, we won’t always be working with intervals to make decisions. So we’re really saying that Player 1 picks some function c for classifying points in X as a 0 or a 1. We’ll call this a concept, or a target, and it’s the thing Player 2 is trying to learn. That is, Player 2 is producing her own function h that also labels points in X, and we’re comparing it to c. We call a function generated by Player 2 a hypothesis (hence the use of the letter h).

And how can we compare them? Well, we can compute the probability that they differ. We call this the error:

\displaystyle \textup{err}_{c,D}(h) = \textup{P}_D(h(x) \neq c(x))

One would say this aloud: “The error of the hypothesis h with respect to the concept c and the distribution D is the probability over x drawn via D that h(x) and c(x) differ”. Some might write the “differ” part as the symmetric difference of the two functions as sets. And then the error becomes the probability of that set under D, if that’s your cup of tea (it’s not mine).
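Since the error is just a probability, it can be estimated by drawing fresh points and counting disagreements. Here is a minimal sketch, again assuming interval concepts and the uniform distribution on [0, 1]; the helper names are made up for this example.

```python
import random

def interval_concept(lo, hi):
    """Return the labeling function of the interval [lo, hi]."""
    return lambda x: int(lo <= x <= hi)

def estimate_error(h, c, draw, trials):
    """Monte Carlo estimate of the probability that h and c disagree,
    where draw() produces one sample from the distribution D."""
    disagreements = sum(h(x) != c(x) for x in (draw() for _ in range(trials)))
    return disagreements / trials

rng = random.Random(7)
c = interval_concept(0.25, 0.75)  # target concept (hypothetical)
h = interval_concept(0.30, 0.70)  # hypothesis to evaluate
estimate = estimate_error(h, c, rng.random, 100_000)
print(estimate)  # under the uniform distribution the true error is 0.05 + 0.05 = 0.1
```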

So now for a problem to be PAC-learnable we can say something like,

A problem is PAC-learnable if there is an algorithm A which for any distribution D and any concept c will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.

There are still a few untrimmed hedges in this definition (like “some,” “small,” and “high”), but there’s still a more important problem: there are just too many possible concepts! Even for finite sets: there are 2^n \left \{ 0,1 \right \}-valued functions on a set of n elements, and if we hope to run in polynomial time we can only possibly express a minuscule fraction of those functions. Going back to the interval game, it’d be totally unreasonable to expect Player 2 to be able to get a reasonable hypothesis (using intervals or not!) if Player 1’s chosen concept is arbitrary. (The mathematician in me is imagining some crazy rule using non-measurable sets, but just suffice it to say: you might think you know everything about the real numbers, but you don’t.)

So we need to boil down what possibilities there are for the concepts c and the allowed expressive power of the learner. This is what concept classes are for.

Concept Classes

A concept class \mathsf{C} over X is a family of functions X \to \left \{ 0,1 \right \}. That’s all.

No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking \mathsf{C} to be the set of all characteristic functions of intervals, \chi_{[a,b]}(x) = 1 if a \leq x \leq b and 0 otherwise. Now the concept class becomes the sole focus of our algorithm. That is, the algorithm may use knowledge of the concept class to produce its hypotheses. So our working definition becomes:

A concept class \mathsf{C} is PAC-learnable if there is an algorithm A which, for any distribution D of samples and any concept c \in \mathsf{C}, will with high probability produce a hypothesis h \in \mathsf{C} whose error is small.

As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.

Probably Approximately Correct Learning

We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.

It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a query function which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.
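To make the query function concrete, here is a minimal Python sketch. The names interval_concept and make_query are made up for illustration, and the choices of a uniform distribution on [0, 1] and the target interval [0.25, 0.75] are assumptions, not part of the definition. Each call to the query function draws one point from D and labels it with the hidden concept.

```python
import random
from typing import Callable, Tuple

Concept = Callable[[float], int]
Query = Callable[[], Tuple[float, int]]


def interval_concept(a: float, b: float) -> Concept:
    """Characteristic function of the closed interval [a, b]."""
    return lambda x: 1 if a <= x <= b else 0


def make_query(concept: Concept, draw: Callable[[], float]) -> Query:
    """Build a query function: each call draws x from D and labels it with c."""
    def query() -> Tuple[float, int]:
        x = draw()
        return x, concept(x)
    return query


# Illustrative assumptions: D is uniform on [0, 1] and c is the interval [0.25, 0.75].
rng = random.Random(0)
query = make_query(interval_concept(0.25, 0.75), rng.random)

# The learner decides how many samples it wants by calling query() that many times.
sample = [query() for _ in range(5)]
```

The learner never sees the concept or the distribution directly; all of its information comes through calls to query.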

And now we can iron out those words like “some” and “small” and “high” in our working definition. Since we’re going for small error, we’ll introduce a parameter 0 < \varepsilon < 1/2 to represent our desired error bound. That is, our goal is to find a hypothesis h such that \textup{err}_{c,D}(h) \leq \varepsilon with high probability. And as \varepsilon gets smaller and smaller (as we expect more and more of it), we want to allow our algorithm more time to run, so we limit our algorithm to run in time and space polynomial in 1/\varepsilon.

We need another parameter to control the “high probability” part as well, so we’ll introduce 0 < \delta < 1/2 to represent the small fraction of the time we allow our learning algorithm to have high error. And so our goal becomes to, with probability at least 1-\delta, produce a hypothesis whose error is at most \varepsilon. In symbols, we want

\textup{P}_D (\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1 - \delta

Note that the \textup{P}_D refers to the probability over which samples you happen to get when you call the query function (and any other random choices made by the algorithm). The “high probability” hence refers to the unlikely event that you get data which is unrepresentative of the distribution generating it. Note that this is not the probability over which distribution is chosen; an algorithm which learns must still learn no matter what D is.

And again as we restrict \delta more and more, we want the algorithm to be allowed more time to run. So we require that the algorithm run in time polynomial in both 1/\varepsilon and 1/\delta.

And now we have all the pieces to state the full definition.

Definition: Let X be a set, and \mathsf{C} be a concept class over X. We say that \mathsf{C} is PAC-learnable if there is an algorithm A(\varepsilon, \delta) with access to a query function for \mathsf{C} and runtime O(\textup{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta})), such that for all c \in \mathsf{C}, all distributions D over X, and all inputs \varepsilon, \delta between 0 and 1/2, the probability that A produces a hypothesis h with error at most \varepsilon is at least 1- \delta. In symbols,

\displaystyle \textup{P}_{D}(\textup{P}_{x \sim D}(h(x) \neq c(x)) \leq \varepsilon) \geq 1-\delta

where the first \textup{P}_D is the probability over samples drawn from D during the execution of the program to produce h. Equivalently, we can express this using the error function,

\displaystyle \textup{P}_{D}(\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1-\delta

Excellent.

Intervals are PAC-Learnable

Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.
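As a rough sketch of that algorithm in Python (the function name learn_interval and the convention that None stands for the empty hypothesis are choices made here for illustration), assuming each sample is a pair of a point and its 0/1 label:

```python
from typing import Iterable, Optional, Tuple


def learn_interval(samples: Iterable[Tuple[float, int]]) -> Optional[Tuple[float, float]]:
    """Return the tightest interval containing the positive examples.

    Returns None (the empty hypothesis) when no positive example was seen.
    """
    positives = [x for x, label in samples if label == 1]
    if not positives:
        return None
    return min(positives), max(positives)


# A tiny hand-made sample: two negatives and two positives.
hypothesis = learn_interval([(0.1, 0), (0.3, 1), (0.6, 1), (0.9, 0)])
```

On the hand-made sample the hypothesis is the interval from 0.3 to 0.6, the smallest and largest positive examples.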

Let’s now prove that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:

  • Leave the number of samples you pick arbitrary, say m.
  • Figure out the probability that the total error of our produced hypothesis is > \varepsilon in terms of m.
  • Pick m to be sufficiently large that this event (failing to achieve low error) happens with small probability.

So fix any distribution D over the real line and say we have our m samples, we picked the max and min of the positive examples, and our interval is I = [a_1,b_1] when the target concept is J = [a_0, b_0]. We can notice one thing, that our hypothesis is contained in the true interval, I \subset J. That’s because the samples never lie, so the largest positive sample we saw can be no larger than the largest possible positive example, and vice versa. In other words a_0 \leq a_1 \leq b_1 \leq b_0. And so the probability of our hypothesis producing an error is just the probability that D produces a positive example in the two intervals A = [a_0, a_1], B = [b_1, b_0].

This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of A, B separately.

\displaystyle \textup{err}_{J, D}(I) \leq \textup{P}_{x \sim D}(x \in A) + \textup{P}_{x \sim D}(x \in B)

Here’s a picture.

The two green intervals are our regions where error can occur.

If we can guarantee that each of the green pieces carries probability weight at most \varepsilon / 2 (with high probability over the samples), then we’ll be done. Let’s look at A, and the same argument will hold for B. Define A' to be the interval [a_0, y] which is so big that the probability that a positive example is drawn from A' under D is exactly \varepsilon / 2. Here’s another picture to clarify that.

The pink region A' has total probability weight epsilon/2, and if the green region A is larger, we risk too much error to be PAC-learnable.

We’ll be in great shape if it’s already the case that A \subset A', because that implies the probability we draw a positive example from A is at most \varepsilon / 2. So we’re worried about the possibility that A' \subset A. But this can only happen if we never saw a point from A' as a sample during the run of our algorithm. Since we had m samples, we can compute in terms of m the probability of never seeing a sample from A'.

The probability of a single sample not being in A' is just 1 - \varepsilon/2 (by definition!). Recalling our basic probability theory, two draws are independent events, and so the probability of missing A' m times is equal to the product of the probabilities of each individual miss. That is, the probability that our chosen A contributes error greater than \varepsilon / 2 is at most

\displaystyle \textup{P}_D(A' \subset A) \leq (1 - \varepsilon / 2)^m

The same argument applies to B, so we know by the union bound that the probability of error > \varepsilon / 2 occurring in either A or B is at most the sum of the probabilities of large error in each piece, so that

\displaystyle \textup{P}_D(\textup{err}_{J,D}(I) > \varepsilon) \leq 2(1 - \varepsilon / 2)^m

Now for the third bullet. We want the chance that the error is big to be at most \delta, so that we’ll have low error with probability at least 1 - \delta. So simply set

\displaystyle 2(1 - \varepsilon / 2)^m \leq \delta

And solve for m. Using the fact that (1-x) \leq e^{-x} (which is proved by Taylor series), it’s enough to solve

\displaystyle 2e^{-\varepsilon m/2} \leq \delta,

and a fine solution is m \geq \frac{2}{\varepsilon} \log \frac{2}{\delta}.
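To get a feel for the numbers, here is a small sketch that computes this sample size, reading \log as the natural logarithm (which is what solving 2e^{-\varepsilon m/2} \leq \delta requires). The function name sample_size and the example values \varepsilon = 0.1, \delta = 0.05 are just illustrative choices.

```python
import math


def sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer m satisfying m >= (2 / epsilon) * ln(2 / delta)."""
    return math.ceil((2.0 / epsilon) * math.log(2.0 / delta))


# Example values (assumptions for illustration): 10% error, 5% failure probability.
epsilon, delta = 0.1, 0.05
m = sample_size(epsilon, delta)

# Sanity check against the original (tighter) inequality 2(1 - epsilon/2)^m <= delta.
bound = 2 * (1 - epsilon / 2) ** m
```

For these values, (2 / 0.1) \ln(2 / 0.05) = 20 \ln 40 \approx 73.8, so m = 74 samples suffice, and the quantity 2(1 - \varepsilon/2)^m is indeed below \delta.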

Now to cover all our bases: our algorithm simply computes m for its inputs \varepsilon, \delta, queries that many samples, and computes the tightest-fitting interval containing the positive examples. Since the number of samples is polynomial in 1/\varepsilon and 1/\delta (and our algorithm doesn’t do anything complicated), we comply with the time and space bounds. And finally we just proved that the chance our algorithm will misclassify more than an \varepsilon fraction of new points drawn from D is at most \delta. So we have proved the theorem:

Theorem: Intervals on the real line are PAC-learnable.

\square
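The theorem can also be checked empirically, at least for one distribution. The sketch below is a simulation under stated assumptions, not a proof: D is taken to be uniform on [0, 1] and the target is [0.25, 0.75], both chosen only because the error of a hypothesis is then easy to compute exactly. All function names are made up for this example. It repeats the whole learning procedure many times and records how often the error exceeds \varepsilon.

```python
import math
import random
from typing import List, Optional, Tuple

Interval = Optional[Tuple[float, float]]  # None stands for the empty hypothesis


def sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (2 / epsilon) * ln(2 / delta)."""
    return math.ceil((2.0 / epsilon) * math.log(2.0 / delta))


def learn_interval(samples: List[Tuple[float, int]]) -> Interval:
    """Tightest interval around the positive examples, or None if there are none."""
    positives = [x for x, label in samples if label == 1]
    return (min(positives), max(positives)) if positives else None


def uniform_error(hypothesis: Interval, a0: float, b0: float) -> float:
    """Exact error under the uniform distribution on [0, 1] for target [a0, b0].

    Assumes 0 <= a0 <= b0 <= 1 and that the hypothesis lies inside the target.
    """
    if hypothesis is None:
        return b0 - a0
    a1, b1 = hypothesis
    return (a1 - a0) + (b0 - b1)


def failure_rate(epsilon: float, delta: float, a0: float, b0: float,
                 trials: int, seed: int) -> float:
    """Fraction of independent runs whose hypothesis has error greater than epsilon."""
    rng = random.Random(seed)
    m = sample_size(epsilon, delta)
    failures = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        samples = [(x, 1 if a0 <= x <= b0 else 0) for x in xs]
        if uniform_error(learn_interval(samples), a0, b0) > epsilon:
            failures += 1
    return failures / trials


rate = failure_rate(epsilon=0.1, delta=0.05, a0=0.25, b0=0.75, trials=2000, seed=0)
print(rate)
```

The printed rate should come out no larger than \delta. For this particular distribution one would expect it to be noticeably smaller, since the union bound and the inequality (1-x) \leq e^{-x} both give away some slack.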

As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about arbitrary axis-aligned boxes in d-dimensional space? Where does d show up in the number of samples needed? Is this still efficient?

Comments and Previews

There are a few more technical details we’ve ignored in the course of this post, but the important ideas are clear. We have a formal model of learning which allows for certain pre-specified levels of imperfection, and we proved that one can learn how to recognize intervals in this model. It’s a far cry from decision trees and neural networks, but it’s a solid foundation to build upon.

However, PAC-learning is far from sacred. In particular, requiring a single algorithm to succeed no matter what the distribution D is was a deliberate choice, and it’s quite a strong requirement for a learning algorithm. As we’ll hopefully see in the future, distributions can be defined in quite arbitrary and pathological ways just for the singular purpose of showing a concept class isn’t efficiently PAC-learnable (or that an algorithm fails to show it is learnable). That is, even if the algorithm works great in practice! This is where learning theory starts to butt heads with the real world; in this respect PAC-learnability is much more of a mathematical subject than a practical one. We’re interested in knowing how this particular definition of what it means to learn works.

And so the kinds of questions we ask are: can we classify all PAC-learnable problems? Can we find a meta-algorithm that would work on any PAC-learnable concept class given some assumptions? How does PAC-learning relate to other definitions of learning? Say, one where we don’t require it to work for every distribution; would that really allow us to solve more problems?

It’s a question of finding out the deep truths of mathematics now, but we promise that this series will soon come back around to practical applications, for learning theory naturally entails the design and analysis of fascinating algorithms.

Until next time!