The Giant Component and Explosive Percolation

Last time we left off with a tantalizing conjecture: a random graph with edge probability p = 5/n is almost surely a connected graph. We arrived at that conjecture from some ad-hoc data analysis, so let’s go back and treat it with some more rigorous mathematical techniques. As we do, we’ll discover some very interesting “threshold theorems” that essentially say a random graph will either certainly have a property, or it will certainly not have it.


The phase transition we empirically observed from last time.

Big components

Recalling the basic definition: an Erdős-Rényi (ER) random graph with n vertices and edge probability p is a probability distribution over all graphs on n vertices. Generatively, you draw from an ER distribution by flipping a p-biased coin for each pair of vertices, and adding the edge if you flip heads. We call the random event of drawing a graph from this distribution a “random graph” even though it’s not a graph, and we denote an ER random graph by G(n,p). When p = 1/2, the distribution G(n,1/2) is the uniform distribution over all graphs on n vertices.

Now let’s get to some theorems. The main tools we’ll use are called the first and second moment method. Let’s illustrate them by example.

The first moment method

Say we want to know what values of p are likely to produce graphs with isolated vertices (vertices with no neighbors), and which are not. Of course, the value of p will depend on n \to \infty in general, but we can already see by example that if p = 1/2 then the probability of a fixed vertex being isolated is 2^{-n} \to 0. We can use the union bound (sum this value over all vertices) to show that the probability of any vertex being isolated is at most n2^{-n} which also tends to zero very quickly. This is not the first moment method, I’m just making the point that all of our results will be interpreted asymptotically as n \to \infty.

So now we can ask: what is the expected number of isolated vertices? If I call X the random variable that counts the expected number of isolated vertices, then I’m asking about \mathbb{E}[X]. Really what I’m doing is interpreting X as a random variable depending on n, p(n), and asking about the evolution of \mathbb{E}[X] as n \to \infty.

Now the first moment method states, somewhat obviously, that if the expectation tends to zero then the value of X itself also tends to zero. Indeed, this follows from Markov’s inequality, which states that the probability that X \geq a is bounded by \mathbb{E}[X]/a. In symbols,

\displaystyle \Pr[X \geq a] \leq \frac{\mathbb{E}[X]}{a}.

In our case X is counting something (it’s integer valued), so asking whether X > 0 is equivalent to asking whether X \geq 1. The upper bound on the probability of X being strictly positive is then just \mathbb{E}[X].

So let’s find out when the expected number of isolated vertices goes to zero. We’ll use the wondrous linearity of expectation to split X into a sum of counts for each vertex. That is, if X_i is 1 when vertex i is isolated and 0 otherwise (this is called an indicator variable), then X = \sum_{i=1}^n X_i and linearity of expectation gives

\displaystyle \mathbb{E}[X] = \mathbb{E}[\sum_{i=1}^n X_i] = \sum_{i=1}^n \mathbb{E}[X_i]

Now the expectation of an indicator random variable is just the probability that the event occurs (it’s trivial to check). It’s easy to compute the probability that a vertex is isolated: it’s (1-p)^n. So the sum above works out to be n(1-p)^n. It should really be n(1-p)^{n-1} but the extra factor of (1-p) doesn’t change anything. The question is what’s the “smallest” way to set p as a function of n in order to make the above thing go to zero? Using the fact that (1-x) < e^{-x} for all x > 0, we get

n(1-p)^n < ne^{-pn}

And setting p = (\log n) / n simplifies the right hand side to ne^{- \log n} = n / n = 1. This is almost what we want, so let’s set p to be anything that grows asymptotically faster than (\log n) / n. The notation for this is \omega((\log n) / n). Then using some slick asymptotic notation we can prove that the RHS of the inequality above goes to zero, and so the LHS must as well. Back to the big picture: we just showed that the expectation of X (the expected number of isolated vertices) goes to zero, and so by the first moment method the value of X (the actual number of isolated vertices) has to go to zero with probability tending to 1.

Some quick interpretations: when p = (\log n) / n each vertex has \log n neighbors in expectation. Moreover, having no isolated vertices is just a little bit short of the entire graph being connected (our ultimate goal is to figure out exactly when this happens). But already we can see that our conjecture from the beginning is probably false: we aren’t able to use this same method to show that when p = c/n for some constant c rules out isolated vertices as n \to \infty. We just got lucky in our data analysis that 5 is about the natural log of 100 (which is 4.6).

The second moment method

Now what about the other side of the coin? If p is asymptotically less than (\log n) / n do we necessarily get isolated vertices? That would really put our conjecture to rest. In this case the answer is yes, but it might not be in general. Let’s discuss.

We said that in general if \mathbb{E}[X] \to 0 then the value of X has to go to zero too (that’s the first moment method). The flip side of this is: if \mathbb{E}[X] \to \infty does necessarily the value of X also tend to infinity? The answer is not always yes. Here is a gruesome example I originally heard from a book: say X is the number of people that will die in the next decade due to an asteroid hitting the earth. The probability that the event happens is quite small, but if it does happen then the number of people that will die is quite large. It is perfectly reasonable for this to drag up the expectation (as the world population grows every decade), but at least we hope a growing population doesn’t by itself increase the value of X.

Mathematics is on our side here. We’re asking under what conditions on \mathbb{E}[X] does the following implication hold: \mathbb{E}[X] \to \infty implies \Pr[X > 0] \to 1.

With the first moment method we used Markov’s inequality (a statement about expectation, also called the first moment). With the second moment method we’ll use a statement about the second moment (variances), and the most common is Chebyshev’s inequality. Chebyshev’s inequality states that the probability X deviates from its expectation by more than c is bounded by \textup{Var}[X] / c^2. In symbols, for all c > 0 we have

\displaystyle \Pr[|X - \mathbb{E}[X]| \geq c] \leq \frac{\textup{Var}[X]}{c^2}

Now the opposite of X > 0, written in terms of deviation from expectation, is |X - \mathbb{E}[X]| \geq \mathbb{E}[X]. In words, in order for any number a to be zero, it has to have a distance of at least b from any number b. It’s such a stupidly simple statement it’s almost confusing. So then we’re saying that

\displaystyle \Pr[X = 0] \leq \frac{\textup{Var}[X]}{\mathbb{E}[X]^2}.

In order to make this probability go to zero, it’s enough to have \textup{Var}[X] = o(\mathbb{E}[X]^2). Again, the little-o means “grows asymptotically slower than.” So the numerator of the fraction on the RHS will grow asymptotically slower than the denominator, meaning the whole fraction tends to zero. This condition and its implication are together called the “second moment method.”

Great! So we just need to compute \textup{Var}[X] and check what conditions on p make it fit the theorem. Recall that \textup{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2, and we want to upper bound this in terms of \mathbb{E}[X]^2. Let’s compute \mathbb{E}[X]^2 first.

\displaystyle \mathbb{E}[X]^2 = n^2(1-p)^{2n}

Now the variance.

\displaystyle \textup{Var}[X] = \mathbb{E}[X^2] - n^2(1-p)^{2n}

Expanding X as a sum of indicator variables X_i for each vertex, we can split the square into a sum over pairs. Note that X_i^2 = X_i since they are 0-1 valued indicator variables, and X_iX_j is the indicator variable for both events happening simultaneously.

\displaystyle \begin{aligned} \mathbb{E}[X^2] &= \mathbb{E}[\sum_{i,j} X_{i,j}] \\ &=\mathbb{E} \left [ \sum_i X_i^2 + \sum_{i \neq j} X_iX_j \right ] \\ &= \sum_i \mathbb{E}[X_i^2] + \sum_{i \neq j} \mathbb{E}[X_iX_j] \end{aligned}

By what we said about indicators, the last line is just

\displaystyle \sum_i \Pr[i \textup{ is isolated}] + \sum_{i \neq j} \Pr[i,j \textup{ are both isolated}]

And we can compute each of these pieces quite easily. They are (asymptotically ignoring some constants):

\displaystyle n(1-p)^n + n^2(1-p)(1-p)^{2n-4}

Now combining the two terms together (subtracting off the square of the expectation),

\displaystyle \begin{aligned} \textup{Var}[X] &\leq n(1-p)^n + n^2(1-p)^{-3}(1-p)^{2n} - n^2(1-p)^{2n} \\ &= n(1-p)^n + n^2(1-p)^{2n} \left ( (1-p)^{-3} - 1 \right ) \end{aligned}

Now we divide by \mathbb{E}[X]^2 to get n^{-1}(1-p)^{-n} + (1-p)^{-3} - 1. Since we’re trying to see if p = (\log n) / n is a sharp threshold, the natural choice is to let p = o((\log n) / n). Indeed, using the 1-x < e^{-x} upper bound and plugging in the little-o bounds the whole quantity by

\displaystyle \frac{1}{n}e^{o(\log n)} + o(n^{1/n}) - 1 = o(1)

i.e., the whole thing tends to zero, as desired.

Other thresholds

So we just showed that the property of having no isolated vertices in a random graph has a sharp threshold at p = (\log n) / n. Meaning at any larger probability the graph is almost surely devoid of isolated vertices, and at any lower probability the graph almost surely has some isolated vertices.

This might seem like a miracle theorem, but there turns out to be similar theorems for lots of properties. Most of them you can also prove using basically the same method we’ve been using here. I’ll list some below. Also note they are all sharp, two-sided thresholds in the same way that the isolated vertex boundary is.

  • The existence of a component of size \omega(\log (n)) has a threshold of 1/n.
  • p = c/n for any c > 0 is a threshold for the existence of a giant component of linear size \Theta(n). Moreover, above this threshold no other components will have size \omega(\log n).
  • In addition to (\log n) / n being a threshold for having no isolated vertices, it is also a threshold for connectivity.
  • p = (\log n + \log \log n + c(n)) / n is a sharp threshold for the existence of Hamiltonian cycles in the following sense: if c(n) = \omega(1) then there will be a Hamilton cycle almost surely, if c(n) \to -\infty there will be no Hamiltonian cycle almost surely, and if c(n) \to c the probability of a Hamiltonian cycle is e^{-e^{-c}}. This was proved by Kolmos and Szemeredi in 1983. Moreover, there is an efficient algorithm to find Hamiltonian cycles in these random graphs when they exist with high probability.

Explosive Percolation

So now we know that as the probability of an edge increases, at some point the graph will spontaneously become connected; at some time that is roughly \log(n) before, the so-called “giant component” will emerge and quickly engulf the entire graph.

Here’s a different perspective on this situation originally set forth by Achlioptas, D’Souza, and Spencer in 2009. It has since become called an “Achlioptas process.”

The idea is that you are watching a random graph grow. Rather than think about random graphs as having a probability above or below some threshold, you can think of it as the number of edges growing (so the thresholds will all be multiplied by n). Then you can imagine that you start with an empty graph, and at every time step someone is adding a new random edge to your graph. Fine, eventually you’ll get so many edges that a giant component emerges and you can measure when that happens.

But now imagine that instead of being given a single random new edge, you are given a choice. Say God presents you with two random edges, and you must pick which to add to your graph. Obviously you will eventually still get a giant component, but the question is how long can you prevent it from occurring? That is, how far back can we push the threshold for connectedness by cleverly selecting the new edge?

What Achlioptas and company conjectured was that you can push it back (some), but that when you push it back as far as it can go, the threshold becomes discontinuous. That is, they believed there was a constant \delta \geq 1/2 such that the size of the largest component jumps from o(n) to \delta n in o(n) steps.

This turned out to be false, and Riordan and Warnke proved it. Nevertheless, the idea has been interpreted in an interesting light. People have claimed it is a useful model of disaster in the following sense. If you imagine that an edge between two vertices is a “crisis” relating two entities. Then in every step God presents you with two crises and you only have the resources to fix one. The idea is that when the entire graph is connected, you have this one big disaster where all the problems are interacting with each other. The percolation process describes how long you can “survive” while avoiding the big disaster.

There are critiques of this interpretation, though, mainly about how simplistic it is. In particular, an Achlioptas process models a crisis as an exogenous force when in reality problems are usually endogenous. You don’t expect a meteor to hit the Earth, but you do expect humans to have an impact on the environment. Also, not everybody in the network is trying to avoid errors. Some companies thrive in economic downturns by managing your toxic assets, for example. So one could reasonably argue that Achlioptas processes aren’t complex enough to model the realistic types of disasters we face.

Either way, I find it fantastic that something like a random graph (which for decades was securely in pure combinatorics away from applications) is spurring such discussion.

Next time, we’ll take one more dive into the theory of Erdős-Rényi random graphs to prove a very “meta” theorem about sharp thresholds. Then we’ll turn our attention to other models of random graphs, hopefully more realistic ones :)

Until then!

Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability p. If X denotes the number of trials needed to get the first success, then clearly \Pr(X = k) = (1-p)^{k-1} p (since first we need k-1 failures which occur independently with probability 1-p, then we need one success which happens with probability p). Thus the expected value of X is

\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p

by basic calculus. In particular, if success is defined as getting a six, then p=1/6 thus the expected time is 1/p=6.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is \frac{1}{26}, thus the probability of a random eleven-letter word being ABRACADABRA is exactly \left(\frac{1}{26}\right)^{11}. So if typing 11 letters is one trial, the expected number of trials is

\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}

which means 11\cdot 26^{11} keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that 11\cdot 26^{11} is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly \frac{1}{26} (thus the expected value of the change in the gambler’s fortune is \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as T. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly T dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins \$26^4. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be \$26^{11}. Thus our casino will have to give out 26^{11}+26^4+26 dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is T dollars, the expected value of our expenses is 26^{11}+26^4+26 dollars, thus E(T)=26^{11}+26^4+26. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. X_0 will denote the gambler’s fortune before the game starts, X_1 the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of X_n, given X_1, X_2, \ldots, X_{n-1} is the same as X_{n-1}. This can be written as E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1} or, equivalently, E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let X'_n = X_n if n < T and X'_n = X_T if n \ge T where T denotes the stopping time, i.e. the time at which the desired event occurs. Notice that T itself is a random variable.

We require our stopping time T to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let X_n be a martingale stopped at step T, and suppose one of the following three conditions hold:

  1. The stopping time T is almost surely bounded by some constant;
  2. The stopping time T is almost surely finite and every step of the stopped martingale X_n is almost surely bounded by some constant; or
  3. The expected stopping time E(T) is finite and the absolute value of the martingale increments |X_n-X_{n-1}| are almost surely bounded by a constant.

Then E(X_T) = E(X_0).

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than 26^{12}) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by 26^{11}+26^4+26. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has a dollars, the second player has b dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let X_n denote the change in the second player’s fortune, and set X_0 = 0. Let T_k denote the first time s when X_s = k. Then our first question can be formalized as trying to determine \Pr(T_{-b} < T_a). Let t = \min \{ T_{-b}, T_a\}. Clearly t is a stopping time. By the optional stopping theorem we have that

\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))

thus \Pr(T_{-b} < T_a)=\frac{a}{a+b}.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: X_n^2-n is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with k-SAT for some k \ne 3 instead? Clearly the hardness of the problem is monotone increasing in k since k-SAT is a special case of (k+1)-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as k-SAT for any k>3. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals x and y are connected by a directed edge if there is a clause (\bar x \lor y). Recall that \bar x \lor y is equivalent to x \implies y, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths x \to \bar x and \bar x \to x (since x \Leftrightarrow \bar x is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after O(n^2) rounds where n denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in O(n^2) steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by \pm 1 in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least \frac{1}{2}. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond n. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number n, on the other hand, is called a reflecting barrier: we cannot reach n+1, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in O(n^2) steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

Optimism in the Face of Uncertainty: the UCB1 Algorithm


The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of K actions labeled by the integers \left \{ 1, 2, \dots, K \right \}. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

Exploit away, you lucky ladies.

Exploit away, you lucky ladies.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action i be drawn from a fixed distribution D_i (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm A and a set of actions \left \{1, 2, \dots, K \right \}, the cumulative regret of A in rounds 1, \dots, T is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of A for the first T rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.


Let’s go ahead and draw up some notation. As before the actions are labeled by integers \left \{ 1, \dots, K \right \}. The reward of action i is a [0,1]-valued random variable X_i distributed according to an unknown distribution and possessing an unknown expected value \mu_i. The game progresses in rounds t = 1, 2, \dots so that in each round we have different random variables X_{i,t} for the reward of action i in round t (in particular, X_{i,t} and X_{i,s} are identically distributed). The X_{i,t} are independent as both t and i vary, although when i varies the distribution changes.

So if we were to play action 2 over and over for T rounds, then the total payoff would be the random variable G_2(T) = \sum_{t=1}^T X_{2,t}. But by independence across rounds and the linearity of expectation, the expected payoff is just \mu_2 T. So we can describe the best action as the action with the highest expected payoff. Define

\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i

We call the action which achieves the maximum i^*.

A policy is a randomized algorithm A which picks an action in each round based on the history of chosen actions and observed rewards so far. Define I_t to be the action played by A in round t and P_i(n) to be the number of times we’ve played action i in rounds 1 \leq t \leq n. These are both random variables. Then the cumulative payoff for the algorithm A over the first T rounds, denoted G_A(T), is just

\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}

and its expected value is simply

\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T)).

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy A after the first T steps, denoted R_A(T) can be written as

\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T)).

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than \Omega(\sqrt{KT}). It is also known that there are algorithms that guarantee no worse than O(\sqrt{KT}) expected regret. The algorithm we’ll see in the next section, however, only guarantees O(\sqrt{KT \log T}). We present it on this blog because of its simplicity and ubiquity in the field.

The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action \mu_i is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose Y_1, \dots, Y_n are independent random variables whose values lie in [0,1] and whose expected values are \mu_i. Call Y = \frac{1}{n}\sum_{i}Y_i and \mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of Y deviates from its mean. Specifically,

\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}

For us, the Y_i will be the payoff variables for a single action j in the rounds for which we choose action j. Then the variable Y is just the empirical average payoff for action j over all the times we’ve tried it. Moreover, a is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for a to find an upper bound big enough to be confident that we’re within a of the true mean.

Indeed, if we call n_j the number of times we played action j thus far, then n = n_j in the equation above, and using a = a(j,T) = \sqrt{2 \log(T) / n_j} we get that \textup{P}(Y > \mu + a) \leq T^{-4}, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action j, its upper bound a grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound a(j, T) is getting bigger for actions we haven’t played; any round in which we play an action j, it must be that a(j, T+1) = a(j,T), although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

Play each of the K actions once, giving initial values for empirical mean payoffs \overline{x}_i of each action i.
For each round t = K, K+1, \dots:

Let n_j represent the number of times action j was played so far.
Play the action j maximizing \overline{x}_j + \sqrt{2 \log t / n_j}.
Observe the reward X_{j,t} and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means x_j change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with K actions, each of whose reward distribution X_{i,t} has values in [0,1]. Then its expected cumulative regret after T rounds is at most O(\sqrt{KT \log T}).

Actually, we’ll prove a more specific theorem. Let \Delta_i be the difference \mu^* - \mu_i, where \mu^* is the expected payoff of the best action, and let \Delta be the minimal nonzero \Delta_i. That is, \Delta_i represents how suboptimal an action is and \Delta is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret \mathbb{E}(R_{\textup{UCB1}}(T)) is at most

\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if \Delta_i is small we will require more tries to know that action i is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the 1 + \pi^2 / 3 part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the O(\sqrt{KT \log T}) bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function R(\Delta_1, \dots, \Delta_K) (excluding of course \Delta^*), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}

and it’s zero if and only if for each i, \Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T}). However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the \Delta_i = O(\sqrt{\log T}) (which are all the same) gives a total bound of O(K \sqrt{\log T}). If we look at the only possible endpoint (the \Delta_i = 1), then we get a local maximum of O(K \sqrt{\log T}). But this isn’t the O(\sqrt{KT \log T}) we promised, what gives? Well, this upper bound grows arbitrarily large as the \Delta_i go to zero. But at the same time, if all the \Delta_i are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the \Delta_i = \Delta are the same, then another trivial regret bound is \Delta T (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at \Delta = O(\sqrt{(K \log T) / T}). Hence the regret bound at the switch is \Delta T = O(\sqrt{KT \log T}), as desired.

Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on P_i(T), the expected number of times UCB chooses an action up to round T. Using the \Delta notation, the regret is then just \sum_i \Delta_i \mathbb{E}(P_i(T)), and bounding the P_i‘s will bound the regret.

Recall the notation for our upper bound a(j, T) = \sqrt{2 \log T / P_j(T)} and let’s loosen it a bit to a(y, T) = \sqrt{2 \log T / y} so that we’re allowed to “pretend” a action has been played y times. Recall further that the random variable I_t has as its value the index of the machine chosen. We denote by \chi(E) the indicator random variable for the event E. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., \overline{x}^* is the empirical mean of the optimal action).

Indeed for any action i, the only way we know how to write down P_i(T) is as

\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action i. Now we’re just going to pull some number m-1 of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than m times overall, this requires an inequality.

P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)

These indicator functions should be read as sentences: we’re just saying that we’re picking action i in round t and we’ve already played i at least m times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action i in round t means that the upper bound for action i exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and i might coincide with the best action, but that’s fine). In notation this event is

\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)

Denote the upper bound \overline{x}_i + a(i,t) for action i in round t by U_i(t). Since this event must occur every time we pick action i (though not necessarily vice versa), we have

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)

We’ll do this process again but with a slightly more complicated event. If the upper bound of action i exceeds that of the optimal machine, it is also the case that the maximum upper bound for action i we’ve seen after the first m trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round t we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine i (except that it’s more than m). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by \overline{x}_{i,s} the random variable for the empirical mean after playing action i a total of s times, and \overline{x}^*_s the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )

Indeed, at each t for which the max is greater than the min, there will be at least one pair s,s' for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs s, s' for which it happens. That is, we can expand the event above into the double sum which is at least as large:

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )

We can make one other odd inequality by increasing the sum to go from t=1 to \infty. This will become clear later, but it means we can replace t-1 with t and thus have

\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t). Then what can we say? Well, consider the following three events:

(1) \displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)
(2) \displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)
(3) \displaystyle \mu^* < \mu_i + 2a(s, t)

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability t^{-4}. Likewise, (2) is the event that the empirical mean payoff of action i is larger than the upper confidence bound, which also occurs with probability t^{-4}. We will see momentarily that (3) is impossible for a well-chosen m (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}


\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}

But putting these two inequalities together gives us precisely that (3) is true:

\mu^* < \mu_i + 2a(s,t)

This proves the claim.

By the union bound, the probability that at least one of these events happens is 2t^{-4} plus whatever the probability of (3) being true is. But as we said, we’ll pick m to make (3) always false. Indeed m depends on which action i is being played, and if s \geq m > 8 \log T / \Delta_i^2 then 2a(s,t) \leq \Delta_i, and by the definition of \Delta_i we have

\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.


Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
   return math.sqrt(2 * math.log(step + 1) / numPlays)

And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
   payoffSums = [0] * numActions
   numPlays = [1] * numActions
   ucbs = [0] * numActions

   # initialize empirical sums
   for t in range(numActions):
      payoffSums[t] = reward(t,t)
      yield t, payoffSums[t], ucbs

   t = numActions

   while True:
      ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
      action = max(range(numActions), key=lambda i: ucbs[i])
      theReward = reward(action, t)
      numPlays[action] += 1
      payoffSums[action] += theReward

      yield action, theReward, ucbs
      t = t + 1

The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from [0,1], biased by 1/k for some 5 \leq k \leq 15. The regret and theoretical regret bound are given in the graph below.


The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

Probabilistic Bounds — A Primer

Probabilistic arguments are a key tool for the analysis of algorithms in machine learning theory and probability theory. They also assume a prominent role in the analysis of randomized and streaming algorithms, where one imposes a restriction on the amount of storage space an algorithm is allowed to use for its computations (usually sublinear in the size of the input).

While a whole host of probabilistic arguments are used, one theorem in particular (or family of theorems) is ubiquitous: the Chernoff bound. In its simplest form, the Chernoff bound gives an exponential bound on the deviation of sums of random variables from their expected value.

This is perhaps most important to algorithm analysis in the following mindset. Say we have a program whose output is a random variable X. Moreover suppose that the expected value of X is the correct output of the algorithm. Then we can run the algorithm multiple times and take a median (or some sort of average) across all runs. The probability that the algorithm gives a wildly incorrect answer is the probability that more than half of the runs give values which are wildly far from their expected value. Chernoff’s bound ensures this will happen with small probability.

So this post is dedicated to presenting the main versions of the Chernoff bound that are used in learning theory and randomized algorithms. Unfortunately the proof of the Chernoff bound in its full glory is beyond the scope of this blog. However, we will give short proofs of weaker, simpler bounds as a straightforward application of this blog’s previous work laying down the theory.

If the reader has not yet intuited it, this post will rely heavily on the mathematical formalisms of probability theory. We will assume our reader is familiar with the material from our first probability theory primer, and it certainly wouldn’t hurt to have read our conditional probability theory primer, though we won’t use conditional probability directly. We will refrain from using measure-theoretic probability theory entirely (some day my colleagues in analysis will like me, but not today).

Two Easy Bounds of Markov and Chebyshev

The first bound we’ll investigate is almost trivial in nature, but comes in handy. Suppose we have a random variable X which is non-negative (as a function). Markov’s inequality is the statement that, for any constant a > 0,

\displaystyle \textup{P}(X \geq a) \leq \frac{\textup{E}(X)}{a}

In words, the probability that X grows larger than some fixed constant is bounded by a quantity that is inversely proportional to the constant.

The proof is quite simple. Let \chi_a be the indicator random variable for the event that X \geq a (\chi_a = 1 when X \geq a and zero otherwise). As with all indicator random variables, the expected value of \chi_a is the probability that the event happens (if this is mysterious, use the definition of expected value). So \textup{E}(\chi_a) = \textup{P}(X \geq a), and linearity of expectation allows us to include a factor of a:

\textup{E}(a \chi_a) = a \textup{P}(X \geq a)

The rest of the proof is simply the observation that \textup{E}(a \chi_a) \leq \textup{E}(X). Indeed, as random variables we have the inequality a \chi_a \leq X. Whenever a < X, the former is equal to zero while the latter is nonnegative. And whenever a \geq X, the former is precisely a while the latter is by assumption at least a. It follows that \textup{E}(a \chi_a) \leq \textup{E}(X).

This last point is a simple property of expectation we omitted from our first primer. It usually goes by monotonicity of expectation, and we prove it here. First, if X \geq 0 then \textup{E}(X) \geq 0 (this is trivial). Second, if 0 \leq X \leq Y, then define a new random variable Z = Y-X. Since Z \geq 0 and using linearity of expectation, it must be that \textup{E}(Z) = \textup{E}(Y) - \textup{E}(X) \geq 0. Hence \textup{E}(X) \leq \textup{E}(Y). Note that we do require that X has a finite expected value for this argument to work, but if this is not the case then Markov’s inequality is nonsensical anyway.

Markov’s inequality by itself is not particularly impressive or useful. For example, if X is the number of heads in a hundred coin flips, Markov’s inequality ensures us that the probability of getting at least 99 heads is at most 50/99, which is about 1/2. Shocking. We know that the true probability is much closer to 2^{-100}, so Markov’s inequality is a bust.

However, it does give us a more useful bound as a corollary. This bound is known as Chebyshev’s inequality, and its use is sometimes referred to as the second moment method because it gives a bound based on the variance of a random variable (instead of the expected value, the “first moment”).

The statement is as follows.

Chebyshev’s Inequality: Let X be a random variable with finite expected value and positive variance. Then we can bound the probability that X deviates from its expected value by a quantity that is proportional to the variance of X. In particular, for any \lambda > 0,

\displaystyle \textup{P}(|X - \textup{E}(X)| \geq \lambda) \leq \frac{\textup{Var}(X)}{\lambda^2}

And without any additional assumptions on X, this bound is sharp.

Proof. The proof is a simple application of Markov’s inequality. Let Y = (X - \textup{E}(X))^2, so that \textup{E}(Y) = \textup{Var}(X). Then by Markov’s inequality

\textup{P}(Y \geq \lambda^2) \leq \frac{\textup{E}(Y)}{\lambda^2}

Since Y is nonnegative |X - \textup{E}(X)| = \sqrt(Y), and \textup{P}(Y \geq \lambda^2) = \textup{P}(|X - \textup{E}(X)| \geq \lambda). The theorem is proved. \square

Chebyshev’s inequality shows up in so many different places (and usually in rather dry, technical bits), that it’s difficult to give a good example application.  Here is one that shows up somewhat often.

Say X is a nonnegative integer-valued random variable, and we want to argue about when X = 0 versus when X > 0, given that we know \textup{E}(X). No matter how large \textup{E}(X) is, it can still be possible that \textup{P}(X = 0) is arbitrarily close to 1. As a colorful example, let X is the number of alien lifeforms discovered in the next ten years. We might debate that \textup{E}(X) can arbitrarily large: if some unexpected scientific and technological breakthroughs occur tomorrow, we could discover an unbounded number of lifeforms. On the other hand, we are very likely not to discover any, and probability theory allows for such a random variable to exist.

If we know everything about \textup{Var}(X), however, we can get more informed bounds.

Theorem: If \textup{E}(X) \neq 0, then \displaystyle \textup{P}(X = 0) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}.

Proof. Simply choose \lambda = \textup{E}(X) and apply Chebyshev’s inequality.

\displaystyle \textup{P}(X = 0) \leq \textup{P}(|X - \textup{E}(X)| \geq \textup{E}(X)) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}

The first inequality follows from the fact that the only time X can ever be zero is when |X - \textup{E}(X)| = \textup{E}(X), and X=0 only accounts for one such possibility. \square

This theorem says more. If we know that \textup{Var}(X) is significantly smaller than \textup{E}(X)^2, then X > 0 is more certain to occur. More precisely, and more computationally minded, suppose we have a sequence of random variables X_n so that \textup{E}(X_n) \to \infty as n \to \infty. Then the theorem says that if \textup{Var}(X_n) = o(\textup{E}(X_n)^2), then \textup{P}(X_n > 0) \to 1. Remembering one of our very early primers on asymptotic notation, f = o(g) means that f grows asymptotically slower than g, and in terms of this fraction \textup{Var}(X) / \textup{E}(X)^2, this means that the denominator dominates the fraction so that the whole thing tends to zero.

The Chernoff Bound

The Chernoff bound takes advantage of an additional hypothesis: our random variable is a sum of independent coin flips. We can use this to get exponential bounds on the deviation of the sum. More rigorously,

Theorem: Let X_1 , \dots, X_n be independent random \left \{ 0,1 \right \}-valued variables, and let X = \sum X_i. Suppose that \mu = \textup{E}(X). Then the probability that X deviates from \mu by more than a factor of \lambda > 0 is bounded from above:

\displaystyle \textup{P}(X > (1+\lambda)\mu) \leq \frac{e^{\lambda \mu}}{(1+\lambda)^{(1+\lambda)\mu}}

The proof is beyond the scope of this post, but we point the interested reader to these lecture notes.

We can apply the Chernoff bound in an easy example. Say all X_i are fair coin flips, and we’re interested in the probability of getting more than 3/4 of the coins heads. Here \mu = n/2 and \lambda = 1/2, so the probability is bounded from above by

\displaystyle \left ( \frac{e}{(3/2)^3} \right )^{n/4} \approx \frac{1}{5^n}

So as the number of coin flips grows, the probability of seeing such an occurrence diminishes extremely quickly to zero. This is important because if we want to test to see if, say, the coins are biased toward flipping heads, we can simply run an experiment with n sufficiently large. If we observe that more than 3/4 of the flips give heads, then we proclaim the coins are biased and we can be assured we are correct with high probability. Of course, after seeing 3/4 of more heads we’d be really confident that the coin is biased. A more realistic approach is to define some \varepsilon that is small enough so as to say, “if some event occurs whose probability is smaller than \varepsilon, then I call shenanigans.” Then decide how many coins and what bound one would need to make the bad event have probability approximately \varepsilon. Finding this balance is one of the more difficult aspects of probabilistic algorithms, and as we’ll see later all of these quantities are left as variables and the correct values are discovered in the course of the proof.

Chernoff-Hoeffding Inequality

The Hoeffding inequality (named after the Finnish statistician, Wassily Høffding) is a variant of the Chernoff bound, but often the bounds are collectively known as Chernoff-Hoeffding inequalities. The form that Hoeffding is known for can be thought of as a simplification and a slight generalization of Chernoff’s bound above.

Theorem: Let X_1, \dots, X_n be independent random variables whose values are within some range [a,b]. Call \mu_i = \textup{E}(X_i), X = \sum_i X_i, and \mu = \textup{E}(X) = \sum_i \mu_i. Then for all t > 0,

\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{-2t^2 / n(b-a)^2}

For example, if we are interested in the sum of n rolls of a fair six-sided die, then the probability that we deviate from (7/2)n by more than 5 \sqrt{n \log n} is bounded by 2e^{(-2 \log n)} = 2/n^2. Supposing we want to know how many rolls we need to guarantee with probability 0.01 that we don’t deviate too much, we just do the algebra:

2n^{-2} < 0.01
n^2 > 200
n > \sqrt{200} \approx 14

So with 15 rolls we can be confident that the sum of the rolls will lie between 20 and 85. It’s not the best possible bound we could come up with, because we’re completely ignoring the known structure on dice rolls (that they follow a uniform distribution!). The benefit is that it’s a quick and easy bound that works for any kind of random variable with that expected value.

Another version of this theorem concerns the average of the X_i, and is only a minor modification of the above.

Theorem: If X_1, \dots, X_n are as above, and X = \frac{1}{n} \sum_i X_i, with \mu = \frac{1}{n}(\sum_i \mu_i), then for all t > 0, we get the following bound

\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{-2nt^2/(b-a)^2}

The only difference here is the extra factor of n in the exponent. So the deviation is exponential both in the amount of deviation (t^2), and in the number of trials.

This theorem comes up very often in learning theory, in particular to prove Boosting works. Mathematicians will joke about how all theorems in learning theory are just applications of Chernoff-Hoeffding-type bounds. We’ll of course be seeing it again as we investigate boosting and the PAC-learning model in future posts, so we’ll see the theorems applied to their fullest extent then.

Until next time!