# Adversarial Bandits and the Exp3 Algorithm

In the last twenty years there has been a lot of research in a subfield of machine learning called Bandit Learning. The name comes from the problem of being faced with a large sequence of slot machines (once called one-armed bandits) each with a potentially different payout scheme. The problems in this field all focus on one central question:

If I have many available actions with uncertain outcomes, how should I act to maximize the quality of my results over many trials?

The deep question here is how to balance exploitation, the desire to choose an action which has payed off well in the past, with exploration, the desire to try options which may produce even better results. The ideas are general enough that it’s hard not to find applications: choosing which drug to test in a clinical study, choosing which companies to invest in, choosing which ads or news stories to display to users, and even (as Richard Feynman once wondered) how to maximize your dining enjoyment.

Herbert Robbins, one of the first to study bandit learning algorithms. Image credit

In less recent times (circa 1960′s), this problem was posed and considered in the case where the payoff mechanisms had a very simple structure: each slot machine is a coin flip with a different probability $p$ of winning, and the player’s goal is to find the best machine as quickly as possible. We called this the “stochastic” setting, and last time we saw a modern strategy called UCB1 which maintained statistical estimates on the payoffs of the actions and chose the action with the highest estimate. The underlying philosophy was “optimism in the face of uncertainty,” and it gave us something provably close to optimal.

Unfortunately payoff structures are more complex than coin flips in the real world. Having “optimism” is arguably naive, especially when it comes to competitive scenarios like stock trading. Indeed the algorithm we’ll analyze in this post will take the polar opposite stance, that payoffs could conceivably operate in any manner. This is called the adversarial model, because even though the payoffs are fixed in advance of the game beginning, it can always be the case that the next choice you make results in the worst possible payoff.

One might wonder how we can hope to do anything in such a pessimistic model. As we’ll see, our notion of performing well is relative to the best single slot machine, and we will argue that this is the only reasonable notion of success. On the other hand, one might argue that real world payoffs are almost never entirely adversarial, and so we would hope that algorithms which do well theoretically in the adversarial model excel beyond their minimal guarantees in practice.

In this post we’ll explore and implement one algorithm for adversarial bandit learning, called Exp3, and in the next post we’ll see how it fares against UCB1 in some applications. Some prerequisites: since the main algorithm presented in this post is randomized, its analysis requires some familiarity with techniques and notation from probability theory. Specifically, we will assume that the reader is familiar with the content of this blog’s basic probability theory primers (1, 2), though the real difficulty in the analysis will be keeping up with all of the notation.

In case the reader is curious, Exp3 was invented in 2001 by Auer, Cesa-Bianchi, Freund, and Schapire. Here is their original paper, which contains lots of other mathematical goodies.

As usual, all of the code and data produced in the making of this blog post is available for download on this blog’s Github page.

## Model Formalization and Notions of Regret

Before we describe the algorithm and analyze its we have to set up the problem formally. The first few paragraphs of our last post give a high-level picture of general bandit learning, so we won’t repeat that here. Recall, however, that we have to describe both the structure of the payoffs and how success is measured. So let’s describe the former first.

Definition: An adversarial bandit problem is a pair $(K, \mathbf{x})$, where $K$ represents the number of actions (henceforth indexed by $i$), and $\mathbf{x}$ is an infinite sequence of payoff vectors $\mathbf{x} = \mathbf{x}(1), \mathbf{x}(2), \dots$, where $\mathbf{x}(t) = (x_1(t), \dots, x_K(t))$ is a vector of length $K$ and $x_i(t) \in [0,1]$ is the reward of action $i$ on step $t$.

In English, the game is played in rounds (or “time steps”) indexed by $t = 1, 2, \dots$, and the payoffs are fixed for each action and time before the game even starts. Note that we assume the reward of an action is a number in the interval $[0,1]$, but all of our arguments in this post can be extended to payoffs in some range $[a,b]$ by shifting by $a$ and dividing by $b-a$.

Let’s specify what the player (algorithm designer) knows during the course of the game. First, the value of $K$ is given, and total number of rounds is kept secret. In each round, the player has access to the history of rewards for the actions that were chosen by the algorithm in previous rounds, but not the rewards of unchosen actions. In other words, it will only ever know one $x_i(t)$ for each $t$. To set up some notation, if we call $i_1, \dots, i_t$ the list of chosen actions over $t$ rounds, then at step $t+1$ the player has access to the values of $x_{i_1}(1), \dots, x_{i_t}(t)$ and must pick $i_{t+1}$ to continue.

So to be completely clear, the game progresses as follows:

The player is given access to $K$.
For each time step $t$:

The player must pick an action $i_t$.
The player observes the reward $x_i(t) \in [0,1]$, which he may save for future use.

The problem gives no explicit limit on the amount of computation performed during each step, but in general we want it to run in polynomial time and not depend on the round number $t$. If runtime even logarithmically depended on $t$, then we’d have a big problem using it for high-frequency applications. For example in ad serving, Google processes on the order of $10^9$ ads per day; so a logarithmic dependence wouldn’t be that bad, but at some point in the distance future Google wouldn’t be able to keep up (and we all want long-term solutions to our problems).

Note that the reward vectors $\mathbf{x}_t$ must be fixed in advance of the algorithm running, but this still allows a lot of counterintuitive things. For example, the payoffs can depend adversarially on the algorithm the player decides to use. For example, if the player chooses the stupid strategy of always picking the first action, then the adversary can just make that the worst possible action to choose. However, the rewards cannot depend on the random choices made by the player during the game.

So now let’s talk about measuring success. For an algorithm $A$ which chooses the sequence $i_1, \dots, i_t$ of actions, define $G_A(t)$ to be the sum of the observed rewards

$\displaystyle G_A(t) = \sum_{s=1}^t x_{i_s}(s)$.

And because $A$ will often be randomized, this value is a random variable depending on the decisions made by $A$. As such, we will often only consider the payoff up to expectation. That is, we’ll be interested in how $\textup{E}(G_A(t))$ relates to other possible courses of action. To be completely rigorous, the randomization is not over “choices made by an algorithm,” but rather the probability distribution over sequences of actions that the algorithm induces. It’s a fine distinction but a necessary one. In other words, we could define any sequence of actions $\mathbf{j} = (j_1, \dots, j_t)$ and define $G_{\mathbf{j}}(t)$ analogously as above:

$\displaystyle G_{\mathbf{j}}(t) = \sum_{s=1}^t x_{j_s}(s)$.

Any algorithm and choice of reward vectors induces a probability distribution over sequences of actions in a natural way (if you want to draw from the distribution, just run the algorithm). So instead of conditioning our probabilities and expectations on previous choices made by the algorithm, we do it over histories of actions $i_1, \dots, i_t$.

An obvious question we might ask is: why can’t the adversary just make all the payoffs zero? (or negative!) In this event the player won’t get any reward, but he can emotionally and psychologically accept this fate. If he never stood a chance to get any reward in the first place, why should he feel bad about the inevitable result? What a truly cruel adversary wants is, at the end of the game, to show the player what he could have won, and have it far exceed what he actually won. In this way the player feels regret for not using a more sensible strategy, and likely turns to binge eating cookie dough ice cream. Or more likely he returns to the casino to lose more money. The trick that the player has up his sleeve is precisely the randomness in his choice of actions, and he can use its objectivity to partially overcome even the nastiest of adversaries.

The adversary would love to show you this bluff after you choose to fold your hand. What a jerk. Image credit

Sadism aside, this thought brings us to a few mathematical notions of regret that the player algorithm may seek to minimize. The first, most obvious, and least reasonable is the worst-case regret. Given a stopping time $T$ and a sequence of actions $\mathbf{j} = (j_1, \dots, j_T)$, the expected regret of algorithm $A$ with respect to $\mathbf{j}$ is the difference $G_{\mathbf{j}}(T) - \mathbb{E}(G_A(T))$. This notion of regret measures the regret of a player if he knew what would have happened had he played $\mathbf{j}$.  The expected worst-case regret of $A$ is then the maximum over all sequences $\mathbf{j}$ of the regret of $A$ with respect to $\mathbf{j}$. This notion of regret seems particularly unruly, especially considering that the payoffs are adversarial, but there are techniques to reason about it.

However, the focus of this post is on a slightly easier notion of regret, called weak regret, which instead compares the results of $A$ to the best single action over all rounds. That is, this quantity is just

$\displaystyle \left ( \max_{j} \sum_{t=1}^T x_j(t) \right ) - \mathbb{E}(G_A(T))$

We call the parenthetical term $G_{\textup{max}}(T)$. This kind of regret is a bit easier to analyze, and the main theorem of this post will given an upper bound on it for Exp3. The reader who read our last post on UCB1 will wonder why we make a big distinction here just to arrive at the same definition of regret that we had in the stochastic setting. But with UCB1 the best sequence of actions to take just happened to be to play the best action over and over again. Here, the payoff difference between the best sequence of actions and the best single action can be arbitrarily large.

## Exp3 and an Upper Bound on Weak Regret

We now describe at the Exp3 algorithm.

Exp3 stands for Exponential-weight algorithm for Exploration and Exploitation. It works by maintaining a list of weights for each of the actions, using these weights to decide randomly which action to take next, and increasing (decreasing) the relevant weights when a payoff is good (bad). We further introduce an egalitarianism factor $\gamma \in [0,1]$ which tunes the desire to pick an action uniformly at random. That is, if $\gamma = 1$, the weights have no effect on the choices at any step.

The algorithm is readily described in Python code, but we need to set up some notation used in the proof of the theorem. The pseudocode for the algorithm is as follows.

Exp3

1. Given $\gamma \in [0,1]$, initialize the weights $w_i(1) = 1$ for $i = 1, \dots, K$.
2. In each round $t$:
1.  Set $\displaystyle p_i(t) = (1-\gamma)\frac{w_i(t)}{\sum_{j=1}^K w_j(t)} + \frac{\gamma}{K}$ for each $i$.
2. Draw the next action $i_t$ randomly according to the distribution of $p_i(t)$.
3. Observe reward $x_{i_t}(t)$.
4. Define the estimated reward $\hat{x}_{i_t}(t)$ to be $x_{i_t}(t) / p_{i_t}(t)$.
5. Set $\displaystyle w_{i_t}(t+1) = w_{i_t}(t) e^{\gamma \hat{x}_{i_t}(t) / K}$
6. Set all other $w_j(t+1) = w_j(t)$.

The choices of these particular mathematical quantities (in steps 1, 4, and 5) are a priori mysterious, but we will explain them momentarily. In the proof that follows, we will extend $\hat{x}_{i_t}(t)$ to indices other than $i_t$ and define those values to be zero.

The Python implementation is perhaps more legible, and implements the possibly infinite loop as a generator:

def exp3(numActions, reward, gamma):
weights = [1.0] * numActions

t = 0
while True:
probabilityDistribution = distr(weights, gamma)
choice = draw(probabilityDistribution)
theReward = reward(choice, t)

estimatedReward = 1.0 * theReward / probabilityDistribution[choice]
weights[choice] *= math.exp(estimatedReward * gamma / numActions) # important that we use estimated reward here!

yield choice, theReward, estimatedReward, weights
t = t + 1


Here the “rewards” variable refers to a callable which accepts as input the action chosen in round $t$ (keeps track of $t$, assuming we’ll play nice), and returns as output the reward for that choice. The distr and draw functions are also easily defined, with the former depending on the gamma parameter as follows:

def distr(weights, gamma=0.0):
theSum = float(sum(weights))
return tuple((1.0 - gamma) * (w / theSum) + (gamma / len(weights)) for w in weights)


There is one odd part of the algorithm above, and that’s the “estimated reward” $\hat{x}_{i_t}(t) = x_{i_t}(t) / p_{i_t}(t)$. The intuitive reason to do this is to compensate for a potentially small probability of getting the observed reward. More formally, it ensures that the conditional expectation of the “estimated reward” is the actual reward. We will explore this formally during the proof of the main theorem.

As usual, the programs we write in this post are available on this blog’s Github page.

We can now state and prove the upper bound on the weak regret of Exp3. Note all logarithms are base $e$.

Theorem: For any $K > 0, \gamma \in (0, 1]$, and any stopping time $T \in \mathbb{N}$

$\displaystyle G_{\textup{max}}(T) - \mathbb{E}(G_{\textup{Exp3}}(T)) \leq (e-1)\gamma G_{\textup{max}}(T) + \frac{K \log K}{\gamma}$.

This is a purely analytical result because we don’t actually know what $G_{\textup{max}}(T)$ is ahead of time. Also note how the factor of $\gamma$ occurs: in the first term, having a large $\gamma$ will result in a poor upper bound because it occurs in the numerator of that term: too much exploration means not enough exploitation. But it occurs in the denominator of the second term, meaning that not enough exploration can also produce an undesirably large regret. This theorem then provides a quantification of the tradeoff being made, although it is just an upper bound.

Proof.

We present the proof in two parts. Part 1:

We made a notable mistake in part 1, claiming that $e^x \leq 1 + x + (e-2)x^2$ when $x \leq 1$. In fact, this does follow from the Taylor series expansion of $e$, but it’s not as straightforward as I made it sound. In particular, note that $e^x = 1 + x + \frac{x^2}{2!} + \dots$, and so $e^1 = 2 + \sum_{k=2}^\infty \frac{1}{k!}$. Using $(e-2)$ in place of $\frac{1}{2}$ gives

$\displaystyle 1 + x + \left ( \sum_{k=2}^{\infty} \frac{x^2}{k!} \right )$

And since $0 < x \leq 1$, each term in the sum will decrease when replaced by $\frac{x^k}{k!}$, and we’ll be left with exactly $e^x$. In other words, this is the tightest possible quadratic upper bound on $e^x$. Pretty neat! On to part 2:

As usual, here is the entire canvas made over the course of both videos.

$\square$

We can get a version of this theorem that is easier to analyze by picking a suitable choice of $\gamma$.

Corollary: Assume that $G_{\textup{max}}(T)$ is bounded by $g$, and that Exp3 is run with

$\displaystyle \gamma = \min \left ( 1, \sqrt{\frac{K \log K}{(e-1)g}} \right )$

Then the weak regret of Exp3 is bounded by $2.63 \sqrt{g K \log K}$ for any reward vector $\mathbf{x}$.

Proof. Simply plug $\gamma$ in the bound in the theorem above, and note that $2 \sqrt{e-1} < 2.63$.

## A Simple Test Against Coin Flips

Now that we’ve analyzed the theoretical guarantees of the Exp3 algorithm, let’s use our implementation above and see how it fares in practice. Our first test will use 10 coin flips (Bernoulli trials) for our actions, with the probabilities of winning (and the actual payoff vectors) defined as follows:

biases = [1.0 / k for k in range(2,12)]
rewardVector = [[1 if random.random() < bias else 0 for bias in biases] for _ in range(numRounds)]
rewards = lambda choice, t: rewardVector[t][choice]


If we are to analyze the regret of Exp3 against the best action, we must compute the payoffs for all actions ahead of time, and compute which is the best. Obviously it will be the one with the largest probability of winning (the first in the list generated above), but it might not be, so we have to compute it. Specifically, it’s the following argmax:

bestAction = max(range(numActions), key=lambda action: sum([rewardVector[t][action] for t in range(numRounds)]))


Where the max function is used as “argmax” would be in mathematics.

We also have to pick a good choice of $\gamma$, and the corollary from the previous section gives us a good guide to the optimal $\gamma$: simply find a good upper bound on the reward of the best action, and use that. We can cheat a little here: we know the best action has a probability of 1/2 of paying out, and so the expected reward if we always did the best action is half the number of rounds. If we use, say, $g = 2T / 3$ and compute $\gamma$ using the formula from the corollary, this will give us a reasonable (but perhaps not perfectly correct) upper bound.

Then we just run the exp3 generator for $T = \textup{10,000}$ rounds, and compute some statistics as we go:

bestUpperBoundEstimate = 2 * numRounds / 3
gamma = math.sqrt(numActions * math.log(numActions) / ((math.e - 1) * bestUpperBoundEstimate))
gamma = 0.07

cumulativeReward = 0
bestActionCumulativeReward = 0
weakRegret = 0

t = 0
for (choice, reward, est, weights) in exp3(numActions, rewards, gamma):
cumulativeReward += reward
bestActionCumulativeReward += rewardVector[t][bestAction]

weakRegret = (bestActionCumulativeReward - cumulativeReward)
regretBound = (math.e - 1) * gamma * bestActionCumulativeReward + (numActions * math.log(numActions)) / gamma

t += 1
if t >= numRounds:
break


At the end of one run of ten thousand rounds, the weights are overwhelmingly in favor of the best arm. The cumulative regret is 723, compared to the theoretical upper bound of 897. It’s not too shabby, but by tinkering with the value of $\gamma$ we see that we can get regrets lower than 500 (when $\gamma$ is around 0.7). Considering that the cumulative reward for the player is around 4,500 in this experiment, that means we spent only about 500 rounds out of ten thousand exploring non-optimal options (and also getting unlucky during said exploration). Not too shabby at all.

Here is a graph of a run of this experiment.

A run of Exp3 against Bernoulli rewards. The first graph represents the simple regret of the player algorithm against the best action; the blue line is the actual simple regret, and the green line is the theoretical O(sqrt(k log k)) upper bound. The second graph shows the weights of each action evolving over time. The blue line is the weight of the best action, while the green and red lines are the weights of the second and third best actions.

Note how the Exp3 algorithm never stops increasing its regret. This is in part because of the adversarial model; even if Exp3 finds the absolutely perfect action to take, it just can’t get over the fact that the world might try to screw it over. As long as the $\gamma$ parameter is greater than zero, Exp3 will explore bad options just in case they turn out to be good. The benefits of this is that if the model changes over time Exp3 will adapt, but the downside is that the pessimism inherent in this worldview generally results in lower payoffs than other algorithms.

## More Variations, and Future Plans

Right now we have two contesting models of how the world works: is it stochastic and independent, like the UCB1 algorithm would optimize for? Or does it follow Exp3′s world view that the payoffs are adversarial? Next time we’ll run some real-world tests to see how each fares.

But before that, we should note that there are still more models we haven’t discussed. One extremely significant model is that of contextual bandits. That is, the real world settings we care about often come with some “context” associated with each trial. Ads being displayed to users have probabilities that should take into account the information known about the user, and medical treatments should take into account past medical history. While we will not likely investigate any contextual bandit algorithms on this blog in the near future, the reader who hopes to apply this work to his or her own exploits (no pun intended) should be aware of the additional reading.

Until next time!

# Optimism in the Face of Uncertainty: the UCB1 Algorithm

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

## Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of $K$ actions labeled by the integers $\left \{ 1, 2, \dots, K \right \}$. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action $i$ be drawn from a fixed distribution $D_i$ (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm $A$ and a set of actions $\left \{1, 2, \dots, K \right \}$, the cumulative regret of $A$ in rounds $1, \dots, T$ is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of $A$ for the first $T$ rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

## Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers $\left \{ 1, \dots, K \right \}$. The reward of action $i$ is a $[0,1]$-valued random variable $X_i$ distributed according to an unknown distribution and possessing an unknown expected value $\mu_i$. The game progresses in rounds $t = 1, 2, \dots$ so that in each round we have different random variables $X_{i,t}$ for the reward of a single action $i$ in round $t$. The $X_{i,t}$ are independent as both $t$ and $i$ vary, although when $i$ varies the distribution changes.

So if we were to play action 2 over and over for $T$ rounds, then the total payoff would be the random variable $G_2(T) = \sum_{t=1}^T X_{2,t}$. But by independence across rounds and the linearity of expectation, the expected payoff is just $\mu_2 T$. So we can describe the best action as the action with the highest expected payoff. Define

$\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i$

We call the action which achieves the maximum $i^*$.

A policy is a randomized algorithm $A$ which picks an action in each round based on the history of chosen actions and observed rewards so far. Define $I_t$ to be the action played by $A$ in round $t$ and $P_i(n)$ to be the number of times we’ve played action $i$ in rounds $1 \leq t \leq n$. These are both random variables. Then the cumulative payoff for the algorithm $A$ over the first $T$ rounds, denoted $G_A(T)$, is just

$\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}$

and its expected value is simply

$\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T))$.

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy $A$ after the first $T$ steps, denoted $R_A(T)$ can be written as

$\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)$

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

$\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T))$.

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than $\Omega(\sqrt{KT})$. It is also known that there are algorithms that guarantee no worse than $O(\sqrt{KT})$ expected regret. The algorithm we’ll see in the next section, however, only guarantees $O(\sqrt{KT \log T})$. We present it on this blog because of its simplicity and ubiquity in the field.

## The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action $\mu_i$ is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose $Y_1, \dots, Y_n$ are independent random variables whose values lie in $[0,1]$ and whose expected values are $\mu_i$. Call $Y = \frac{1}{n}\sum_{i}Y_i$ and $\mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i$. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of $Y$ deviates from its mean. Specifically,

$\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}$

For us, the $Y_i$ will be the payoff variables for a single action $j$ in the rounds for which we choose action $j$. Then the variable $Y$ is just the empirical average payoff for action $j$ over all the times we’ve tried it. Moreover, $a$ is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for $a$ to find an upper bound big enough to be confident that we’re within $a$ of the true mean.

Indeed, if we call $n_j$ the number of times we played action $j$ thus far, then $n = n_j$ in the equation above, and using $a = a(j,T) = \sqrt{2 \log(T) / n_j}$ we get that $\textup{P}(Y > \mu + a) \leq T^{-4}$, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action $j$, its upper bound $a$ grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound $a(j, T)$ is getting bigger for actions we haven’t played; any round in which we play an action $j$, it must be that $a(j, T+1) = a(j,T)$, although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the $K$ actions once, giving initial values for empirical mean payoffs $\overline{x}_i$ of each action $i$.
For each round $t = K, K+1, \dots$:

Let $n_j$ represent the number of times action $j$ was played so far.
Play the action $j$ maximizing $\overline{x}_j + \sqrt{2 \log t / n_j}$.
Observe the reward $X_{j,t}$ and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means $x_j$ change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with $K$ actions, each of whose reward distribution $X_{i,t}$ has values in [0,1]. Then its expected cumulative regret after $T$ rounds is at most $O(\sqrt{KT \log T})$.

Actually, we’ll prove a more specific theorem. Let $\Delta_i$ be the difference $\mu^* - \mu_i$, where $\mu^*$ is the expected payoff of the best action, and let $\Delta$ be the minimal nonzero $\Delta_i$. That is, $\Delta_i$ represents how suboptimal an action is and $\Delta$ is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret $\mathbb{E}(R_{\textup{UCB1}}(T))$ is at most

$\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )$

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if $\Delta_i$ is small we will require more tries to know that action $i$ is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the $1 + \pi^2 / 3$ part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the $O(\sqrt{KT \log T})$ bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function $R(\Delta_1, \dots, \Delta_K)$ (excluding of course $\Delta^*$), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

$\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$

and it’s zero if and only if for each $i$, $\Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T})$. However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the $\Delta_i = O(\sqrt{\log T})$ (which are all the same) gives a total bound of $O(K \sqrt{\log T})$. If we look at the only possible endpoint (the $\Delta_i = 1$), then we get a local maximum of $O(K \sqrt{\log T})$. But this isn’t the $O(\sqrt{KT \log T})$ we promised, what gives? Well, this upper bound grows arbitrarily large as the $\Delta_i$ go to zero. But at the same time, if all the $\Delta_i$ are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the $\Delta_i = \Delta$ are the same, then another trivial regret bound is $\Delta T$ (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at $\Delta = O(\sqrt{(K \log T) / T})$. Hence the regret bound at the switch is $\Delta T = O(\sqrt{KT \log T})$, as desired.

## Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on $P_i(T)$, the expected number of times UCB chooses an action up to round $T$. Using the $\Delta$ notation, the regret is then just $\sum_i \Delta_i \mathbb{E}(P_i(T))$, and bounding the $P_i$‘s will bound the regret.

Recall the notation for our upper bound $a(j, T) = \sqrt{2 \log T / P_j(T)}$ and let’s loosen it a bit to $a(y, T) = \sqrt{2 \log T / y}$ so that we’re allowed to “pretend” a action has been played $y$ times. Recall further that the random variable $I_t$ has as its value the index of the machine chosen. We denote by $\chi(E)$ the indicator random variable for the event $E$. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., $\overline{x}^*$ is the empirical mean of the optimal action).

Indeed for any action $i$, the only way we know how to write down $P_i(T)$ is as

$\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)$

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action $i$. Now we’re just going to pull some number $m-1$ of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than $m$ times overall, this requires an inequality.

$P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)$

These indicator functions should be read as sentences: we’re just saying that we’re picking action $i$ in round $t$ and we’ve already played $i$ at least $m$ times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action $i$ in round $t$ means that the upper bound for action $i$ exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and $i$ might coincide with the best action, but that’s fine). In notation this event is

$\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)$

Denote the upper bound $\overline{x}_i + a(i,t)$ for action $i$ in round $t$ by $U_i(t)$. Since this event must occur every time we pick action $i$ (though not necessarily vice versa), we have

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)$

We’ll do this process again but with a slightly more complicated event. If the upper bound of action $i$ exceeds that of the optimal machine, it is also the case that the maximum upper bound for action $i$ we’ve seen after the first $m$ trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round $t$ we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine $i$ (except that it’s more than $m$). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by $\overline{x}_{i,s}$ the random variable for the empirical mean after playing action $i$ a total of $s$ times, and $\overline{x}^*_s$ the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )$

Indeed, at each $t$ for which the max is greater than the min, there will be at least one pair $s,s'$ for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs $s, s'$ for which it happens. That is, we can expand the event above into the double sum which is at least as large:

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )$

We can make one other odd inequality by increasing the sum to go from $t=1$ to $\infty$. This will become clear later, but it means we can replace $t-1$ with $t$ and thus have

$\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )$

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that $\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)$. Then what can we say? Well, consider the following three events:

(1) $\displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)$
(2) $\displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)$
(3) $\displaystyle \mu^* < \mu_i + 2a(s, t)$

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability $t^{-4}$. Likewise, (2) is the event that the empirical mean payoff of action $i$ is larger than the upper confidence bound, which also occurs with probability $t^{-4}$. We will see momentarily that (3) is impossible for a well-chosen $m$ (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

$\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}$

and

$\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}$

But putting these two inequalities together gives us precisely that (3) is true:

$\mu^* < \mu_i + 2a(s,t)$

This proves the claim.

By the union bound, the probability that at least one of these events happens is $2t^{-4}$ plus whatever the probability of (3) being true is. But as we said, we’ll pick $m$ to make (3) always false. Indeed $m$ depends on which action $i$ is being played, and if $s \geq m > 8 \log T / \Delta_i^2$ then $2a(s,t) \leq \Delta_i$, and by the definition of $\Delta_i$ we have

$\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0$.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

$\square$

## Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
return math.sqrt(2 * math.log(step + 1) / numPlays)


And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
payoffSums = [0] * numActions
numPlays = [1] * numActions
ucbs = [0] * numActions

# initialize empirical sums
for t in range(numActions):
payoffSums[t] = reward(t,t)
yield t, payoffSums[t], ucbs

t = numActions

while True:
ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
action = max(range(numActions), key=lambda i: ucbs[i])
theReward = reward(action, t)
numPlays[action] += 1
payoffSums[action] += theReward

yield action, theReward, ucbs
t = t + 1


The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from $[0,1]$, biased by $1/k$ for some $5 \leq k \leq 15$. The regret and theoretical regret bound are given in the graph below.

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

## Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

# The Erdős-Rényi Random Graph

During the 1950′s the famous mathematician Paul Erdős and Alfred Rényi put forth the concept of a random graph and in the subsequent years of study transformed the world of combinatorics. The random graph is the perfect example of a good mathematical definition: it’s simple, has surprisingly intricate structure, and yields many applications.

In this post we’ll explore basic facts about random graphs, slowly detail a proof on their applications to graph theory, and explore their more interesting properties computationally (a prelude to proofs about their structure). We assume the reader is familiar with the definition of a graph, which we’ve written about at length for non-mathematical audiences, and has some familiarity with undergraduate-level probability and combinatorics for the more mathematical sections of the post. We’ll do our best to remind the reader of these prerequisites as we go, and we welcome any clarification questions in the comment section.

## The Erdős-Rényi Model

The definition of a random graph is simple enough that we need not defer it to the technical section of the article.

Definition: Given a positive integer $n$ and a probability value $0 \leq p \leq 1$, define the graph $G(n,p)$ to be the undirected graph on $n$ vertices whose edges are chosen as follows. For all pairs of vertices $v,w$ there is an edge $(v,w)$ with probability $p$.

Of course, there is no single random graph. What we’ve described here is a process for constructing a graph. We create a set of $n$ vertices, and for each possible pair of vertices we flip a coin (often a biased coin) to determine if we should add an edge connecting them. Indeed, every graph can be made by this process if one is sufficiently lucky (or unlucky), but it’s very unlikely that we will have no edges at all if $p$ is large enough. So $G(n,p)$ is really a probability distribution over the set of all possible graphs on $n$ vertices. We commit a horrendous abuse of notation by saying $G$ or $G(n,p)$ is a random graph instead of saying that $G$ is sampled from the distribution. The reader will get used to it in time.

## Why Do We Care?

Random graphs of all sorts (not just Erdős’s model) find applications in two very different worlds. The first is pure combinatorics, and the second is in the analysis of networks of all kinds.

In combinatorics we often wonder if graphs exist with certain properties. For instance, in graph theory we have the notion of graph colorability: can we color the vertices of a graph with $k$ colors so that none of its edges are monochromatic? (See this blog’s primer on graph coloring for more) Indeed, coloring is known to be a very difficult problem on general graphs. The problem of determining whether a graph can be colored with a fixed number of colors has no known efficient algorithm; it is NP-complete. Even worse, much of our intuition about graphs fails for graph coloring. We would expect that sparse-looking graphs can be colored with fewer colors than dense graphs. One naive way to measure sparsity of a graph is to measure the length of its shortest cycle (recall that a cycle is a path which starts and ends at the same vertex). This measurement is called the girth of a graph. But Paul Erdős proved using random graphs, as we will momentarily, that for any choice of integers $g,k$ there are graphs of girth $\geq g$ which cannot be colored with fewer than $k$ colors. Preventing short cycles in graphs doesn’t make coloring easier.

The role that random graphs play in this picture is to give us ways to ensure the existence of graphs with certain properties, even if we don’t know how to construct an example of such a graph. Indeed, for every theorem proved using random graphs, there is a theorem (or open problem) concerning how to algorithmically construct those graphs which are known to exist.

Pure combinatorics may not seem very useful to the real world, but models of random graphs (even those beyond the Erdős-Rényi model) are quite relevant. Here is a simple example. One can take a Facebook user $u$ and form a graph of that users network of immediate friends $N(u)$ (excluding $u$ itself), where vertices are people and two people are connected by an edge if they are mutual friends; call this the user’s friendship neighborhood. It turns out that the characteristics of the average Facebook user’s friendship neighborhood resembles a random graph. So understanding random graphs helps us understand the structure of small networks of friends. If we’re particularly insightful, we can do quite useful things like identify anomalies, such as duplicitous accounts, which deviate quite far from the expected model. They can also help discover trends or identify characteristics that can allow for more accurate ad targeting. For more details on how such an idea is translated into mathematics and code, see Modularity (we plan to talk about modularity on this blog in the near future; lots of great linear algebra there!).

Random graphs, when they exhibit observed phenomena, have important philosophical consequences. From a bird’s-eye view, there are two camps of scientists. The first are those who care about leveraging empirically observed phenomena to solve problems. Many statisticians fit into this realm: they do wonderful magic with data fit to certain distributions, but they often don’t know and don’t care whether the data they use truly has their assumed properties. The other camp is those who want to discover generative models for the data with theoretical principles. This is more like theoretical physics, where we invent an arguably computational notion of gravity whose consequences explain our observations.

For applied purposes, the Erdős-Rényi random graph model is in the second camp. In particular, if something fits in the Erdős-Rényi model, then it’s both highly structured (as we will make clear in the sequel) and “essentially random.” Bringing back the example of Facebook, this says that most people in the average user’s immediate friendship neighborhood are essentially the same and essentially random in their friendships among the friends of $u$. This is not quite correct, but it’s close enough to motivate our investigation of random graph models. Indeed, even Paul Erdős in his landmark paper mentioned that equiprobability among all vertices in a graph is unrealistic. See this survey for a more thorough (and advanced!) overview, and we promise to cover models which better represent social phenomena in the future.

So lets go ahead and look at a technical proof using random graphs from combinatorics, and then write some programs to generate random graphs.

## Girth and Chromatic Number, and Counting Triangles

As a fair warning, this proof has a lot of moving parts. Skip to the next section if you’re eager to see some programs.

Say we have a $k$ and a $g$, and we wonder whether a graph can exist which simultaneously has no cycles of length less than $g$ (the girth) and needs at least $k$ colors to color. The following theorem settles this question affirmatively.  A bit of terminology: the chromatic number of a graph $G$, denoted $\chi(G)$, is the smallest number of colors needed to properly color $G$.

Theorem: For any natural numbers $k,g$, there exist graphs of chromatic number at least $k$ and girth at least $g$.

Proof. Taking our cue from random graphs, let’s see what the probability is that a random graph $G(n,p)$ on $n$ vertices will have our desired properties. Or easier, what’s the chance that it will not have the right properties? This is essentially a fancy counting argument, but it’s nicer if we phrase it in the language of probability theory.

The proof has a few twists and turns for those uninitiated to the probabilistic method of proof. First, we will look at an arbitrary $G(n,p)$ (where $n,p$ are variable) and ask two questions about it: what is the expected number of short cycles, and what is the expected “independence number” (which we will see is related to coloring). We’ll then pick a value of $p$, depending crucially on $n$, which makes both of these expectations small. Next, we’ll use the fact that if the probability that $G(n,p)$ doesn’t have our properties is strictly less than 1, then there has to be some instance in our probability space which has those properties (if no instance had the property, then the probability would be one!). Though we will not know what the graphs look like, their existence is enough to prove the theorem.

So let’s start with cycles. If we’re given a desired girth of $g$, the expected number of cycles of length $\leq g$ in $G(n,p)$ can be bounded by $(np)^{g+1}/(np-1)$. To see this, the two main points are how to count the number of ways to pick $k$ vertices in order to form a cycle, and a typical fact about sums of powers. Indeed, we can think of a cycle of length $k$ as a way to seat a choice of $k$ people around a circular table. There are $\binom{n}{k}$ possible groups of people, and $(k-1)!$ ways to seat one group. If we fix $k$ and let $n$ grow, then the product $\binom{n}{k}(k-1)!$ will eventually be smaller than $n^k$ (actually, this happens almost immediately in almost all cases). For each such choice of an ordering, the probability of the needed edges occurring to form a cycle is $p^j$, since all edges must occur independently of each other with probability $p$.

So the probability that we get a cycle of $j$ vertices is

$\displaystyle \binom{n}{j}(j-1)!p^j$

And by the reasoning above we can bound this by $n^jp^j$. Summing over all numbers $j = 3, \dots, g$ (we are secretly using the union bound), we bound the expected number of cycles of length $\leq g$ from above:

$\displaystyle \sum_{j=3}^g n^j p^j < \sum_{j=0}^g n^j p^j = \frac{(np)^{g+1}}{np - 1}$

Since we want relatively few cycles to occur, we want it to be the case that the last quantity, $(np)^{j+1}/(np-1)$, goes to zero as $n$ goes to infinity. One trick is to pick $p$ depending on $n$. If $p = n^l$, our upper bound becomes $n^{(l+1)(g+1)} / (n^{1+l} - 1)$, and if we want the quantity to tend to zero it must be that $(l+1)(g+1) < 1$. Solving this we get that $-1 < l < \frac{1}{g+1} - 1 < 0$. Pick such an $l$ (it doesn’t matter which), and keep this in mind: for our choice of $p$, the expected number of cycles goes to zero as $n$ tends to infinity.

On the other hand, we want to make sure that such a graph has high chromatic number. To do this we’ll look at a related property: the size of the largest independent set. An independent set of a graph $G = (V,E)$ is a set of vertices $S \subset V$ so that there are no edges in $E$ between vertices of $S$. We call $\alpha(G)$ the size of the largest independent set. The values $\alpha(G)$ and $\chi(G)$ are related, because any time you have an independent set you can color all the vertices with a single color. In particular, this proves the inequality $\chi(G) \alpha(G) \geq n$, the number of vertices of $G$, or equivalently $\chi(G) \geq n / \alpha(G)$. So if we want to ensure $\chi(G)$ is large, it suffices to show $\alpha(G)$ is small (rigorously, $\alpha(G) \leq n / k$ implies $\chi(G) \geq k$).

The expected number of independent sets (again using the union bound) is at most the product of the number of possible independent sets and the probability of one of these having no edges. We let $r$ be arbitrary and look at independent sets of size $r$ Since there are $\binom{n}{r}$ sets and each has a probability $(1-p)^{\binom{r}{2}}$ of being independent, we get the probability that there is an independent set of size $r$ is bounded by

$\displaystyle \textup{P}(\alpha(G) \geq r) \leq \binom{n}{r}(1-p)^{\binom{r}{2}}$

We use the fact that $1-x < e^{-x}$ for all $x$ to translate the $(1-p)$ part. Combining this with the usual $\binom{n}{r} \leq n^r$ we get the probability of having an independent set of size $r$ at most

$\displaystyle \textup{P}(\alpha(G) \geq r) \leq \displaystyle n^re^{-pr(r-1)/2}$

Now again we want to pick $r$ so that this quantity goes to zero asymptotically, and it’s not hard to see that $r = \frac{3}{p}\log(n)$ is good enough. With a little arithmetic we get the probability is at most $n^{(1-a)/2}$, where $a > 1$.

So now we have two statements: the expected number of short cycles goes to zero, and the probability that there is an independent set of size at least $r$ goes to zero. If we pick a large enough $n$, then the expected number of short cycles is less than $n/5$, and using Markov’s inequality we see that the probability that there are more than $n/2$ cycles of length at most $g$ is strictly less than 1/2. At the same time, if we pick a large enough $n$ then $\alpha(G) \geq r$ with probability strictly less than 1/2. Combining these two (once more with the union bound), we get

$\textup{P}(\textup{at least } n/2 \textup{ cycles of length } \leq g \textup{ and } \alpha(G) \geq r) < 1$

Now we can conclude that for all sufficiently large $n$ there has to be a graph on at least $n$ vertices which has neither of these two properties! Pick one and call it $G$. Now $G$ still has cycles of length $\leq g$, but we can fix that by removing a vertex from each short cycle (it doesn’t matter which). Call this new graph $G'$. How does this operation affect independent sets, i.e. what is $\alpha(G')$? Well removing vertices can only decrease the size of the largest independent set. So by our earlier inequality, and calling $n'$ the number of vertices of $G'$, we can make a statement about the chromatic number:

$\displaystyle \chi(G') \geq \frac{n'}{\alpha(G')} \geq \frac{n/2}{\log(n) 3/p} = \frac{n/2}{3n^l \log(n)} = \frac{n^{1-l}}{6 \log(n)}$

Since $-1 < l < 0$ the numerator grows asymptotically faster than the denominator, and so for sufficiently large $n$ the chromatic number will be larger than any $k$ we wish. Hence we have found a graph with girth at least $g$ and chromatic number at least $k$.

$\square$

## Connected Components

The statistical properties of a random graph are often quite easy to reason about. For instance, the degree of each vertex in $G(n,p)$ is $np$ in expectation. Local properties like this are easy, but global properties are a priori very mysterious. One natural question we can ask in this vein is: when is $G(n,p)$ connected? We would very much expect the answer to depend on how $p$ changes in relation to $n$. For instance, $p$ might look like $p(n) = 1/n^2$ or $\log(n) / n$ or something similar. We could ask the following question:

As $n$ tends to infinity, what limiting proportion of random graphs $G(n,p)$ are connected?

Certainly for some $p(n)$ which are egregiously small (for example, $p(n) = 0$), the answer will be that no graphs are connected. On the other extreme, if $p(n) = 1$ then all graphs will be connected (and complete graphs, at that!). So our goal is to study the transition phase between when the graphs are disconnected and when they are connected. A priori this boundary could be a gradual slope, where the proportion grows from zero to one, or it could be a sharp jump. Next time, we’ll formally state and prove the truth, but for now let’s see if we can figure out which answer to expect by writing an exploratory program.

We wrote the code for this post in Python, and as usual it is all available for download on this blog’s Github page.

We start with a very basic Node class to represent each vertex in a graph, and a function to generate random graphs

import random
class Node:
def __init__(self, index):
self.index = index
self.neighbors = []

def __repr__(self):
return repr(self.index)

def randomGraph(n,p):
vertices = [Node(i) for i in range(n)]
edges = [(i,j) for i in xrange(n) for j in xrange(i) if random.random() < p]

for (i,j) in edges:
vertices[i].neighbors.append(vertices[j])
vertices[j].neighbors.append(vertices[i])

return vertices


The randomGraph function simply creates a list of edges chosen uniformly at random from all possible edges, and then constructs the corresponding graph. Next we have a familiar sight: the depth-first search. We use it to compute the graph components one by one (until all vertices have been found in some run of a DFS).

def dfsComponent(node, visited):
for v in node.neighbors:
if v not in visited:
dfsComponent(v, visited)

def connectedComponents(vertices):
components = []
cumulativeVisited = set()

for v in vertices:
if v not in cumulativeVisited:
componentVisited = set([v])
dfsComponent(v, componentVisited)

components.append(componentVisited)
cumulativeVisited |= componentVisited

return components


The dfsComponent function simply searches in breadth-first fashion, adding every vertex it finds to the “visited” set.  The connectedComponents function keeps track of the list of components found so far, as well as the cumulative set of all vertices found in any run of bfsComponent. Hence, as we iterate through the vertices we can ignore vertices we’ve found in previous runs of bfsComponent. The “x |= y” notation is python shorthand for updating x via a union with y.

Finally, we can make a graph of the largest component of (independently generated) random graphs as the probability of an edge varies.

def sizeOfLargestComponent(vertices):
return max(len(c) for c in connectedComponents(vertices))

def graphLargestComponentSize(n, theRange):
return [(p, sizeOfLargestComponent(randomGraph(n, p))) for p in theRange]


Running this code and plotting it for $p$ varying from zero to 0.5 gives the following graph.

The blue line is the size of the largest component, and the red line gives a moving average estimate of the data.  As we can see, there is a very sharp jump peaking at $p=0.1$ at which point the whole graph is connected. It would appear there is a relatively quick “phase transition” happening here. Let’s zoom in closer on the interesting part.

It looks like the transition begins around 0.02, and finishes at around 0.1. Interesting… Let’s change the parameters a bit, and increase the size of the graph. Here’s the same chart (in the same range of $p$ values) for a graph with a hundred vertices.

Now the phase transition appears to have shifted to about $(0.01, 0.05)$, which is about multiplying the endpoints of the phase transition interval above by 1/2. The plot thickens… Once more, let’s move up to a graph on 500 vertices.

Again it’s too hard to see, so let’s zoom in.

This one looks like the transition starts at 0.002 and ends at 0.01. This is a 5-fold decrease from the previous one, and we increased the number of vertices by 5. Could this be a pattern? Here’s a conjecture to formalize it:

Conjecture: The random graph $G(n,p)$ enters a phase transition at $p=1/n$ and becomes connected almost surely at $p=5/n$.

This is not quite rigorous enough to be a true conjecture, but it sums up our intuition that we’ve learned so far. Just to back this up even further, here’s an animation showing the progression of the phase transition as $n = 20 \dots 500$ in steps of twenty. Note that the $p$ range is changing to maintain our conjectured window.

Looks pretty good. Next time we’ll see some formal mathematics validating our intuition (albeit reformulated in a nicer way), and we’ll continue to investigate other random graph models.

Until then!

# Reservoir Sampling

Problem: Given a data stream of unknown size $n$, pick an entry uniformly at random. That is, each entry has a $1/n$ chance of being chosen.

Solution: (in Python)

import random

def reservoirSample(stream):
for k,x in enumerate(stream, start=1):
if random.random() < 1.0 / k:
chosen = x

return chosen

Discussion: This is one of many techniques used to solve a problem called reservoir sampling. We often encounter data sets that we’d like to sample elements from at random. But with the advent of big data, the lists involved are so large that we either can’t fit it all at once into memory or we don’t even know how big it is because the data is in the form of a stream (e.g., the number of atomic price movements in the stock market in a year). Reservoir sampling is the problem of sampling from such streams, and the technique above is one way to achieve it.

In words, the above algorithm holds one element from the stream at a time, and when it inspects the $k$-th element (indexing $k$ from 1), it flips a coin of bias $1/k$ to decide whether to keep its currently held element or to drop it in favor of the new one.

We can prove quite easily by induction that this works. Indeed, let $n$ be the (unknown) size of the list, and suppose $n=1$. In this case there is only one element to choose from, and so the probability of picking it is 1. The case of $n=2$ is similar, and more illustrative. Now suppose the algorithm works for $n$ and suppose we increase the size of the list by 1 adding some new element $y$ to the end of the list. For any given $x$ among the first $n$ elements, the probability we’re holding $x$ when we  inspect $y$ is $1/n$ by induction. Now we flip a coin which lands heads with probability $1/(n+1)$, and if it lands heads we take $y$ and otherwise we keep $x$. The probability we get $y$ is exactly $1/(n+1)$, as desired, and the probability we get $x$ is $\frac{1}{n}\frac{n}{n+1} = \frac{1}{n+1}$. Since $x$ was arbitrary, this means that after the last step of the algorithm each entry is held with probability $1/(n+1)$.

$\square$

It’s easy to see how one could increase the number of coins being flipped to provide a sampling algorithm to pick any finite number of elements (with replacement, although a variant without replacement is not so hard to construct using this method). Other variants, exist, such as distributed and weighted sampling.

Python’s generators make this algorithm for reservoir sampling particularly nice. One can define a generator which abstractly represents a data stream (perhaps querying the entries from files distributed across many different disks), and this logic is hidden from the reservoir sampling algorithm. Indeed, this algorithm works for any iterable, although if we knew the size of the list we could sample much faster (by uniformly generating a random number and indexing the list appropriately). The start parameter given to the enumerate function makes the $k$ variable start at 1.