# Optimism in the Face of Uncertainty: the UCB1 Algorithm

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

## Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of $K$ actions labeled by the integers $\left \{ 1, 2, \dots, K \right \}$. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action $i$ be drawn from a fixed distribution $D_i$ (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm $A$ and a set of actions $\left \{1, 2, \dots, K \right \}$, the cumulative regret of $A$ in rounds $1, \dots, T$ is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of $A$ for the first $T$ rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

## Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers $\left \{ 1, \dots, K \right \}$. The reward of action $i$ is a $[0,1]$-valued random variable $X_i$ distributed according to an unknown distribution and possessing an unknown expected value $\mu_i$. The game progresses in rounds $t = 1, 2, \dots$ so that in each round we have different random variables $X_{i,t}$ for the reward of a single action $i$ in round $t$. The $X_{i,t}$ are independent as both $t$ and $i$ vary, although when $i$ varies the distribution changes.

So if we were to play action 2 over and over for $T$ rounds, then the total payoff would be the random variable $G_2(T) = \sum_{t=1}^T X_{2,t}$. But by independence across rounds and the linearity of expectation, the expected payoff is just $\mu_2 T$. So we can describe the best action as the action with the highest expected payoff. Define

$\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i$

We call the action which achieves the maximum $i^*$.

A policy is a randomized algorithm $A$ which picks an action in each round based on the history of chosen actions and observed rewards so far. Define $I_t$ to be the action played by $A$ in round $t$ and $P_i(n)$ to be the number of times we’ve played action $i$ in rounds $1 \leq t \leq n$. These are both random variables. Then the cumulative payoff for the algorithm $A$ over the first $T$ rounds, denoted $G_A(T)$, is just

$\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}$

and its expected value is simply

$\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T))$.

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy $A$ after the first $T$ steps, denoted $R_A(T)$ can be written as

$\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)$

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

$\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T))$.

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than $\Omega(\sqrt{KT})$. It is also known that there are algorithms that guarantee no worse than $O(\sqrt{KT})$ expected regret. The algorithm we’ll see in the next section, however, only guarantees $O(\sqrt{KT \log T})$. We present it on this blog because of its simplicity and ubiquity in the field.

## The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action $\mu_i$ is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose $Y_1, \dots, Y_n$ are independent random variables whose values lie in $[0,1]$ and whose expected values are $\mu_i$. Call $Y = \frac{1}{n}\sum_{i}Y_i$ and $\mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i$. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of $Y$ deviates from its mean. Specifically,

$\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}$

For us, the $Y_i$ will be the payoff variables for a single action $j$ in the rounds for which we choose action $j$. Then the variable $Y$ is just the empirical average payoff for action $j$ over all the times we’ve tried it. Moreover, $a$ is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for $a$ to find an upper bound big enough to be confident that we’re within $a$ of the true mean.

Indeed, if we call $n_j$ the number of times we played action $j$ thus far, then $n = n_j$ in the equation above, and using $a = a(j,T) = \sqrt{2 \log(T) / n_j}$ we get that $\textup{P}(Y > \mu + a) \leq T^{-4}$, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action $j$, its upper bound $a$ grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound $a(j, T)$ is getting bigger for actions we haven’t played; any round in which we play an action $j$, it must be that $a(j, T+1) = a(j,T)$, although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the $K$ actions once, giving initial values for empirical mean payoffs $\overline{x}_i$ of each action $i$.
For each round $t = K, K+1, \dots$:

Let $n_j$ represent the number of times action $j$ was played so far.
Play the action $j$ maximizing $\overline{x}_j + \sqrt{2 \log t / n_j}$.
Observe the reward $X_{j,t}$ and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means $x_j$ change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with $K$ actions, each of whose reward distribution $X_{i,t}$ has values in [0,1]. Then its expected cumulative regret after $T$ rounds is at most $O(\sqrt{KT \log T})$.

Actually, we’ll prove a more specific theorem. Let $\Delta_i$ be the difference $\mu^* - \mu_i$, where $\mu^*$ is the expected payoff of the best action, and let $\Delta$ be the minimal nonzero $\Delta_i$. That is, $\Delta_i$ represents how suboptimal an action is and $\Delta$ is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret $\mathbb{E}(R_{\textup{UCB1}}(T))$ is at most

$\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )$

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if $\Delta_i$ is small we will require more tries to know that action $i$ is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the $1 + \pi^2 / 3$ part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the $O(\sqrt{KT \log T})$ bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function $R(\Delta_1, \dots, \Delta_K)$ (excluding of course $\Delta^*$), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

$\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$

and it’s zero if and only if for each $i$, $\Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T})$. However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the $\Delta_i = O(\sqrt{\log T})$ (which are all the same) gives a total bound of $O(K \sqrt{\log T})$. If we look at the only possible endpoint (the $\Delta_i = 1$), then we get a local maximum of $O(K \sqrt{\log T})$. But this isn’t the $O(\sqrt{KT \log T})$ we promised, what gives? Well, this upper bound grows arbitrarily large as the $\Delta_i$ go to zero. But at the same time, if all the $\Delta_i$ are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the $\Delta_i = \Delta$ are the same, then another trivial regret bound is $\Delta T$ (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at $\Delta = O(\sqrt{(K \log T) / T})$. Hence the regret bound at the switch is $\Delta T = O(\sqrt{KT \log T})$, as desired.

## Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on $P_i(T)$, the expected number of times UCB chooses an action up to round $T$. Using the $\Delta$ notation, the regret is then just $\sum_i \Delta_i \mathbb{E}(P_i(T))$, and bounding the $P_i$‘s will bound the regret.

Recall the notation for our upper bound $a(j, T) = \sqrt{2 \log T / P_j(T)}$ and let’s loosen it a bit to $a(y, T) = \sqrt{2 \log T / y}$ so that we’re allowed to “pretend” a action has been played $y$ times. Recall further that the random variable $I_t$ has as its value the index of the machine chosen. We denote by $\chi(E)$ the indicator random variable for the event $E$. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., $\overline{x}^*$ is the empirical mean of the optimal action).

Indeed for any action $i$, the only way we know how to write down $P_i(T)$ is as

$\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)$

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action $i$. Now we’re just going to pull some number $m-1$ of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than $m$ times overall, this requires an inequality.

$P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)$

These indicator functions should be read as sentences: we’re just saying that we’re picking action $i$ in round $t$ and we’ve already played $i$ at least $m$ times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action $i$ in round $t$ means that the upper bound for action $i$ exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and $i$ might coincide with the best action, but that’s fine). In notation this event is

$\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)$

Denote the upper bound $\overline{x}_i + a(i,t)$ for action $i$ in round $t$ by $U_i(t)$. Since this event must occur every time we pick action $i$ (though not necessarily vice versa), we have

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)$

We’ll do this process again but with a slightly more complicated event. If the upper bound of action $i$ exceeds that of the optimal machine, it is also the case that the maximum upper bound for action $i$ we’ve seen after the first $m$ trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round $t$ we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine $i$ (except that it’s more than $m$). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by $\overline{x}_{i,s}$ the random variable for the empirical mean after playing action $i$ a total of $s$ times, and $\overline{x}^*_s$ the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )$

Indeed, at each $t$ for which the max is greater than the min, there will be at least one pair $s,s'$ for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs $s, s'$ for which it happens. That is, we can expand the event above into the double sum which is at least as large:

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )$

We can make one other odd inequality by increasing the sum to go from $t=1$ to $\infty$. This will become clear later, but it means we can replace $t-1$ with $t$ and thus have

$\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )$

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that $\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)$. Then what can we say? Well, consider the following three events:

(1) $\displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)$
(2) $\displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)$
(3) $\displaystyle \mu^* < \mu_i + 2a(s, t)$

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability $t^{-4}$. Likewise, (2) is the event that the empirical mean payoff of action $i$ is larger than the upper confidence bound, which also occurs with probability $t^{-4}$. We will see momentarily that (3) is impossible for a well-chosen $m$ (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

$\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}$

and

$\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}$

But putting these two inequalities together gives us precisely that (3) is true:

$\mu^* < \mu_i + 2a(s,t)$

This proves the claim.

By the union bound, the probability that at least one of these events happens is $2t^{-4}$ plus whatever the probability of (3) being true is. But as we said, we’ll pick $m$ to make (3) always false. Indeed $m$ depends on which action $i$ is being played, and if $s \geq m > 8 \log T / \Delta_i^2$ then $2a(s,t) \leq \Delta_i$, and by the definition of $\Delta_i$ we have

$\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0$.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

$\square$

## Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
return math.sqrt(2 * math.log(step + 1) / numPlays)


And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
payoffSums = [0] * numActions
numPlays = [1] * numActions
ucbs = [0] * numActions

# initialize empirical sums
for t in range(numActions):
payoffSums[t] = reward(t,t)
yield t, payoffSums[t], ucbs

t = numActions

while True:
ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
action = max(range(numActions), key=lambda i: ucbs[i])
theReward = reward(action, t)
numPlays[action] += 1
payoffSums[action] += theReward

yield action, theReward, ucbs
t = t + 1


The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from $[0,1]$, biased by $1/k$ for some $5 \leq k \leq 15$. The regret and theoretical regret bound are given in the graph below.

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

## Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

# The Universal Properties of Map, Fold, and Filter

A lot of people who like functional programming often give the reason that the functional style is simply more elegant than the imperative style. When compelled or inspired to explain (as I did in my old post, How I Learned to Love Functional Programming), they often point to the three “higher-order” functions map, fold, and filter, as providing a unifying framework for writing and reasoning about programs. But how unifying are they, really? In this post we’ll give characterizations of these functions in terms of universal properties, and we’ll see that in fact fold is the “most” universal of the three, and its natural generalization gives a characterization of transformations of standard compound data types.

By being universal or having a universal property, we don’t mean that map, fold, and filter are somehow mystically connected to all programming paradigms. That might be true, but it’s not the point of this article. Rather, by saying something has a universal property we are making a precise mathematical statement that it is either an initial or final object (or the unique morphism determined by such) in some category.

That means that, as a fair warning to the reader, this post assumes some basic knowledge of category theory, but no more than what we’ve covered on this blog in the past. Of particular importance for the first half of this post is the concept of a universal property, and in the followup post we’ll need some basic knowledge of functors.

## Map, Filter, and Fold

Recalling their basic definitions, map is a function which accepts as input a list $L$ whose elements all have the same type (call it $A$), and a function $f$ which maps $A$ to another type $B$. Map then applies $f$ to every entry of $L$ to produce a new list whose entries all have type $B$.

In most languages, implementing the map function is an elementary exercise. Here is one possible definition in ML.

fun map(_, nil) = nil


Next, filter takes as input a boolean-valued predicate $p : A \to \textup{bool}$ on types $A$ and a list $L$ whose entries have type $A$, and produces a list of those entries of $L$ which satisfy the predicate. Again, it’s implementation in ML might be:

fun filter(_, nil) = nil
else filter(p, tail)


Finally, fold is a function which “reduces” a list $L$ of entries with type $A$ down to a single value of type $B$. It accepts as input a function $f : A \times B \to B$, and an initial value $v \in B$, and produces a value of type $B$ by recursively applying $f$ as follows:

fun fold(_, v, nil) = v


(If this definition is mysterious, you’re probably not ready for the rest of this post.)

One thing that’s nice about fold is that we can define map and filter in terms of it:

fun map(f, L) = fold((fn x, xs => (f(x):xs)), [], L)
fun filter(p, L) = fold((fn x, xs => if p(x) then x:xs else xs end), [], L)


We’ll see that this is no accident: fold happens to have the “most general” universality of the three functions, but as a warm-up and a reminder we first investigate the universality of map and filter.

## Free Monoids and Map

Map is the easiest function of the three to analyze, but to understand it we need to understand something about monoids. A monoid is simple to describe, it’s just a set $M$ with an associative binary operation denoted by multiplication and an identity element for that operation.

monoid homomoprhism between two monoids $M, N$ is a function $f: M \to N$ such that $f(xy) = f(x)f(y)$. Here the multiplication on the left hand side of the equality is the operation from $M$ and on the right it’s the one from $N$ (this is the same definition as a group homomorphism). As is to be expected, monoids with monoid homomorphisms form a category.

We encounter simple monoids every day in programming which elucidate this definition. Strings with the operation of string concatenation form a monoid, and the empty string acts as an identity because concatenating a string to the empty string has no effect. Similarly, lists with list concatenation form a monoid, where the identity is the empty list. A nice example of a monoid homomorphism is the length function. We know it’s a homomorphism because the length of a concatenation of two lists is just the sum of the lengths of the two lists.

Integers also form a monoid, with addition as the operation and zero as the identity element. However, the list and string monoids have an extra special property that integers do not. For a number $n$ you can always find $-n$ so that $n + (-n) = 0$ is the identity element. But for lists, the only way to concatenate two lists and get the empty list is if both of the lists were empty to begin with. A monoid with this property is called free, and to fully understand it we should make a definition which won’t seem to mean anything at first.

Definition: Let $\mathbf{C}$ be a category. Given a set $A$, the free object over $A$, denoted $F(A)$, is an object of $\mathbf{C}$ which is universal with respect to set-maps $A \to B$ for any object $B$ in $\mathbf{C}$.

As usual with a definition by universal property, we should elaborate as to exactly what’s going on. Let $\mathbf{C}$ be a category whose objects are sets and let $A$ a set, possibly not in this category. We can construct a new category, $\mathbf{C}_{A \to X}$ whose objects are set-maps

$\displaystyle f: A \to X$

and whose morphisms are commutative diagrams of the form

where $\varphi$ is a morphism in $\mathbf{C}$. In other words, we simply require that $\varphi(f_1(a)) = f_2(a)$ for every $a \in A$.

By saying that the free monoid on $A$ satisfies this universal property for the category of monoids, we really mean that it is initial in this category of set-maps and commutative diagrams. That is, there is a monoid $F(A)$ and a set-map $i_A: A \to F(A)$ so that for every monoid $Y$ and set-map $f: A \to Y$ there is a unique monoid homomorphism from $i_A$ to $f$. In other words, there is a unique monoid homomorphism $\varphi$ making the following diagram commute:

For example, if $A$ is the set of all unicode characters, then $F(A)$ is precisely the monoid of all strings on those characters. The map $i_A$ is then the map taking a character $a$ to the single-character string “$a$“. More generally, if $T$ is any type in the category of types and computable functions, then $F(T)$ is the type of homogeneous lists whose elements have type $T$.

We will now restrict our attention to lists and types, and we will denote the free (list) monoid on a type $A$ as $[A]$. The universal property of map is then easy to describe, it’s just a specific instance of this more general universal property for the free monoids. Specifically, we work in the sub-category of homogeneous list monoids (we only allow objects which are $[T]$ for some type $T$). The morphism $i_A: A \to [A]$ just takes a value $a$ of type $A$ and spits out a single-item list $[a]$. We might hope that for any mapping $A \to [B]$, there is a unique monoid homomorphism $[A] \to [B]$ making the following diagram commute.

And while this is true, the diagram lies because “map” does not achieve what $\varphi$ does. The map $f$ might send all of $A$ to the empty list, and this would cause $\varphi$ to be the trivial map. But if we decomposed $f$ further to require it to send $a$ to a single $b$, such as

then things work out nicely. In particular, specifying a function $f: A \to B$ uniquely defines how a list homomorphism operates on lists. In particular, the diagram forces the behavior for single-element lists: $[a] \mapsto [f(a)]$. And the property of $\varphi$ being a monoid homomorphism forces $\varphi$ to preserve order.

Indeed, we’ve learned from this analysis that the structure of the list involved is crucial in forming the universal property. Map is far from the only computable function making the diagram commute, but it is clearly the only monoid homomorphism. As we’ll see, specifying the order of operations is more generally described by fold, and we’ll be able to restate the universal property of map without relying on monoid homomorphisms.

## Filter

The filter function is a small step up in complexity from map, but its universal property is almost exactly the same. Again filter can be viewed through the lens of a universal property for list monoids, because filter itself takes a predicate and produces a list monoid. We know this because filtering two lists by the same predicate and concatenating them is the same as concatenating them first and then filtering them. Indeed, the only difference here is that the diagram now looks like this

where $p$ is our predicate, and $j_A$ is defined by $(a, T) \mapsto [a]$ and $(a,F) \mapsto []$. The composition $j_A \circ p$ gives the map $A \to [A]$ that we can extend uniquely to a monoid homomorphism $[A] \to [A]$. We won’t say any more on it, except to mention that again maintaining the order of the list is better described via fold.

## Fold, and its Universal Property

Fold is different from map and filter in a very concrete way. Even though map and filter do specify that the order of the list should be preserved, it’s not an important part of their definition: filter should filter, and map should map, but no interaction occurs between different entries during the computation. In a sense, we got lucky that having everything be monoid homomorphisms resulted in preserving order without our specific intention.

Because it doesn’t necessarily produce lists and the operation can be quite weird, fold cannot exist without an explicit description of the order of operations. Let’s recall the definition of fold, and stick to the “right associative” order of operations sometimes specified by calling the function “foldr.” We will use foldr and fold interchangeably.

fun foldr(_, v, nil) = v


Even more, we can’t talk about foldr in terms of monoids. Even after fixing $f$ and $v$, foldr need not produce a monoid homomorphism. So if we’re going to find a universal property of foldr, we’ll need a more general categorical picture.

One first try would be to view foldr as a morphism of sets

$\displaystyle B \times \textup{Hom}(A \times B, B) \to \textup{Hom}([A], B)$

The mapping is just $f,b \mapsto \textup{foldr } f \textup{ } b$, and this is just the code definition we gave above. One might hope that this mapping defines an isomorphism in the category of types and programs, but it’s easy to see that it does not. For example, let

$A = \left \{ 1 \right \}, B = \left \{ 1,2 \right \}$

Then the left hand side of the mapping above is a set of size 8 (there are eight ways to combine a choice of element in $B$ with a map from $A \times B \to B$). But the right hand size is clearly infinite. In fact, it’s uncountably infinite, though not all possible mappings are realizable in programs of a reasonable length (in fact, very few are). So the morphism can’t possibly be a surjective and hence is not an isomorphism.

So what can we say about fold? The answer is a bit abstract, but it works out nicely.

Say we fix a type for our lists, $A$. We can define a category $\mathbf{C}_A$ which has as objects the following morphisms

By $1$ we mean the type with one value (null, if you wish), and $f$ is a morphism from a coproduct (i.e. there are implicit parentheses around $A \times B$). As we saw in our post on universal properties, a morphism from a coproduct is the same thing as a pair of functions which operates on each piece. Here one operates on $1$ and the other on $A \times B$. Since morphisms from $1$ are specified by the image of the sole value $f(1) = b$, we will often write such a function as $b \amalg h$, where $h: A \times B \to B$.

Now the morphisms in $\mathbf{C}_A$ are the pairs of morphisms $\varphi, \overline{\varphi}$ which make the following diagram commute:

Here we write $\overline{\varphi}$ because it is determined by $\varphi$ in a canonical way. It maps $1 \to 1$ in the only way one can, and it maps $(a,b) \mapsto (a, \varphi(b))$. So we’re really specifying $\varphi$ alone, but as we’ll see it’s necessary to include $\overline{\varphi}$ as well; it will provide the “inductive step” in the definition of fold.

Now it turns out that fold satisfies the universal property of being initial in this category. Well, not quite. We’re saying first that the following object $\mathscr{A}$ is initial,

Where cons is the list constructor $A \times [A] \to [A]$ which sends $(a_0, [a_1, \dots, a_n]) \mapsto [a_0, a_1, \dots, a_n]$. By being initial, we mean that for any object $\mathscr{B}$ given by the morphism $b \amalg g: 1 \coprod A \times B \to B$, there is a unique morphism from $\mathscr{A} \to \mathscr{B}$. The “fold” is precisely this unique morphism, which we denote by $(\textup{fold g, b})$. We implicitly know its “barred” counterpart making the following diagram commute.

This diagram has a lot going on in it, so let’s go ahead and recap. The left column represents an object $\mathscr{A}$ we’re claiming is initial in this crazy category we’ve made. The right hand side is an arbitrary object $\mathscr{B}$, which is equivalently the data of an element $b \in B$ and a mapping $g: A \times B \to B$. This is precisely the data needed to define fold. The dashed lines represent the unique morphisms making the diagram commute, whose existence and uniqueness is the defining property of what it means for an object to be initial in this category. Finally, we’re claiming that foldr, when we fix $g$ and $b$, makes this diagram commute, and is hence the very same unique morphism we seek.

To prove all of this, we need to first show that the object $\mathscr{A}$ is initial. That is, that any two morphisms we pick from $\mathscr{A} \to \mathscr{B}$ must be equal. The first thing to notice is that the two objects $1 \coprod A \times [A]$ and $[A]$ are really the same thing. That is, $\textup{nil} \amalg \textup{cons}$ has a two-sided inverse which makes it an isomorphism in the category of types. Specifically, the inverse is the map $\textup{split}$ sending

$\textup{nil} \mapsto \textup{nil}$
$[a_1, \dots, a_n] \mapsto (a_1, [a_2, \dots, a_n])$

So if we have a morphism $\varphi, \overline{\varphi}$ from $\mathscr{A} \to \mathscr{B}$, and the diagram commutes, we can see that $\varphi = (b \amalg g) \circ \overline{\varphi} \circ \textup{split}$. We’re just going the long way around the diagram.

Supposing that we have two such morphisms $\varphi_1, \varphi_2$, we can prove they are equal by induction on the length of the input list. It is trivial to see that they both must send the empty list to $b$. Now suppose that for lists of length $n-1$ the two are equal. Given a list $[a_1, \dots, a_n]$ we see that

$\displaystyle \varphi_1([a_1, \dots, a_n]) = \varphi_1 \circ \textup{cons} (a_1, [a_2, \dots, a_n]) = g \circ \overline{\varphi_1} (a_1, [a_2, \dots, a_n])$

where the last equality holds by the commutativity of the diagram. Continuing,

$\displaystyle g \circ \overline{\varphi_1} (a_1, [a_2, \dots, a_n]) = g (a_1, \varphi_1([a_2, \dots, a_n])) = g(a_1, \varphi_2(a_2, \dots, a_n))$

where the last equality holds by the inductive hypothesis. From here, we can reverse the equalities using $\varphi_2$ and it’s “barred” version to get back to $\varphi_2([a_1, \dots, a_n])$, proving the equality.

To show that fold actually makes the diagram commute is even simpler. In order to make the diagram commute we need to send the empty list to $b$, and we need to inductively send $[a_1, \dots, a_n]$ to $g(a_1, (\textup{fold g b})([a_2, \dots, a_n]))$, but this is the very definition of foldr!

So we’ve established that fold has this universal property. It’s easy now to see how map and filter fit into the picture. For mapping types $A$ to $C$ via $f$, just use the type $[C]$ in place of $B$ above, and have $g(a, L) = \textup{cons}(f(a), L)$, and have $b$ be the empty list. Filter similarly just needs a special definition for $g$.

A skeptical reader might ask: what does all of this give us? It’s a good question, and one that shouldn’t be taken lightly because I don’t have an immediate answer. I do believe that with some extra work we could use universal properties to give a trivial proof of the third homomorphism theorem for lists, which says that any function expressible as both a foldr and a foldl can be expressed as a list homomorphism. The proof would involve formulating a universal property for foldl, which is very similar to the one for foldr, and attaching the diagrams in a clever way to give the universal property of a monoid homomorphism for lists. Caveat emptor: this author has not written such a proof, but it seems likely that it would work.

More generally, any time we can represent the requirements of a list computation by an object like $\mathscr{B}$, we can represent the computation as a foldr. What good does this do us? Well, it might already be obvious when you can and can’t use fold. But in addition to proving when it’s possible to use fold, this new perspective generalizes very nicely to give us a characterization of arbitrary computations on compound data types. One might want to know when to perform fold-like operations on trees, or other sufficiently complicated beasts, and the universal property gives us a way to see when such a computation is possible.

That’s right, I said it: there’s more to the world than lists. Shun me if you must, but I will continue dream of great things.

In an effort to make the egregiously long posts on this blog slightly shorter, we’ll postpone our generalization of the universal property of fold until next time. There we’ll define “initial algebras” and show how to characterize “fold-like” computations any compound data type.

Until then!

# Anti-Coordination Games and Stable Graph Colorings

## My First Paper

I’m pleased to announce that my first paper, titled “Anti-Coordination Games and Stable Colorings,” has been accepted for publication! The venue is the Symposium on Algorithmic Game Theory, which will take place in Aachen, Germany this October. A professor of mine once told me that everyone puts their first few publications on a pedestal, so I’ll do my best to keep things down to earth by focusing on the contents of the paper and not my swirling cocktail of pride. The point of this post is to explain the results of my work to a non-research-level audience; the research level folks will likely feel more comfortable reading the paper itself. So here we’ll spend significantly longer explaining the proofs and the concepts, and significantly less time on previous work.

I will assume familiarity with basic graph theory (we have a gentle introduction to that here) and NP-completeness proofs (again, see our primer). We’ll give a quick reminder about the latter when we get to it.

## Anti-Coordination Games on Graphs

The central question in the paper is how to find stable strategy profiles for anti-coordination games played on graphs. This section will flush out exactly what all of that means.

The easiest way to understand the game is in terms of fashion. Imagine there is a group of people. Every day they choose their outfits individually and interact with their friends. If any pair of friends happens to choose the same clothing, then they both suffer some embarrassment. We can alternatively say that whenever two friends anti-coordinate their outfits, they each get some kind of reward. If not being embarrassed is your kind of reward, then these really are equivalent. Not every pair of people are friends, so perhaps the most important aspect of this problem is how the particular friendship network considered affects their interactions. This kind of game is called an anti-coordination game, and the network of friends makes it a “game on a graph.” We’ll make this more rigorous shortly.

We can ask questions like, if everyone is acting independently and greedily will their choices converge over time to a single choice of outfit? If so how quickly? How much better could a centralized fashion-planner who knows the entire friendship network fare in choosing outfits? Is the problem of finding a best strategy for picking outfits computationally hard? What if some pairs of people want to coordinate their outfits and others don’t? What if caring about another’s fashion is only one-sided in some cases?

Already this problem is rooted in the theory of social networks, but the concept of an anti-coordination game played on a graph is quite broad, and the relevance of this model to the real world comes from the generality of a graph. For example, one may consider the trading networks of various countries; in this case not all countries are trading partners, and it is beneficial to produce different commodities than your trading partners so that you actually benefit from the interaction. Likewise, neighboring radio towers want to emit signals on differing wavelengths to minimize interference, and commuters want to pick different roadways to minimize traffic. These are all examples of this model which we’re about to formalize.

In place of our “network of friends,” the game entails a graph $G = (V,E)$ in which each player is represented by a vertex, and there is an edge between two vertices whenever the corresponding players are trying to anti-coordinate. We will use the terms player and vertex interchangeably. For now the graph is undirected, but later we will relax this assumption and work with directed graphs. In place of “outfits” we’ll have a generic set of strategies denoted by the numbers $1, \dots, k$, and each vertex will choose a strategy from this set. In one round of the game, each vertex $v$ chooses a strategy, and this defines a function $f : V \to \left \{ 1, \dots, k \right \}$ from the set of vertices to the set of strategies. Then the payoff of a vertex $v$ in a round, which we denote $\mu_f(v)$, is the number of neighbors of $v$ which have chosen a different strategy than $v$. That is, it is

$\displaystyle \mu_f(v) = \sum_{(v,w) \in E} \mathbf{1}_{\left \{ f(v) \neq f(w) \right \}}$

Where $\mathbf{1}_{A}$ denotes the indicator function for the event $A$, which assumes a value of 1 when the event occurs and 0 otherwise. Here is an example of an instance of the game. We have three strategies, denoted by colors, and the payoff for the vertex labeled $v$ is three.

If this game is played over many many rounds, we can ask if there is a so-called Nash equilibrium. That is, is there a choice of strategies for the players so that no single player will have an incentive to change in future rounds, assuming nobody else changes? We restrict even further to thinking about pure strategy Nash equilibria, which means there are no probabilistic choices made in choosing a strategy. Equivalently, a pure strategy equilibrium is just a choice of a strategy for each vertex which doesn’t change across rounds. In terms of the graph, we call a strategy function $f$ which is a Nash equilibrium a stable equilibrium (or, as will be made clear in the next paragraph, a stable coloring). It must satisfy the property that no vertex can increase its payoff by switching to a different strategy. So our question now becomes: how can we find a stable coloring which as good as possible for all players involved? Slightly more generally, we call a Nash equilibrium a strictly stable equilibrium (or a strictly stable coloring) if every vertex would strictly decrease its payoff by switching to another strategy. As opposed to a plain old stable coloring where one could have the same payoff by switching strategies, if any player tries to switch strategy then it will get a necessarily worse payoff. Though it’s not at all clear now, we will see that this distinction is the difference between computational tractability and infeasibility.

We can see a very clear connection between this game and graph coloring. Here an edge produces a payoff of 1 for each of its two vertices if and only if it’s properly colored. And so if the strategy choice function $f$ is also a proper coloring, this will produce the largest possible payoff for all vertices in the graph. But it may not be the case that (for a fixed set of strategies) the graph is properly colorable, and we already know that finding a proper coloring with more than two colors is a computationally hard problem. So this isn’t a viable avenue for solving our fashion game. In any case, the connection confuses us enough to interchangeably call the strategy choice function $f$coloring of $G$.

As an interesting side note, a slight variation of this game was actually tested on humans (with money as payoff!) to see how well they could do. Each human player was only shown the strategies of their neighbors, and received $5 for every round in which they collectively arrived at a proper coloring of the graph. See this article in Science for more details. Since our game allows for the presence of improperly colored edges, we could instead propose to find an assignment of colors to vertices which maximizes the sum of the payoffs of all players. In this vein, we define the social welfare of a graph and a coloring, denoted $W(G,f)$, to be the sum of the payoffs for all vertices $\sum_v \mu_f(v)$. This is a natural quantity one wants to analyze. Unfortunately, even in the case of two strategies, this quantity is computationally difficult (NP-hard) to maximize. It’s a version of the MAX-CUT problem, in which we try to separate the graph into two sets $X, Y$ such that the largest number of edges crosses from $X$ to $Y$. The correspondence between the two problems is seen by having $X$ represent those vertices which get strategy 1 and $Y$ represent strategy 2. So we can’t hope to find an efficient algorithm maximizing social welfare. The next natural question is: can we find stable or strictly stable colorings at all? Will they even necessarily exist? The answers to these questions form the main results of our paper. ## An Algorithm for Stable Colorings, and the Price of Anarchy It turns out that there is a very simple greedy algorithm for finding stable colorings of a graph. We state it in the form of a proposition. By stable $k$-coloring we mean a stable coloring of the graph with $k$ colors (strategies). Proposition: For every graph $G$ and every $k \geq 1$, $G$ admits a stable $k$-coloring, and such a coloring can be found in polynomial time. Proof. The proof operates by using the social welfare function as a so-called potential function. That is, a change in a player’s strategy which results in a higher payoff results in a higher value of the social welfare function. It is easy to see why: if a player $v$ changes to a color that helps him, then it will result in more properly colored edges (adjacent to $v$) than there were before. This means that more of $v$‘s neighbors receive an additional 1 unit of payoff than those that lost 1 as a result of $v$‘s switch. We call a vertex which has the potential to improve its payoff unhappy, and one which cannot improve its payoff happy. And so our algorithm to find a stable coloring simply finds some unhappy vertex, switches its color to the most uncommon color among its neighbors, and repeats the process until all vertices are happy. Indeed, this is a local maximum of the social welfare function, and the very definition of a stable coloring. $\square$ So that was nice, but we might ask: how much worse is the social welfare arrived at by this algorithm than the optimal social welfare? How much do we stand to lose by accepting the condemnation of NP-hardness and settling for the greedy solution we found? More precisely, if we call $Q$ the set of stable colorings and $C$ the set of all possible colorings, what is the value of $\displaystyle \frac{\max_{c' \in C} W(G, c')}{\min_{c \in Q} W(G, c)}$ This is a well-studied quantity in many games, called the price of anarchy. The name comes from the thought: what do we stand to gain by having a central authority, who can see the entire network topology and decide what is best for us, manage our strategies? The alternative (anarchy) is to have each player in the game act as selfishly and rationally as possible without complete information. It turns out that as the number of strategies grows large in our anti-coordination game, there is no price of anarchy. For our game this obviously depends on the choice of graph, but we know what it is and we formally state the result as a proposition: Proposition: For any graph, the price of anarchy for the $k$ strategy anti-coordination game is at most $k/(k-1)$ and this value is actually achieved by some instances of the game. Proof. The pigeonhole principle says that every vertex can always achieve at least a $(k-1)/k$ fraction of its maximum possible payoff. Specifically, if a vertex $v_i$ can’t achieve a proper coloring, then every color must be accounted for among $v_i$‘s neighbors. Choosing the minimally occurring color will give $v_i$ at least a payoff of $d_i(k-1)/k$ where $d_i$ is the number of neighbors of $v_i$. Since every stable coloring has to satisfy the condition that no vertex can do any better than the strategy it already has, even in the worst stable coloring every vertex already has chosen such a minority color. Since the maximum payoff is twice the number of edges $2 |E|$, and the sum of the degrees $\sum_i d_i = 2 |E|$, we have that the price of anarchy is at most $\displaystyle \frac{2|E|}{\frac{k-1}{k} \sum_i d_i} = \frac{k}{k-1}$ Indeed, we can’t do any better than this in general, because the following graph gives an example where the price of anarchy exactly meets this bound. An instance of the anti-coordination game with 5 strategies which meets the upper bound on price of anarchy. This example can easily be generalized to work with arbitrary $k$. We leave the details as an exercise to the reader. $\square$ ## Strictly Stable Colorings are Hard to Find Perhaps surprisingly, the relatively minor change of adding strictness is enough to make computability intractable. We’ll give an explicit proof of this, but first let’s recall what it means to be intractable. Recall that a problem is in NP if there is an efficient (read, polynomial-time) algorithm which can verify a solution to the problem is actually a solution. For example, the problem of proper graph $k$-coloring is in NP, because if someone gives you a purported coloring all you have to do is verify that each of the $O(n^2)$ edges are properly colored. Similarly, the problem of strictly stable coloring is in NP; all one need do is verify that no choice of a different color for any vertex improves its payoff, and it is trivial to come up with an algorithm which checks this. Next, call a problem $A$ NP-hard if a solution to $A$ allows you to solve any problem in NP. More formally, $A$ being NP-hard means that there is a polynomial-time reduction from any problem in NP $B$ to $A$ in the following (rough) sense: there is a polynomial-time computable function (i.e. deterministic program) $f$ which takes inputs for $B$ and transforms them into inputs for $A$ such that: $w$ is a solvable instance of $B$ is if and only if $f(w)$ is solvable for $A$. This is not a completely formal definition (see this primer on NP-completeness for a more serious treatment), but it’s good enough for this post. In order to prove a problem $C$ is NP-hard, all you need to do is come up with a polynomial-time reduction from a known NP-hard problem $A$ to your new problem $C$. The composition of the reduction used for $A$ can be composed with the reduction for $C$ to get a new reduction proving $C$ is NP-hard. Finally, we call a problem NP-complete if it is both in NP and NP-hard. One natural question to ask is: if we don’t already know of any NP-hard problems, how can we prove anything is NP-hard? The answer is: it’s very hard, but it was done once and we don’t need to do it again (but if you really want to, see these notes). As a result, we have generated a huge list of problems that are NP-complete, and unless P = NP none of these algorithms have polynomial-time algorithms to solve them. We need two examples of NP-hard problems for this paper: graph coloring, and boolean satisfiability. Since we assume the reader is familiar with the former, we recall the latter. Given a set of variables $x_i$, we can form a boolean formula over those variables of the form $\varphi = C_1 \wedge C_2 \wedge \dots \wedge C_m$ where each clause $C_i$ is a disjunction of three literals (negated or unnegated variables). For example, $C_i = (x_2 \vee \bar{x_5} \vee \bar{x_9})$ might be one clause. Here interpret a formula as the $x_i$ having the value true or false, the horizontal bars denoting negation, the wedges $\wedge$ meaning “and” and the vees $\vee$ meaning “or.” We call this particular form conjunctive normal form. A formula $\varphi$ is called satisfiable if there is a choice of true and false assignments to the variables which makes the entire logical formula true. The problem of determining whether there is any satisfying assignment of such a formula, called 3-SAT, is NP-hard. Going back to strictly stable equilibria and anti-coordination games, we will prove that the problem of determining whether a graph has a strictly stable coloring with $k$ colors is NP-hard. As a consequence, finding such an equilibrium is NP-hard. Since the problem is also in NP, it is in fact NP-complete. Theorem: For all $k \geq 2$, the problem of determining whether a graph $G$ has a strictly stable coloring with $k$ colors is NP-complete. Proof. The hardest case is $k =2$, but $k \geq 3$ is a nice warmup to understand how a reduction works, so we start there. The $k \geq 3$ part works by reducing from graph coloring. That is, our reduction will take an input to the graph $k$-coloring problem (a graph $G$ whose $k$-colorability is in question) and we produce a graph $G'$ such that $G$ is $k$-colorable if and only if $G'$ has a strictly stable coloring with $k$ colors. Since graph coloring is hard for $k \geq 3$, this will prove our problem is NP-hard. More specifically, we will construct $G'$ in such a way that the strictly stable colorings also happen to be proper colorings! So finding a strictly stable coloring of $G'$ will immediately give us a proper coloring of $G$. The construction of $G'$ is quite straightforward. We start with $G' = G$, and then for each edge $e = (u,v)$ we add a new subgraph which we call $H_e$ that looks like: By $K_{k-2}$ we mean the complete graph on $k-2$ vertices (all possible edges are present), and the vertices $u,v$ are adjacent to all vertices of the $H_e = K_{k-2}$ part. That is, the graph $H_e \cup \left \{ u,v \right \}$ is the complete graph on $k$ vertices. Now if the original graph $G$ was $k$-colorable, then we can use the same colors for the corresponding vertices in $G'$, and extend to a proper coloring (and hence a strictly stable equilibrium) of all of $G'$. Indeed, for any $H_e$ we can use one different color for each vertex of the $K_{k-2}$ part if we don’t use either of the colors used for $u,v$, then we’ll have a proper coloring. On the other hand, if $G'$ has a strictly stable equilibrium, then no edge $e$ which originally came from $G$ can be improperly colored. If some edge $e = (u,v)$ were improperly colored, then there would be some vertex in the corresponding $H_e$ which is not strictly stable. To see this, notice that in the $k$ vertices among $H_e \cup \left \{ u,v \right \}$ there can be at most $k-1$ colors used, and so any vertex will always be able to switch to that color without hurting his payoff. That is, the coloring might be stable, but it won’t be strictly so. So strictly stable colorings are the same as proper colorings, and we already see that the subgraph $G \subset G'$ is $k$-colorable, completing the reduction. Well that was a bit of a cheap trick, but it shows the difficulty of working with strictly stable equilibria: preventing vertices from changing with no penalty is hard! What’s worse is that it’s still hard even if there are only two colors. The reduction here is a lot more complicated, so we’ll give a sketch of how it works. The reduction is from 3-SAT. So given a boolean formula $\varphi = C_1 \wedge \dots \wedge C_m$ we produce a graph $G$ so that $\varphi$ has a satisfying assignment if and only if $G$ has a strictly stable coloring with two colors. The principle part of the reduction is the following gadget which represents the logic inherent in a clause. We pulled the figure directly from our paper, since the caption gives a good explanation of how it works. To reiterate, the two “appendages” labeled by $x$ correspond to the literal $x$, and the choice of colors for these vertices correspond to truth assignments in $\varphi$. In particular, if the two vertices have the same color, then the literal is assigned true. Of course, we have to ensure that any $x$‘s showing up in other clause gadgets agree, and any $\bar{x}$‘s will have opposite truth values. That’s what the following two gadgets do: The gadget on the left enforces x’s to have the same truth assignment across gadgets (otherwise the center vertex won’t be in strict equilibrium). The gadget on the right enforces two literals to be opposites. And if we stitch the clauses together in a particular way (using the two gadgets above) then we will guarantee consistency across all of the literals. All that’s left to check is that the clauses do what they’re supposed to. That is, we need it to be the case that if all of the literals in a clause gadget are “false,” then we can’t complete the coloring to be strictly stable, and otherwise we can. Indeed, the following diagram gives all possible cases of this up to symmetry: The last figure deserves an explanation: if the three literals are all false, then we can pick any color we wish for $v_1$, and its two remaining neighbors must both have the same color (or else $v_1$ is not in strict equilibrium). Call this color $a$, and using the same reasoning call the neighbors of $v_2$ and $v_3$ $b$ and $c$, respectively. Now by the pigeonhole principle, either $a=b, b=c,$ or $b=c$. Suppose without loss of generality that $a=b$, then the edge labeled $(a,b)$ will have the $a$ part not in strict equilibrium (it will have two neighbors of its same color and only one of the other color). This shows that no strict equilibrium can exist. The reduction then works by taking a satisfying assignment for the variables, coloring the literals in $G$ appropriately, and extending to a strictly stable equilibrium of all of $G$. Conversely, if $G$ has a strictly stable coloring, then the literals must be consistent and each clause must be fully colorable, which the above diagram shows is the same as the clauses being satisfiable. So all of $\varphi$ is satisfiable and we’re done (excluding a few additional details we describe in the paper). $\square$ ## Directed Graphs and Cooperation That was the main result of our paper, but we go on to describe some interesting generalizations. Since this post is getting quite long, we’ll just give a quick overview of the interesting parts. The rest of the paper is dedicated to directed graphs, where we define the payoff of a directed edge $(u,v)$ to go to the $u$ player if $u$ and $v$ anti-coordinate, but $v$ gets nothing. Here the computational feasibility is even worse than it was in the undirected case, but the structure is noticeably more interesting. For the former, not only is in NP-hard to compute strictly stable colorings, it’s even NP-hard to do so in the non-strict case! One large part of the reason for this is that stable colorings might not even exist: a directed 3-cycle has no stable equilibrium. We use this fact as a tool in our reductions to prove the following theorem. Theorem: For all $k \geq 2$, determining whether a directed graph has a stable$latex k\$-coloring is NP-complete.

See section 5 of our paper for a full proof.

To address the interesting structure that arises in the directed case, we observe that we can use a directed graph to simulate the desire of one vertex to actually cooperate with another. To see this for two colors, instead of adding an edge $(u,v)$ we add a proxy edge $u'$ and directed edges $(u,u'), (u',v)$. To be in equilibrium, the proxy has no choice but to anti-coordinate with $v$, and this will give $u$ more incentive to cooperate with $v$ by anti-cooperating with its proxy. This can be extended to $k$ colors by using an appropriately (acyclically) directed copy of $K_{k-1}$.

## Thoughts, and Future Work

While the results in this paper are nice, and I’m particularly proud that I came up with a novel NP-hardness reduction, it is unfortunate that the only results in this paper were hardness results. Because of the ubiquity of NP-hard problems, it’s far more impressive to have an algorithm which actually does something (approximate a good solution, do well under some relaxed assumption, do well in expectation with some randomness) than to prove something is NP-hard. To get an idea of the tone set by researchers, NP-hardness results are often called “negative” results (in the sense that they give a “no” answer to the question of whether there is an efficient algorithm) and finding an algorithm that does something is called a positive result. That being said the technique of using two separate vertices to represent a single literal in a reduction proof is a nice trick, and I have since used it in another research paper, so I’m happy with my work.

On the positive side, though, there is some interesting work to be done. We could look at varying types of payoff structures, where instead of a binary payoff it is a function of the colors involved (say, $|i - j|$. Another interesting direction is to consider distributed algorithms (where each player operates independently and in parallel) and see what kinds of approximations of the optimal payoff can be achieved in that setting. Yet another direction favored by a combinatorialist is to generalize the game to hypergraphs, which makes me wonder what type of payoff structure is appropriate (payoff of 1 for a rainbow edge? a non-monochromatic edge?). There is also some more work that can be done in inspecting the relationship between cooperation and anti-cooperation in the directed version. Though I don’t have any immediate open questions about it, it’s a very interesting phenomenon.

In any event, I’m currently scheduled to give three talks about the results in this paper (one at the conference venue in Germany, and two at my department’s seminars). Here’s to starting off my research career!

# The Erdős-Rényi Random Graph

During the 1950′s the famous mathematician Paul Erdős and Alfred Rényi put forth the concept of a random graph and in the subsequent years of study transformed the world of combinatorics. The random graph is the perfect example of a good mathematical definition: it’s simple, has surprisingly intricate structure, and yields many applications.

In this post we’ll explore basic facts about random graphs, slowly detail a proof on their applications to graph theory, and explore their more interesting properties computationally (a prelude to proofs about their structure). We assume the reader is familiar with the definition of a graph, which we’ve written about at length for non-mathematical audiences, and has some familiarity with undergraduate-level probability and combinatorics for the more mathematical sections of the post. We’ll do our best to remind the reader of these prerequisites as we go, and we welcome any clarification questions in the comment section.

## The Erdős-Rényi Model

The definition of a random graph is simple enough that we need not defer it to the technical section of the article.

Definition: Given a positive integer $n$ and a probability value $0 \leq p \leq 1$, define the graph $G(n,p)$ to be the undirected graph on $n$ vertices whose edges are chosen as follows. For all pairs of vertices $v,w$ there is an edge $(v,w)$ with probability $p$.

Of course, there is no single random graph. What we’ve described here is a process for constructing a graph. We create a set of $n$ vertices, and for each possible pair of vertices we flip a coin (often a biased coin) to determine if we should add an edge connecting them. Indeed, every graph can be made by this process if one is sufficiently lucky (or unlucky), but it’s very unlikely that we will have no edges at all if $p$ is large enough. So $G(n,p)$ is really a probability distribution over the set of all possible graphs on $n$ vertices. We commit a horrendous abuse of notation by saying $G$ or $G(n,p)$ is a random graph instead of saying that $G$ is sampled from the distribution. The reader will get used to it in time.

## Why Do We Care?

Random graphs of all sorts (not just Erdős’s model) find applications in two very different worlds. The first is pure combinatorics, and the second is in the analysis of networks of all kinds.

In combinatorics we often wonder if graphs exist with certain properties. For instance, in graph theory we have the notion of graph colorability: can we color the vertices of a graph with $k$ colors so that none of its edges are monochromatic? (See this blog’s primer on graph coloring for more) Indeed, coloring is known to be a very difficult problem on general graphs. The problem of determining whether a graph can be colored with a fixed number of colors has no known efficient algorithm; it is NP-complete. Even worse, much of our intuition about graphs fails for graph coloring. We would expect that sparse-looking graphs can be colored with fewer colors than dense graphs. One naive way to measure sparsity of a graph is to measure the length of its shortest cycle (recall that a cycle is a path which starts and ends at the same vertex). This measurement is called the girth of a graph. But Paul Erdős proved using random graphs, as we will momentarily, that for any choice of integers $g,k$ there are graphs of girth $\geq g$ which cannot be colored with fewer than $k$ colors. Preventing short cycles in graphs doesn’t make coloring easier.

The role that random graphs play in this picture is to give us ways to ensure the existence of graphs with certain properties, even if we don’t know how to construct an example of such a graph. Indeed, for every theorem proved using random graphs, there is a theorem (or open problem) concerning how to algorithmically construct those graphs which are known to exist.

Pure combinatorics may not seem very useful to the real world, but models of random graphs (even those beyond the Erdős-Rényi model) are quite relevant. Here is a simple example. One can take a Facebook user $u$ and form a graph of that users network of immediate friends $N(u)$ (excluding $u$ itself), where vertices are people and two people are connected by an edge if they are mutual friends; call this the user’s friendship neighborhood. It turns out that the characteristics of the average Facebook user’s friendship neighborhood resembles a random graph. So understanding random graphs helps us understand the structure of small networks of friends. If we’re particularly insightful, we can do quite useful things like identify anomalies, such as duplicitous accounts, which deviate quite far from the expected model. They can also help discover trends or identify characteristics that can allow for more accurate ad targeting. For more details on how such an idea is translated into mathematics and code, see Modularity (we plan to talk about modularity on this blog in the near future; lots of great linear algebra there!).

Random graphs, when they exhibit observed phenomena, have important philosophical consequences. From a bird’s-eye view, there are two camps of scientists. The first are those who care about leveraging empirically observed phenomena to solve problems. Many statisticians fit into this realm: they do wonderful magic with data fit to certain distributions, but they often don’t know and don’t care whether the data they use truly has their assumed properties. The other camp is those who want to discover generative models for the data with theoretical principles. This is more like theoretical physics, where we invent an arguably computational notion of gravity whose consequences explain our observations.

For applied purposes, the Erdős-Rényi random graph model is in the second camp. In particular, if something fits in the Erdős-Rényi model, then it’s both highly structured (as we will make clear in the sequel) and “essentially random.” Bringing back the example of Facebook, this says that most people in the average user’s immediate friendship neighborhood are essentially the same and essentially random in their friendships among the friends of $u$. This is not quite correct, but it’s close enough to motivate our investigation of random graph models. Indeed, even Paul Erdős in his landmark paper mentioned that equiprobability among all vertices in a graph is unrealistic. See this survey for a more thorough (and advanced!) overview, and we promise to cover models which better represent social phenomena in the future.

So lets go ahead and look at a technical proof using random graphs from combinatorics, and then write some programs to generate random graphs.

## Girth and Chromatic Number, and Counting Triangles

As a fair warning, this proof has a lot of moving parts. Skip to the next section if you’re eager to see some programs.

Say we have a $k$ and a $g$, and we wonder whether a graph can exist which simultaneously has no cycles of length less than $g$ (the girth) and needs at least $k$ colors to color. The following theorem settles this question affirmatively.  A bit of terminology: the chromatic number of a graph $G$, denoted $\chi(G)$, is the smallest number of colors needed to properly color $G$.

Theorem: For any natural numbers $k,g$, there exist graphs of chromatic number at least $k$ and girth at least $g$.

Proof. Taking our cue from random graphs, let’s see what the probability is that a random graph $G(n,p)$ on $n$ vertices will have our desired properties. Or easier, what’s the chance that it will not have the right properties? This is essentially a fancy counting argument, but it’s nicer if we phrase it in the language of probability theory.

The proof has a few twists and turns for those uninitiated to the probabilistic method of proof. First, we will look at an arbitrary $G(n,p)$ (where $n,p$ are variable) and ask two questions about it: what is the expected number of short cycles, and what is the expected “independence number” (which we will see is related to coloring). We’ll then pick a value of $p$, depending crucially on $n$, which makes both of these expectations small. Next, we’ll use the fact that if the probability that $G(n,p)$ doesn’t have our properties is strictly less than 1, then there has to be some instance in our probability space which has those properties (if no instance had the property, then the probability would be one!). Though we will not know what the graphs look like, their existence is enough to prove the theorem.

So let’s start with cycles. If we’re given a desired girth of $g$, the expected number of cycles of length $\leq g$ in $G(n,p)$ can be bounded by $(np)^{g+1}/(np-1)$. To see this, the two main points are how to count the number of ways to pick $k$ vertices in order to form a cycle, and a typical fact about sums of powers. Indeed, we can think of a cycle of length $k$ as a way to seat a choice of $k$ people around a circular table. There are $\binom{n}{k}$ possible groups of people, and $(k-1)!$ ways to seat one group. If we fix $k$ and let $n$ grow, then the product $\binom{n}{k}(k-1)!$ will eventually be smaller than $n^k$ (actually, this happens almost immediately in almost all cases). For each such choice of an ordering, the probability of the needed edges occurring to form a cycle is $p^j$, since all edges must occur independently of each other with probability $p$.

So the probability that we get a cycle of $j$ vertices is

$\displaystyle \binom{n}{j}(j-1)!p^j$

And by the reasoning above we can bound this by $n^jp^j$. Summing over all numbers $j = 3, \dots, g$ (we are secretly using the union bound), we bound the expected number of cycles of length $\leq g$ from above:

$\displaystyle \sum_{j=3}^g n^j p^j < \sum_{j=0}^g n^j p^j = \frac{(np)^{g+1}}{np - 1}$

Since we want relatively few cycles to occur, we want it to be the case that the last quantity, $(np)^{j+1}/(np-1)$, goes to zero as $n$ goes to infinity. One trick is to pick $p$ depending on $n$. If $p = n^l$, our upper bound becomes $n^{(l+1)(g+1)} / (n^{1+l} - 1)$, and if we want the quantity to tend to zero it must be that $(l+1)(g+1) < 1$. Solving this we get that $-1 < l < \frac{1}{g+1} - 1 < 0$. Pick such an $l$ (it doesn’t matter which), and keep this in mind: for our choice of $p$, the expected number of cycles goes to zero as $n$ tends to infinity.

On the other hand, we want to make sure that such a graph has high chromatic number. To do this we’ll look at a related property: the size of the largest independent set. An independent set of a graph $G = (V,E)$ is a set of vertices $S \subset V$ so that there are no edges in $E$ between vertices of $S$. We call $\alpha(G)$ the size of the largest independent set. The values $\alpha(G)$ and $\chi(G)$ are related, because any time you have an independent set you can color all the vertices with a single color. In particular, this proves the inequality $\chi(G) \alpha(G) \geq n$, the number of vertices of $G$, or equivalently $\chi(G) \geq n / \alpha(G)$. So if we want to ensure $\chi(G)$ is large, it suffices to show $\alpha(G)$ is small (rigorously, $\alpha(G) \leq n / k$ implies $\chi(G) \geq k$).

The expected number of independent sets (again using the union bound) is at most the product of the number of possible independent sets and the probability of one of these having no edges. We let $r$ be arbitrary and look at independent sets of size $r$ Since there are $\binom{n}{r}$ sets and each has a probability $(1-p)^{\binom{r}{2}}$ of being independent, we get the probability that there is an independent set of size $r$ is bounded by

$\displaystyle \textup{P}(\alpha(G) \geq r) \leq \binom{n}{r}(1-p)^{\binom{r}{2}}$

We use the fact that $1-x < e^{-x}$ for all $x$ to translate the $(1-p)$ part. Combining this with the usual $\binom{n}{r} \leq n^r$ we get the probability of having an independent set of size $r$ at most

$\displaystyle \textup{P}(\alpha(G) \geq r) \leq \displaystyle n^re^{-pr(r-1)/2}$

Now again we want to pick $r$ so that this quantity goes to zero asymptotically, and it’s not hard to see that $r = \frac{3}{p}\log(n)$ is good enough. With a little arithmetic we get the probability is at most $n^{(1-a)/2}$, where $a > 1$.

So now we have two statements: the expected number of short cycles goes to zero, and the probability that there is an independent set of size at least $r$ goes to zero. If we pick a large enough $n$, then the expected number of short cycles is less than $n/5$, and using Markov’s inequality we see that the probability that there are more than $n/2$ cycles of length at most $g$ is strictly less than 1/2. At the same time, if we pick a large enough $n$ then $\alpha(G) \geq r$ with probability strictly less than 1/2. Combining these two (once more with the union bound), we get

$\textup{P}(\textup{at least } n/2 \textup{ cycles of length } \leq g \textup{ and } \alpha(G) \geq r) < 1$

Now we can conclude that for all sufficiently large $n$ there has to be a graph on at least $n$ vertices which has neither of these two properties! Pick one and call it $G$. Now $G$ still has cycles of length $\leq g$, but we can fix that by removing a vertex from each short cycle (it doesn’t matter which). Call this new graph $G'$. How does this operation affect independent sets, i.e. what is $\alpha(G')$? Well removing vertices can only decrease the size of the largest independent set. So by our earlier inequality, and calling $n'$ the number of vertices of $G'$, we can make a statement about the chromatic number:

$\displaystyle \chi(G') \geq \frac{n'}{\alpha(G')} \geq \frac{n/2}{\log(n) 3/p} = \frac{n/2}{3n^l \log(n)} = \frac{n^{1-l}}{6 \log(n)}$

Since $-1 < l < 0$ the numerator grows asymptotically faster than the denominator, and so for sufficiently large $n$ the chromatic number will be larger than any $k$ we wish. Hence we have found a graph with girth at least $g$ and chromatic number at least $k$.

$\square$

## Connected Components

The statistical properties of a random graph are often quite easy to reason about. For instance, the degree of each vertex in $G(n,p)$ is $np$ in expectation. Local properties like this are easy, but global properties are a priori very mysterious. One natural question we can ask in this vein is: when is $G(n,p)$ connected? We would very much expect the answer to depend on how $p$ changes in relation to $n$. For instance, $p$ might look like $p(n) = 1/n^2$ or $\log(n) / n$ or something similar. We could ask the following question:

As $n$ tends to infinity, what limiting proportion of random graphs $G(n,p)$ are connected?

Certainly for some $p(n)$ which are egregiously small (for example, $p(n) = 0$), the answer will be that no graphs are connected. On the other extreme, if $p(n) = 1$ then all graphs will be connected (and complete graphs, at that!). So our goal is to study the transition phase between when the graphs are disconnected and when they are connected. A priori this boundary could be a gradual slope, where the proportion grows from zero to one, or it could be a sharp jump. Next time, we’ll formally state and prove the truth, but for now let’s see if we can figure out which answer to expect by writing an exploratory program.

We wrote the code for this post in Python, and as usual it is all available for download on this blog’s Github page.

We start with a very basic Node class to represent each vertex in a graph, and a function to generate random graphs

import random
class Node:
def __init__(self, index):
self.index = index
self.neighbors = []

def __repr__(self):
return repr(self.index)

def randomGraph(n,p):
vertices = [Node(i) for i in range(n)]
edges = [(i,j) for i in xrange(n) for j in xrange(i) if random.random() < p]

for (i,j) in edges:
vertices[i].neighbors.append(vertices[j])
vertices[j].neighbors.append(vertices[i])

return vertices


The randomGraph function simply creates a list of edges chosen uniformly at random from all possible edges, and then constructs the corresponding graph. Next we have a familiar sight: the depth-first search. We use it to compute the graph components one by one (until all vertices have been found in some run of a DFS).

def dfsComponent(node, visited):
for v in node.neighbors:
if v not in visited:
dfsComponent(v, visited)

def connectedComponents(vertices):
components = []
cumulativeVisited = set()

for v in vertices:
if v not in cumulativeVisited:
componentVisited = set([v])
dfsComponent(v, componentVisited)

components.append(componentVisited)
cumulativeVisited |= componentVisited

return components


The dfsComponent function simply searches in breadth-first fashion, adding every vertex it finds to the “visited” set.  The connectedComponents function keeps track of the list of components found so far, as well as the cumulative set of all vertices found in any run of bfsComponent. Hence, as we iterate through the vertices we can ignore vertices we’ve found in previous runs of bfsComponent. The “x |= y” notation is python shorthand for updating x via a union with y.

Finally, we can make a graph of the largest component of (independently generated) random graphs as the probability of an edge varies.

def sizeOfLargestComponent(vertices):
return max(len(c) for c in connectedComponents(vertices))

def graphLargestComponentSize(n, theRange):
return [(p, sizeOfLargestComponent(randomGraph(n, p))) for p in theRange]


Running this code and plotting it for $p$ varying from zero to 0.5 gives the following graph.

The blue line is the size of the largest component, and the red line gives a moving average estimate of the data.  As we can see, there is a very sharp jump peaking at $p=0.1$ at which point the whole graph is connected. It would appear there is a relatively quick “phase transition” happening here. Let’s zoom in closer on the interesting part.

It looks like the transition begins around 0.02, and finishes at around 0.1. Interesting… Let’s change the parameters a bit, and increase the size of the graph. Here’s the same chart (in the same range of $p$ values) for a graph with a hundred vertices.

Now the phase transition appears to have shifted to about $(0.01, 0.05)$, which is about multiplying the endpoints of the phase transition interval above by 1/2. The plot thickens… Once more, let’s move up to a graph on 500 vertices.

Again it’s too hard to see, so let’s zoom in.

This one looks like the transition starts at 0.002 and ends at 0.01. This is a 5-fold decrease from the previous one, and we increased the number of vertices by 5. Could this be a pattern? Here’s a conjecture to formalize it:

Conjecture: The random graph $G(n,p)$ enters a phase transition at $p=1/n$ and becomes connected almost surely at $p=5/n$.

This is not quite rigorous enough to be a true conjecture, but it sums up our intuition that we’ve learned so far. Just to back this up even further, here’s an animation showing the progression of the phase transition as $n = 20 \dots 500$ in steps of twenty. Note that the $p$ range is changing to maintain our conjectured window.

Looks pretty good. Next time we’ll see some formal mathematics validating our intuition (albeit reformulated in a nicer way), and we’ll continue to investigate other random graph models.

Until then!