The Giant Component and Explosive Percolation

Last time we left off with a tantalizing conjecture: a random graph with edge probability p = 5/n is almost surely a connected graph. We arrived at that conjecture from some ad-hoc data analysis, so let’s go back and treat it with some more rigorous mathematical techniques. As we do, we’ll discover some very interesting “threshold theorems” that essentially say a random graph will either certainly have a property, or it will certainly not have it.

[Image: phase-transition-n-grows]

The phase transition we empirically observed from last time.

Big components

Recalling the basic definition: an Erdős-Rényi (ER) random graph with n vertices and edge probability p is a probability distribution over all graphs on n vertices. Generatively, you draw from an ER distribution by flipping a p-biased coin for each pair of vertices, and adding the edge if you flip heads. We call the random event of drawing a graph from this distribution a “random graph” even though it’s not a graph, and we denote an ER random graph by G(n,p). When p = 1/2, the distribution G(n,1/2) is the uniform distribution over all graphs on n vertices.

Now let’s get to some theorems. The main tools we’ll use are called the first and second moment method. Let’s illustrate them by example.

The first moment method

Say we want to know what values of p are likely to produce graphs with isolated vertices (vertices with no neighbors), and which are not. Of course, the value of p will depend on n in general, but we can already see by example that if p = 1/2 then the probability of a fixed vertex being isolated is 2^{-(n-1)} \to 0. We can use the union bound (sum this value over all vertices) to show that the probability of any vertex being isolated is at most n 2^{-(n-1)}, which also tends to zero very quickly. This is not the first moment method; I’m just making the point that all of our results will be interpreted asymptotically as n \to \infty.

So now we can ask: what is the expected number of isolated vertices? If I call X the random variable that counts the number of isolated vertices, then I’m asking about \mathbb{E}[X]. Really what I’m doing is interpreting X as a random variable depending on n, p(n), and asking about the evolution of \mathbb{E}[X] as n \to \infty.

Now the first moment method states, somewhat obviously, that if the expectation tends to zero then with probability tending to 1 the value of X is zero as well. Indeed, this follows from Markov’s inequality, which states that the probability that X \geq a is bounded by \mathbb{E}[X]/a. In symbols,

\displaystyle \Pr[X \geq a] \leq \frac{\mathbb{E}[X]}{a}.

In our case X is counting something (it’s integer valued), so asking whether X > 0 is equivalent to asking whether X \geq 1. The upper bound on the probability of X being strictly positive is then just \mathbb{E}[X].

So let’s find out when the expected number of isolated vertices goes to zero. We’ll use the wondrous linearity of expectation to split X into a sum of counts for each vertex. That is, if X_i is 1 when vertex i is isolated and 0 otherwise (this is called an indicator variable), then X = \sum_{i=1}^n X_i and linearity of expectation gives

\displaystyle \mathbb{E}[X] = \mathbb{E}[\sum_{i=1}^n X_i] = \sum_{i=1}^n \mathbb{E}[X_i]

Now the expectation of an indicator random variable is just the probability that the event occurs (it’s trivial to check). It’s easy to compute the probability that a vertex is isolated: it’s (1-p)^{n-1}, since it must avoid an edge to each of the other n-1 vertices. So the sum above works out to be n(1-p)^{n-1}, and we’ll write it as n(1-p)^n since the extra factor of (1-p) doesn’t change anything. The question is: what’s the “smallest” way to set p as a function of n in order to make the above quantity go to zero? Using the fact that (1-x) < e^{-x} for all x > 0, we get

n(1-p)^n < ne^{-pn}

And setting p = (\log n) / n simplifies the right hand side to ne^{- \log n} = n / n = 1. This is almost what we want, so let’s set p to be anything that grows asymptotically faster than (\log n) / n. The notation for this is \omega((\log n) / n). Then using some slick asymptotic notation we can prove that the RHS of the inequality above goes to zero, and so the LHS must as well. Back to the big picture: we just showed that the expectation of X (the expected number of isolated vertices) goes to zero, and so by the first moment method the value of X (the actual number of isolated vertices) has to go to zero with probability tending to 1.

Some quick interpretations: when p = (\log n) / n each vertex has about \log n neighbors in expectation. Moreover, having no isolated vertices is just a little bit short of the entire graph being connected (our ultimate goal is to figure out exactly when this happens). But already we can see that our conjecture from the beginning is probably false: we aren’t able to use this same method to show that p = c/n for some constant c rules out isolated vertices as n \to \infty. We just got lucky in our data analysis that 5 is about the natural log of 100 (which is 4.6).
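Before moving on, here’s a quick simulation to see the threshold in action, using only the standard library (a sketch with names and parameters of my choosing, not optimized code):

import random, math

def countIsolated(n, p):
   # sample G(n, p) and count the vertices with no neighbors
   isolated = [True] * n
   for i in range(n):
      for j in range(i + 1, n):
         if random.random() < p:
            isolated[i] = isolated[j] = False
   return sum(isolated)

n = 500
for scale in [0.5, 1, 2]:
   p = scale * math.log(n) / n
   average = sum(countIsolated(n, p) for _ in range(20)) / 20
   print(scale, average)

Since the expected count is about n^{1 - \textup{scale}}, you should see a couple dozen isolated vertices on average below the threshold, about one right at it, and essentially none above it.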

The second moment method

Now what about the other side of the coin? If p is asymptotically less than (\log n) / n do we necessarily get isolated vertices? That would really put our conjecture to rest. In this case the answer is yes, but the analogous implication can fail in general. Let’s discuss.

We said that in general if \mathbb{E}[X] \to 0 then the value of X has to go to zero too (that’s the first moment method). The flip side of this is: if \mathbb{E}[X] \to \infty does the value of X necessarily tend to infinity? The answer is not always yes. Here is a gruesome example I originally heard from a book: say X is the number of people that will die in the next decade due to an asteroid hitting the earth. The probability that the event happens is quite small, but if it does happen then the number of people that will die is quite large. A rare but enormous outcome like this can drag the expectation up (and it is dragged further up as the world population grows every decade), even though by far the most likely value of X is zero.

Mathematics is on our side here. We’re asking under what conditions on \mathbb{E}[X] does the following implication hold: \mathbb{E}[X] \to \infty implies \Pr[X > 0] \to 1.

With the first moment method we used Markov’s inequality (a statement about expectation, also called the first moment). With the second moment method we’ll use a statement about the second moment (variances), and the most common is Chebyshev’s inequality. Chebyshev’s inequality states that the probability X deviates from its expectation by more than c is bounded by \textup{Var}[X] / c^2. In symbols, for all c > 0 we have

\displaystyle \Pr[|X - \mathbb{E}[X]| \geq c] \leq \frac{\textup{Var}[X]}{c^2}

Now the opposite of X > 0, written in terms of deviation from expectation, is |X - \mathbb{E}[X]| \geq \mathbb{E}[X]. In words, in order for any number a to be zero, it has to have a distance of at least b from any number b. It’s such a stupidly simple statement it’s almost confusing. So then we’re saying that

\displaystyle \Pr[X = 0] \leq \frac{\textup{Var}[X]}{\mathbb{E}[X]^2}.

In order to make this probability go to zero, it’s enough to have \textup{Var}[X] = o(\mathbb{E}[X]^2). Again, the little-o means “grows asymptotically slower than.” So the numerator of the fraction on the RHS will grow asymptotically slower than the denominator, meaning the whole fraction tends to zero. This condition and its implication are together called the “second moment method.”

Great! So we just need to compute \textup{Var}[X] and check what conditions on p make it fit the theorem. Recall that \textup{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2, and we want to upper bound this in terms of \mathbb{E}[X]^2. Let’s compute \mathbb{E}[X]^2 first.

\displaystyle \mathbb{E}[X]^2 = n^2(1-p)^{2n}

Now the variance.

\displaystyle \textup{Var}[X] = \mathbb{E}[X^2] - n^2(1-p)^{2n}

Expanding X as a sum of indicator variables X_i for each vertex, we can split the square into a sum over pairs. Note that X_i^2 = X_i since they are 0-1 valued indicator variables, and X_iX_j is the indicator variable for both events happening simultaneously.

\displaystyle \begin{aligned} \mathbb{E}[X^2] &= \mathbb{E} \left [ \sum_{i,j} X_i X_j \right ] \\ &=\mathbb{E} \left [ \sum_i X_i^2 + \sum_{i \neq j} X_iX_j \right ] \\ &= \sum_i \mathbb{E}[X_i^2] + \sum_{i \neq j} \mathbb{E}[X_iX_j] \end{aligned}

By what we said about indicators, the last line is just

\displaystyle \sum_i \Pr[i \textup{ is isolated}] + \sum_{i \neq j} \Pr[i,j \textup{ are both isolated}]

And we can compute each of these pieces quite easily. They are (asymptotically ignoring some constants):

\displaystyle n(1-p)^n + n^2(1-p)(1-p)^{2n-4}

Now combining the two terms together (subtracting off the square of the expectation),

\displaystyle \begin{aligned} \textup{Var}[X] &\leq n(1-p)^n + n^2(1-p)^{-3}(1-p)^{2n} - n^2(1-p)^{2n} \\ &= n(1-p)^n + n^2(1-p)^{2n} \left ( (1-p)^{-3} - 1 \right ) \end{aligned}

Now we divide by \mathbb{E}[X]^2 to get n^{-1}(1-p)^{-n} + (1-p)^{-3} - 1. Since we’re trying to see if p = (\log n) / n is a sharp threshold, the natural choice is to let p = o((\log n) / n). Indeed, using the 1-x < e^{-x} upper bound and plugging in the little-o bounds the whole quantity by

\displaystyle \frac{1}{n}e^{o(\log n)} + o(n^{1/n}) - 1 = o(1)

i.e., the whole thing tends to zero, as desired.

Other thresholds

So we just showed that the property of having no isolated vertices in a random graph has a sharp threshold at p = (\log n) / n. Meaning at any larger probability the graph is almost surely devoid of isolated vertices, and at any lower probability the graph almost surely has some isolated vertices.

This might seem like a miracle theorem, but there turn out to be similar theorems for lots of properties. Most of them you can also prove using basically the same method we’ve been using here. I’ll list some below. Also note they are all sharp, two-sided thresholds in the same way that the isolated vertex threshold is.

  • The existence of a component of size \omega(\log (n)) has a threshold of 1/n.
  • p = c/n for any c > 0 is a threshold for the existence of a giant component of linear size \Theta(n). Moreover, above this threshold no other components will have size \omega(\log n). (A simulation sketch of this jump appears after this list.)
  • In addition to (\log n) / n being a threshold for having no isolated vertices, it is also a threshold for connectivity.
  • p = (\log n + \log \log n + c(n)) / n is a sharp threshold for the existence of Hamiltonian cycles in the following sense: if c(n) = \omega(1) then there will be a Hamiltonian cycle almost surely, if c(n) \to -\infty there will be no Hamiltonian cycle almost surely, and if c(n) \to c the probability of a Hamiltonian cycle is e^{-e^{-c}}. This was proved by Komlós and Szemerédi in 1983. Moreover, there is an efficient algorithm that, with high probability, finds Hamiltonian cycles in these random graphs when they exist.
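Here’s a sketch of how you might watch the giant component threshold in action: sample G(n, c/n) and measure the largest component with a breadth-first search (the function name and parameters are mine):

import random
from collections import deque

def largestComponentSize(n, p):
   # sample G(n, p) as adjacency lists
   neighbors = [[] for _ in range(n)]
   for i in range(n):
      for j in range(i + 1, n):
         if random.random() < p:
            neighbors[i].append(j)
            neighbors[j].append(i)

   # breadth-first search from every not-yet-visited vertex
   visited = [False] * n
   biggest = 0
   for start in range(n):
      if not visited[start]:
         visited[start] = True
         queue, size = deque([start]), 1
         while queue:
            u = queue.popleft()
            for v in neighbors[u]:
               if not visited[v]:
                  visited[v] = True
                  size += 1
                  queue.append(v)
         biggest = max(biggest, size)
   return biggest

n = 2000
for c in [0.5, 1.5]:
   print(c, largestComponentSize(n, c / n))

Below the threshold (c = 0.5) the largest component should have only a few dozen vertices, while above it (c = 1.5) it should contain a constant fraction of all n vertices.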

Explosive Percolation

So now we know that as the probability of an edge increases, at some point the graph will spontaneously become connected; at a probability roughly a factor of \log(n) smaller than that, the so-called “giant component” will emerge and quickly engulf the entire graph.

Here’s a different perspective on this situation originally set forth by Achlioptas, D’Souza, and Spencer in 2009. It has since become called an “Achlioptas process.”

The idea is that you are watching a random graph grow. Rather than think about random graphs as having a probability above or below some threshold, you can think of it as the number of edges growing (so the thresholds will all be multiplied by n). Then you can imagine that you start with an empty graph, and at every time step someone is adding a new random edge to your graph. Fine, eventually you’ll get so many edges that a giant component emerges and you can measure when that happens.

But now imagine that instead of being given a single random new edge, you are given a choice. Say God presents you with two random edges, and you must pick which to add to your graph. Obviously you will eventually still get a giant component, but the question is how long can you prevent it from occurring? That is, how far back can we push the threshold for the emergence of the giant component by cleverly selecting the new edge?
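Here’s a sketch of one popular selection rule from this literature, the “product rule”: of the two offered edges, keep the one whose endpoints lie in components with the smaller product of sizes. The union-find bookkeeping below is my own scaffolding:

import random

def achlioptasProductRule(n, steps):
   # union-find with component sizes, to track the largest component
   parent = list(range(n))
   size = [1] * n

   def find(x):
      while parent[x] != x:
         parent[x] = parent[parent[x]]   # path halving
         x = parent[x]
      return x

   def sizeProduct(edge):
      return size[find(edge[0])] * size[find(edge[1])]

   biggest, history = 1, []
   for _ in range(steps):
      e1 = random.sample(range(n), 2)
      e2 = random.sample(range(n), 2)
      # keep the edge joining components with the smaller product of sizes
      u, v = min(e1, e2, key=sizeProduct)
      ru, rv = find(u), find(v)
      if ru != rv:
         if size[ru] < size[rv]:
            ru, rv = rv, ru
         parent[rv] = ru
         size[ru] += size[rv]
         biggest = max(biggest, size[ru])
      history.append(biggest)
   return history

Plotting history against the same process that always keeps e1 (ordinary one-edge-at-a-time growth) shows the product rule holding the largest component down noticeably longer, followed by a much steeper climb.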

What Achlioptas and company conjectured was that you can push it back (some), but that when you push it back as far as it can go, the threshold becomes discontinuous. That is, they believed there was a constant \delta \geq 1/2 such that the size of the largest component jumps from o(n) to \delta n in o(n) steps.

This turned out to be false, as Riordan and Warnke proved. Nevertheless, the idea has been interpreted in an interesting light. People have claimed it is a useful model of disaster in the following sense. Imagine that an edge between two vertices is a “crisis” relating two entities. Then in every step God presents you with two crises and you only have the resources to fix one. The idea is that when the entire graph is connected, you have this one big disaster where all the problems are interacting with each other. The percolation process describes how long you can “survive” while avoiding the big disaster.

There are critiques of this interpretation, though, mainly about how simplistic it is. In particular, an Achlioptas process models a crisis as an exogenous force when in reality problems are usually endogenous. You don’t expect a meteor to hit the Earth, but you do expect humans to have an impact on the environment. Also, not everybody in the network is trying to avoid errors. Some companies thrive in economic downturns by managing your toxic assets, for example. So one could reasonably argue that Achlioptas processes aren’t complex enough to model the realistic types of disasters we face.

Either way, I find it fantastic that something like a random graph (which for decades was securely in pure combinatorics away from applications) is spurring such discussion.

Next time, we’ll take one more dive into the theory of Erdős-Rényi random graphs to prove a very “meta” theorem about sharp thresholds. Then we’ll turn our attention to other models of random graphs, hopefully more realistic ones 🙂

Until then!

When Greedy Algorithms are Good Enough: Submodularity and the (1 – 1/e)-Approximation

Greedy algorithms are among the simplest and most intuitive algorithms known to humans. Their name essentially gives their description: do the thing that looks best right now, and repeat until nothing looks good anymore or you’re forced to stop. Some of the best situations in computer science are also when greedy algorithms are optimal or near-optimal. There is a beautiful theory of this situation, known as the theory of matroids. We haven’t covered matroids on this blog (edit: we did), but in this post we will focus on the next best thing: when the greedy algorithm guarantees a reasonably good approximation to the optimal solution.

This situation isn’t hard to formalize, and we’ll make it as abstract as possible. Say you have a set of objects X, and you’re looking to find the “best” subset S \subset X. Here “best” is just measured by a fixed (known, efficiently computable) objective function f : 2^X \to \mathbb{R}. That is, f accepts as input subsets of X and outputs numbers so that better subsets have larger numbers. Then the goal is to find a subset maximizing f.

In this generality the problem is clearly impossible. You’d have to check all subsets to be sure you didn’t miss the best one. So what conditions do we need on either X or f or both that makes this problem tractable? There are plenty you could try, but one very rich property is submodularity.

The Submodularity Condition

I think the simplest way to explain submodularity is in terms of coverage. Say you’re starting a new radio show and you have to choose which radio stations to broadcast from to reach the largest number of listeners. For simplicity say each radio station has one tower it broadcasts from, and you have a good estimate of the number of listeners you would reach if you broadcast from a given tower. For more simplicity, say it costs the same to broadcast from each tower, and your budget restricts you to a maximum of ten stations to broadcast from. So the question is: how do you pick towers to maximize your overall reach?

The hidden condition here is that some towers overlap in which listeners they reach. So if you broadcast from two towers in the same city, a listener who has access to both will just pick one or the other. In other words, there’s a diminished benefit to picking two overlapping towers if you already have chosen one.

In our version of the problem, picking both of these towers has some small amount of “overkill.”

This “diminishing returns” condition is a general idea you can impose on any function that takes in subsets of a given set and produces numbers. If X is a set, then for a strange reason we denote by 2^X the set of all subsets of X. So we can state this condition more formally,

Definition: Let X be a finite set. A function f: 2^X \to \mathbb{R} is called submodular if for all subsets S \subset T \subset X and all x \in X \setminus T,

\displaystyle f(S \cup \{ x \}) - f(S) \geq f(T \cup \{ x \}) - f(T)

In other words, if f measures “benefit,” then the marginal benefit of adding x to S is at least as high as the marginal benefit of adding it to T. Since S \subset T and x are all arbitrary, this is as general as one could possibly make it: adding x to a bigger set can’t be better than adding it to a smaller set.
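To make the definition concrete, here’s a small Python check that a toy version of the radio-tower coverage function satisfies the inequality (the data is made up):

from itertools import combinations

# each tower covers a set of listeners;
# f(S) counts the listeners covered by the towers in S
coverage = {
   'a': {1, 2, 3},
   'b': {3, 4},
   'c': {4, 5, 6},
   'd': {1, 6},
}

def f(S):
   return len(set().union(*(coverage[x] for x in S)))

# check the inequality for every S inside T and every x outside T
towers = list(coverage)
for r in range(len(towers) + 1):
   for T in combinations(towers, r):
      for s in range(len(T) + 1):
         for S in combinations(T, s):
            for x in set(towers) - set(T):
               S2, T2 = set(S), set(T)
               assert f(S2 | {x}) - f(S2) >= f(T2 | {x}) - f(T2)

print("the coverage function is submodular on this example")

Coverage functions like this one are always submodular: a new tower’s marginal contribution can only shrink as the already-covered audience grows.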

Before we start doing things with submodular functions, let’s explore some basic properties. The first is an equivalent definition of submodularity:

Proposition: f is submodular if and only if for all A, B \subset X, it holds that

\displaystyle f(A \cap B) + f(A \cup B) \leq f(A) + f(B).

Proof. If we assume f has the condition from this proposition, then we can set A=T, B=S \cup \{ x \}, and the formula just works out. Conversely, if we have the condition from the definition, then using the fact that A \cap B \subset B we can inductively apply the inequality to each element of A \setminus B to get

\displaystyle f(A \cup B) - f(B) \leq f(A) - f(A \cap B)

\square

Next, we can tweak and combine submodular functions to get more submodular functions. In particular, non-negative linear combinations of submodular functions are submodular. In other words, if f_1, \dots, f_k are submodular on the same set X, and \alpha_1, \dots, \alpha_k are all non-negative reals, then \alpha_1 f_1 + \dots + \alpha_k f_k is also a submodular function on X. It’s an easy exercise in applying the definition to see why this is true. This is important because when we’re designing objectives to maximize, we can build them from simple submodular pieces and then pick an appropriate combination of those pieces.

The second property we need to impose on a submodular function is monotonicity. That is, as your sets get more elements added to them, their value under f only goes up. In other words, f is monotone if S \subset T implies f(S) \leq f(T). An interesting property of functions that are both submodular and monotone is that the truncation of such a function is also submodular and monotone. In other words, \textup{min}(f(S), c) is still monotone and submodular when f is monotone and submodular and c is a constant.

Submodularity and Monotonicity Give 1 – 1/e

The wonderful thing about submodular functions is that we have a lot of great algorithmic guarantees for working with them. We’ll prove right now that the coverage problem (while it might be hard to solve in general) can be approximated pretty well by the greedy algorithm.

Here’s the algorithmic setup. I give you a finite set X and an efficient black-box to evaluate f(S) for any subset S \subset X you want. I promise you that f is monotone and submodular. Now I give you an integer k between 1 and the size of X, and your task is to quickly find a set S of size k for which f(S) is maximal among all subsets of size k. That is, you design an algorithm that will work for any k, X, f and runs in polynomial time in the sizes of X, k.

In general this problem is NP-hard, meaning you’re not going to find a solution that works in the worst case (if you do, don’t call me; just claim your million dollar prize). So how well can we approximate the optimal value for f(S) by a different set of size k? The beauty is that, if your function is monotone and submodular, you can guarantee to get within 63% of the optimum. The hope (and reality) is that in practice it will often perform much better, but still this is pretty good! More formally,

Theorem: Let f be a monotone, submodular, non-negative function on X. The greedy algorithm, which starts with S as the empty set and at every step picks an element x which maximizes the marginal benefit f(S \cup \{ x \}) - f(S), provides a set S that achieves a (1- 1/e)-approximation of the optimum.

We’ll prove this in just a little bit more generality, and the generality is quite useful. If we call S_1, S_2, \dots, S_l the sets chosen by the greedy algorithm (where now we might run the greedy algorithm for l > k steps), then for all l, k, we have

\displaystyle f(S_l) \geq \left ( 1 - e^{-l/k} \right ) \max_{T: |T| \leq k} f(T)

This allows us to run the algorithm for more than k steps to get a better approximation by sets of larger size, and quantify how much better the guarantee on that approximation would be. It’s like an algorithmic way of hedging your risk. So let’s prove it.

Proof. Let’s set up some notation first. Fix your l and k, call S_i the set chosen by the greedy algorithm at step i, and call S^* the optimal subset of size k. Further call \textup{OPT} the value of the best set f(S^*). Call x_1^*, \dots, x_k^* the elements of S^* (the order is irrelevant). Now for every i < l monotonicity gives us f(S^*) \leq f(S^* \cup S_i). We can unravel this into a sum of marginal gains of adding single elements. The first step is

\displaystyle f(S^* \cup S_i) = f(S^* \cup S_i) - f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i) + f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i)

The second step removes x_{k-1}^* from the last term, the third removes x_{k-2}^*, and so on until we have removed all of S^* and get this sum

\displaystyle f(S^* \cup S_i) = f(S_i) + \sum_{j=1}^k \left ( f(S_i \cup \{ x_1^*, \dots, x_j^* \}) - f(S_i \cup \{ x_1^*, \dots, x_{j-1}^* \} ) \right )

Now, applying submodularity, we can change all of these marginal benefits of “adding one more S^* element to S_i already with some S^* stuff” to “adding one more S^* element to just S_i.” In symbols, the equation above is at most

\displaystyle f(S_i) + \sum_{x \in S^*} \left ( f(S_i \cup \{ x \}) - f(S_i) \right )

and because S_{i+1} is greedily chosen to maximize the benefit of adding a single element, the above is at most

\displaystyle f(S_i) + \sum_{x \in S^*} \left ( f(S_{i+1}) - f(S_i) \right ) = f(S_i) + k(f(S_{i+1}) - f(S_i))

Chaining all of these together, we have f(S^*) - f(S_i) \leq k(f(S_{i+1}) - f(S_i)). If we call a_{i} = f(S^*) - f(S_i), then this inequality can be rewritten as a_{i+1} \leq (1 - 1/k) a_{i}. Now by induction we can relate a_l \leq (1 - 1/k)^l a_0. Now use the fact that a_0 \leq f(S^*) and the common inequality 1-x \leq e^{-x} to get

\displaystyle a_l = f(S^*) - f(S_l) \leq e^{-l/k} f(S^*)

And rearranging gives f(S_l) \geq (1 - e^{-l/k}) f(S^*).

\square

Setting l=k gives the approximation bound we promised. But note that allowing the greedy algorithm to run longer can give much stronger guarantees, though it requires you to sacrifice the cardinality constraint. 1 - 1/e is about 63%, but doubling the size of S gives about an 86% approximation guarantee. This is great for people in the real world, because you can quantify the gains you’d get by relaxing the constraints imposed on you (which are rarely set in stone).

So this is really great! We have quantifiable guarantees on a stupidly simple algorithm, and the setting is super general. And so if you have your problem and you manage to prove your function is submodular (this is often the hardest part), then you are likely to get this nice guarantee.
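For concreteness, here’s a minimal sketch of the greedy algorithm from the theorem; it works with any monotone submodular f given as a black box (the function name is mine):

def greedyMaximize(f, X, k):
   # at each step, add the element with the largest marginal benefit
   S = set()
   for _ in range(k):
      best = max((x for x in X if x not in S),
                 key=lambda x: f(S | {x}) - f(S))
      S.add(best)
   return S

On the toy coverage instance from earlier, greedyMaximize(f, list(coverage), 2) picks towers 'a' and 'c' and covers all six listeners, which happens to be optimal there; the theorem guarantees that in general you never fall below a 1 - 1/e fraction of the optimum.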

Extensions and Variations

This result on monotone submodular functions is just one part of a vast literature on finding approximation algorithms for submodular functions in various settings. In closing this post we’ll survey some of the highlights and provide references.

What we did in this post was maximize a monotone submodular function subject to a cardinality constraint |S| \leq k. There are three basic variations we could do: we could drop constraints and see whether we can still get guarantees, we could look at minimization instead of maximization, and we could modify the kinds of constraints we impose on the solution.

There are a ton of different kinds of constraints, and we’ll discuss two. The first is where you need to get a certain value f(S) \geq q, and you want to find the smallest set that achieves this value. Laurence Wolsey (who proved a lot of these theorems) showed in 1982 that a slight variant of the greedy algorithm can achieve a set whose size is a multiplicative factor of 1 + \log (\max_x f(\{ x \})) worse than the optimum.

The second kind of constraint is a generalization of a cardinality constraint called a knapsack constraint. This means that each item x \in X has a cost, and you have a finite budget with which to spend on elements you add to S. One might expect this natural extension of the greedy algorithm to work: pick the element which maximizes the ratio of increasing the value of f to the cost (within your available budget). Unfortunately this algorithm can perform arbitrarily poorly, but there are two fun caveats. The first is that if you do both this augmented greedy algorithm and the greedy algorithm that ignores costs, then at least one of these can’t do too poorly. Specifically, one of them has to get at least a 30% approximation. This was shown by Leskovec et al in 2007. The second is that if you’re willing to spend more time in your greedy step by choosing the best subset of size 3, then you can get back to the 1-1/e approximation. This was shown by Sviridenko in 2004.

Now we could try dropping the monotonicity constraint. In this setting cardinality constraints are also superfluous, because it could be that the very large sets have low values. Now it turns out that if f has no other restrictions (in particular, if it’s allowed to be negative), then even telling whether there’s a set S with f(S) > 0 is NP-hard, but the optimum could be arbitrarily large and positive when it exists. But if you require that f is non-negative, then you can get a 1/3-approximation, if you’re willing to add randomness you can get 2/5 in expectation, and with more subtle constraints you can get up to a 1/2 approximation. Anything better is NP-hard. Feige, Mirrokni, and Vondrák have a nice FOCS paper on this.

Next, we could remove the monotonicity property and try to minimize the value of f(S). It turns out that this problem always has an efficient solution, but the only algorithm I have heard of to solve it involves a very sophisticated technique called the ellipsoid algorithm. This is heavily related to linear programming and convex optimization, something which I hope to cover in more detail on this blog.

Finally, there are many interesting variations in the algorithmic procedure. For example, one could require that the elements are provided in some order (the streaming setting), and you have to pick at each step whether to put the element in your set or not. Alternatively, the objective functions might not be known ahead of time and you have to try to pick elements to jointly maximize them as they are revealed. These two settings have connections to bandit learning problems, which we’ve covered before on this blog. See this survey of Krause and Golovin for more on the connections, which also contains the main proof used in this post.

Indeed, despite the fact that many of the big results were proved in the 80’s, the analysis of submodular functions is still a big research topic. There was even a paper posted just the other day on the arXiv about its relation to ad serving! And wouldn’t you know, they proved a (1-1/e)-approximation for their setting. There’s just something about 1-1/e.

Until next time!

Learning to Love Complex Numbers

This post is intended for people with a little bit of programming experience and no prior mathematical background.

So let’s talk about numbers.

Numbers are curious things. On one hand, they represent one of the most natural things known to humans, which is quantity. It’s so natural to humans that even newborn babies are in tune with the difference between quantities of objects between 1 and 3, in that they notice when quantity changes much more vividly than other features like color or shape.

But our familiarity with quantity doesn’t change the fact that numbers themselves (as an idea) are a human invention. And they’re not like most human inventions, the kinds where you have to tinker with gears or circuits to get a machine that makes your cappuccino. No, these are mathematical inventions. These inventions exist only in our minds.

Numbers didn’t always exist. A long time ago, back when the Greek philosophers were doing their philosophizing, negative numbers didn’t exist! In fact, it wasn’t until 1200 AD that the number zero was first considered in Europe. Zero, along with negative numbers and fractions and square roots and all the rest, was invented primarily to help people solve more problems than they could with the numbers they had available. That is, numbers were invented primarily as a way for people to describe their ideas in a useful way. People simply wondered “is there a number whose square gives you 2?” And after a while they just decided there was and called it \sqrt{2} because they didn’t have a better name for it.

But with these new solutions came a host of new problems. You see, although I said mathematical inventions only exist in our minds, once they’re invented they gain a life of their own. You start to notice patterns in your mathematical objects and you have to figure out why they do the things they do. And numbers are a perfectly good example of this: once I notice that I can multiply a number by itself, I can ask how often these “perfect squares” occur. That is, what’s the pattern in the numbers 1^2, 2^2, 3^2, 4^2, \dots? If you think about it for a while, you’ll find that square numbers have a very special relationship with odd numbers.

Other times, however, the things you invent turn out to make no sense at all, and you can prove they never existed in the first place! It’s an odd state of affairs, but we’re going to approach the subject of complex numbers from this mindset. We’re going to come up with a simple idea, the idea that negative numbers can be perfect squares, and explore the world of patterns it opens up. Along the way we’ll do a little bit of programming to help explore, give some simple proofs to solidify our intuition, and by the end we’ll see how these ideas can cause wonderful patterns like this one:

[Image: mandelbrot]

The number i

Let’s bring the story back around to squares. One fact we all remember about numbers is that squaring a number gives you something non-negative. 7^2 = 49, (-2)^2 = 4, 0^2 = 0, and so on. But it certainly doesn’t have to be this way. What if we got sick of that stupid fact and decided to invent a new number whose square was negative? Which negative, you ask? Well it doesn’t really matter, because I can always stretch it larger or smaller so that its square is -1.

Let’s see how: if you say that your made-up number x makes x^2 = -7, then I can just use \frac{x}{\sqrt{7}} to get a number whose square is -1. If you’re going to invent a number that’s supposed to interact with our usual numbers, then you have to be allowed to add, subtract, and multiply x with regular old real numbers, and the usual properties would have to still work. So it would have to be true that (x / \sqrt{7})^2 = x^2 / \sqrt{7}^2 = -7/7 = -1.

So because it makes no difference (this is what mathematicians mean by, “without loss of generality”) we can assume that the number we’re inventing will have a square of negative one. Just to line up with history, let’s call the new number i. So there it is: i exists and i^2 = -1. And now that we are “asserting” that i plays nicely with real numbers, we get these natural rules for adding and subtracting and multiplying and dividing. For example

  • 1 + i is a new number, which we’ll just call 1+i. And if we added two of these together, (1+ i) + (1+i), we can combine the real parts and the i parts to get 2 + 2i. Same goes for subtraction. In general a complex number looks like a + bi, because as we’ll see in the other points you can simplify every simple arithmetic expression down to just one “real number” part and one “real number times i” part.
  • We can multiply 3 \cdot i, and we’ll just call it 3i, and we require that multiplication distributes across addition (that the FOIL rule works). So that, for example, (2 - i)(1 + 3i) = (2 + 6i - i - 3i^2) = (2 + 3) + (6i - i) = (5 + 5i).
  • Dividing is significantly more annoying. Say we want to figure out what 1 / (1+i) is (in fact, it’s not even obvious that this should look like a regular number! But it does). The 1 / a notation just means we’re looking for a number which, when we multiply by the denominator a, we get back to 1. So we’re looking to find out when (a + bi)(1 + i) = 1 + 0i where a and b are variables we’re trying to solve for. If we multiply it out we get (a-b) + (a + b)i = 1 + 0i, and since the real part and the i part have to match up, we know that a - b = 1 and a + b = 0. If we solve these two equations, we find that a = 1/2, b = -1/2 works great. If we want to figure out something like (2 + 3i) / (1 - i), we just find out what 1 / (1- i) is first, and then multiply the result by (2+3i).

So that was tedious and extremely boring, and we imagine you didn’t even read it (that’s okay, it really is boring!). All we’re doing is establishing ground rules for the game, so if you come across some arithmetic that doesn’t make sense, you can refer back to this list to see what’s going on. And once again, for the purpose of this post, we’re asserting that all these laws hold. Maybe some laws follow from others, but as long as we don’t come up with any nasty self-contradictions we’ll be fine.
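If you’d rather let a program do that bookkeeping, here’s the division recipe from the list as a tiny function (a sketch; as we’ll see shortly, Python’s built-in complex type already does this for you):

def divide(a, b, c, d):
   # computes (a + bi) / (c + di) as a pair (real part, i part),
   # solving the same two linear equations as in the example above
   denominator = c*c + d*d
   return ((a*c + b*d) / denominator, (b*c - a*d) / denominator)

print(divide(1, 0, 1, 1))   # (0.5, -0.5), matching 1 / (1 + i)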

And now we turn to the real questions: is i the only square root of -1? Does i itself have a square root? If it didn’t, we’d be back to where we started, with some numbers (the non-i numbers) having square roots while others don’t. And so we’d feel the need to make all the i numbers happy by making up more numbers to be their square roots, and then worrying what if these new numbers don’t have square roots and…gah!

I’ll just let you in on the secret to save us from this crisis. It turns out that i does have a square root in terms of other i numbers, but in order to find it we’ll need to understand i from a different angle, and that angle turns out to be geometry.

Geometry? How is geometry going to help me understand numbers!?

It’s a valid question and part of why complex numbers are so fascinating. And I don’t mean geometry like triangles and circles and parallel lines (though there will be much talk of angles), I mean transformations in the sense that we’ll be “stretching,” “squishing,” and “rotating” numbers. Maybe another time I can tell you why for me “geometry” means stretching and rotating; it’s a long but very fun story.

The clever insight is that you can represent complex numbers as geometric objects in the first place. To do it, you just think of a + bi as a pair of numbers (a,b), (the pair of real part and i part), and then plot that point on a plane. For us, the x-axis will be the “real” axis, and the y-axis will be the i-axis. So the number (3 - 4i) is plotted 3 units in the positive x direction and 4 units in the negative y direction. Like this:

[Image: single-complex-number]

The “j” instead of “i” is not a typo, but a disappointing fact about the programming language we used to make this image. We’ll talk more about why later.

We draw it as an arrow for a good reason. Stretching, squishing, rotating, and reflecting will all be applied to the arrow, keeping its tail fixed at the center of the axes. Sometimes the arrow is called a “vector,” but we won’t use that word because here it’s synonymous with “complex number.”

So let’s get started squishing stuff.

Stretching, Squishing, Rotating

Before we continue I should clear up some names. We call a number that has an i in it a complex number, and we call the part without the i the real part (like 2 in 2-i) and the part with i the complex part.

Python is going to be a great asset for us in exploring complex numbers, so let’s jump right into it. It turns out that Python natively supports complex numbers, and I wrote a program for drawing complex numbers. I used it to make the plot above. The program depends on a library I hate called matplotlib, and so the point of the program is to shield you from as much pain as possible and focus on complex numbers. You can use the program by downloading it from this blog’s Github page, along with everything else I made in writing this post. All you need to know how to do is call a function, and I’ve done a bit of window dressing removal to simplify things (I really hate matplotlib).

Here’s the function header:

# plotComplexNumbers : [complex] -> None
# display a plot of the given list of complex numbers
def plotComplexNumbers(numbers):
   ...

Before we show some examples of how to use it, we have to understand how to use complex numbers in Python. It’s pretty simple, except that Python was written by people who hate math, and so they decided the complex number would be represented by j instead of i (people who hate math are sometimes called “engineers,” and they use j out of spite. Not really, though).

So in Python it’s just like any other computation. For example:

>>> (1 + 1j)*(4 - 2j) == (6+2j)
True
>>> 1 / (1+1j)
(0.5-0.5j)

And so calling the plotting function with a given list of complex numbers is as simple as importing the module and calling the function

from plotcomplex import plot
plot.plotComplexNumbers([(-1+1j), (1+2j), (-1.5 - 0.5j), (.6 - 1.8j)])

Here’s the result

[Image: example-complex-plot]

So let’s use plots like this one to explore what “multiplication by i” does to a complex number. It might not seem exciting at first, but I promise there’s a neat punchline.

Even without plotting it’s pretty easy to tell what multiplying by i does to some numbers. It takes 1 to i, moves i to i^2 = -1, it takes -1 to -i, and -i to -i \cdot i = 1.

What’s the pattern in these? Well, if we plot all these numbers, they’re all at right angles in counter-clockwise order. So this might suggest that multiplication by i does some kind of rotation. Is that always the case? Well, let’s try it with some other more complicated numbers.

Well, it looks close but it’s hard to tell. Some of the axes are squished and stretched, so it might be that our images don’t accurately represent the numbers (the real world can be such a pain). Well when visual techniques fail, we can attempt to prove it.

Clearly multiplying by i does some kind of rotation, maybe with other stuff too, and it shouldn’t be so hard to see that multiplying by i does the same thing no matter which number you use (okay, the skeptical readers will say that’s totally hard to see, but we’ll prove it super rigorously in a minute). So if we take any number and multiply it by i once, then twice, then three times, then four, and if we only get back to where we started at four multiplications, then each rotation had to be a quarter turn.

Indeed,

\displaystyle (a + bi) i^4 = (ai - b) i^3 = (-a - bi) i^2 = (-ai + b) i = a + bi

This still isn’t all that convincing, and we want to be 100% sure we’re right. What we really need is a way to arithmetically compute the angle between two complex numbers in their plotted forms. What we’ll do is find a way to measure the angle of one complex number with the x-axis, and then by subtraction we can get angles between arbitrary points. For example, in the figure below \theta = \theta_1 - \theta_2.

[Image: angle-example]

One way to do this is with trigonometry: the geometric drawing of a + bi is the hypotenuse of a right triangle with the x-axis.

[Image: triangle-example]

And so if r is the length of the arrow, then by the definition of sine and cosine, \cos(\theta) = a/r, \sin(\theta) = b/r. If we have r, \theta, and r > 0, we can solve for a unique a and b, so instead of representing a complex number in terms of the pair of numbers (a,b), we can represent it with the pair of numbers (r, \theta). And the conversion between the two is just

a + bi = r \cos(\theta) + (r \sin(\theta)) i

The (r, \theta) representation is called the polar representation, while the (a,b) representation is called the rectangular representation or the Cartesian representation. Converting between polar and Cartesian coordinates fills the pages of many awful pre-calculus textbooks (despite the fact that complex numbers don’t exist in classical calculus). Luckily for us Python has built-in functions to convert between the two representations for us.

>>> import cmath
>>> cmath.polar(1 + 1j)
(1.4142135623730951, 0.7853981633974483)
>>> z = cmath.polar(1 + 1j)
>>> cmath.rect(z[0], z[1])
(1.0000000000000002+1j)

It’s a little bit inaccurate on the rounding, but it’s fine for our purposes.

So how do we compute the angle between two complex numbers? Just convert each to the polar form, and subtract the second coordinates. So if we get back to our true goal, to figure out what multiplication by i does, we can just do everything in polar form. Here’s a program that computes the angle between two complex numbers.

def angleBetween(z, w):
   zPolar, wPolar = cmath.polar(z), cmath.polar(w)
   return wPolar[1] - zPolar[1]

print(angleBetween(1 + 1j, (1 + 1j) * 1j))
print(angleBetween(2 - 3j, (2 - 3j) * 1j))
print(angleBetween(-0.5 + 7j, (-0.5 + 7j) * 1j))

Running it gives

1.5707963267948966
1.5707963267948966
-4.71238898038469

Note that the decimal form of \pi/2 is 1.57079…, and that the negative angle is equivalent to \pi/2 if you add a full turn of 2\pi to it. So programmatically we can see that for every input we try multiplying by i rotates 90 degrees.

But we still haven’t proved it works. So let’s do that now. To say what the angle is between r \cos (\theta) + ri \sin (\theta) and i \cdot [r \cos (\theta) + ri \sin(\theta)] = -r \sin (\theta) + ri \cos(\theta), we need to transform the second number into the usual polar form (where the i is on the sine part and not the cosine part). But we know, or I’m telling you now, this nice fact about sine and cosine:

\displaystyle \sin(\theta + \pi/2) = \cos(\theta)
\displaystyle \cos(\theta + \pi / 2) = -\sin(\theta)

This fact is maybe awkward to write out algebraically, but it’s just saying that if you shift the whole sine curve a little bit you get the cosine curve, and if you keep shifting it you get the opposite of the sine curve (and if you kept shifting it even more you’d eventually get back to the sine curve; they’re called periodic for this reason).

So immediately we can rewrite the second number as r \cos(\theta + \pi/2) + i r \sin (\theta + \pi/2). The angle is the same as the original angle plus a right angle of \pi/2. Neat!

Applying this same idea to (a + bi) \cdot (c + di), it’s not much harder to prove that multiplying two complex numbers in general multiplies their lengths and adds their angles. So if a complex number z has its magnitude r smaller than 1, multiplying by z squishes and rotates whatever is being multiplied. And if the magnitude is greater than 1, it stretches and rotates. So we have a super simple geometric understanding of how arithmetic with complex numbers works. And as we’re about to see, all this stretching and rotating results in some really weird (and beautifully mysterious!) mathematics and programs.
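We can spot-check that claim numerically before proving anything, using the cmath conversions from before (random test points; a sanity check, not a proof):

import cmath, random

for _ in range(5):
   z = complex(random.uniform(-2, 2), random.uniform(-2, 2))
   w = complex(random.uniform(-2, 2), random.uniform(-2, 2))
   rz, tz = cmath.polar(z)
   rw, tw = cmath.polar(w)
   rp, tp = cmath.polar(z * w)
   # lengths multiply, and angles add up to a multiple of 2*pi
   angleError = (tz + tw - tp) % (2 * cmath.pi)
   print(abs(rp - rz * rw) < 1e-9,
         min(angleError, 2 * cmath.pi - angleError) < 1e-9)

Every line printed should be a pair of Trues.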

But before we do that we still have one question to address, the question that started this whole geometric train of thought: does i have a square root? Indeed, I’m just looking for a number such that, when I square its length and double its angle, I get i = \cos(\pi/2) + i \sin(\pi/2). So the angle we want is \pi/4, and the length we want is r = 1, which means \sqrt{i} = \cos(\pi/4) + i \sin(\pi/4). Sweet! There is another root if you play with the signs; see if you can figure it out.
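And we can have Python confirm the root we just derived, up to floating point rounding:

import cmath

root = cmath.rect(1, cmath.pi / 4)    # length 1, angle pi/4
print(cmath.isclose(root ** 2, 1j))   # True: squaring gives back i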

In fact there’s a deeper and more beautiful theorem (“theorem” means “really important fact”) called the fundamental theorem of algebra. And essentially it says that the complex numbers are complete. That is, we can always find square roots, cube roots, or any other roots of numbers involving i. It actually says a lot more, but it’s easier to appreciate the rest of it after you do more math than we’re going to do in this post.

On to pretty patterns!

The Fractal

So here’s a little experiment. Since every point in the plane is the end of some arrow representing a complex number, we can imagine transforming the entire complex plane by transforming each number by the same rule. The most interesting simple rule we can think of: squaring! So though it might strain your capacity for imagination, try to visualize the idea like this. Squaring a complex number is the same as squaring its length and doubling its angle. So imagine: any numbers whose arrows are longer than 1 will grow much bigger, arrows shorter than 1 will shrink, and arrows of length exactly one will stay the same length (arrows close to length 1 will grow/shrink much more slowly than those far away from 1). And complex numbers with small positive angles will increase their angle, but only a bit, while larger angles will grow faster.

Here’s an animation made by Douglas Arnold showing what happens to the set of complex numbers a + bi with 0 \leq a, b \leq 1 or -1 < a,b < 0. Again, imagine every point is the end of a different arrow for the corresponding complex number. The animation is for a single squaring, and the points move along the arc they would travel if one rotated/stretched them smoothly.

[Image: complex-squaring]

So that’s pretty, but this is by all accounts a well-behaved transformation. It’s “predictable,” because for example we can always tell which complex numbers will get bigger and bigger (in length) and which will get smaller.

What if, just for the sake of tinkering, we changed the transformation a little bit? That is, instead of sending z = a+bi to z^2 (I’ll often write this z \mapsto z^2), what if we sent

\displaystyle z \mapsto z^2 + 1

Now it’s not so obvious: which numbers will grow and which will shrink? Notice that it’s odd because adding 1 only changes the real part of the number. So a number whose length is greater than 1 can become small under this transformation. For example, i is sent to 0, so something slightly larger would also be close to zero. Indeed, 5i/4 \mapsto -9/16.

So here’s an interesting question: are there any complex numbers that will stay small even if I keep transforming like this forever? Specifically, if I call f(z) = z^2 + 1, and I call f^2(z) = f(f(z)), and likewise call f^k(z) for k repeated transformations of z, is there a number z so that for every k, the length |f^k(z)| < 2? “Obvious” choices like z=0 don’t work, and neither do random guesses like z=i or z=1. So should we guess the answer is no?

Before we jump to conclusions let’s write a program to see what happens for more than our random guesses. The program is simple: we’ll define the “square plus one” function, and then repeatedly apply that function to a number for some long number of times (say, 250 times). If the length of the number stays under 2 after so many tries, we’ll call it “small forever,” and otherwise we’ll call it “not small forever.”

def squarePlusOne(z):
   return z*z + 1

def isSmallForever(z, f):
   k = 0

   while abs(z) < 2:
      z = f(z)
      k += 1

      if k > 250:
         return True

   return False

This isSmallForever function is generic: you can give it any function f and it will repeatedly call f on z until the result grows bigger than 2 in length. Note that the abs function is a built-in Python function for computing the length of a complex number.

Then I wrote a classify function, which you can give a window and a small increment, and it will produce a grid of zeros and ones marking the results of isSmallForever. The details of the function are not that important. I also wrote a function that turns the grid into a picture. So here’s an example of how we’d use it:

from plotcomplex.plot import gridToImage

def classifySquarePlusOne(z):
   return isSmallForever(z, squarePlusOne)

grid = classify(classifySquarePlusOne) # the other arguments are defaulted to [-2,2], [-2,2], 0.1
gridToImage(grid)

And here’s the result. Points colored black grow beyond 2, and white points stay small for the whole test.

Looks like they’ll always grow big.

So it looks like repeated squaring plus one will always make complex numbers grow big. That’s not too exciting, but we can always make it more exciting. What happens if we replace the 1 in z^2 + 1 with a different complex number? For example, if we do z^2 - 1 then will things always grow big?

You can randomly guess and see that 0 will never grow big, because 0^2 - 1 = -1 and (-1)^2 - 1 = 0. It will just oscillate forever. So with -1 some numbers will grow and some will not! Let’s use the same routine above to see which:

def squareMinusOne(z):
   return z*z - 1

def classifySquareMinusOne(z):
   return isSmallForever(z, squareMinusOne)

grid = classify(classifySquareMinusOne)
gridToImage(grid)

And the result:

[Image: second-attempt]

Now that’s a more interesting picture! Let’s ramp up the resolution

grid = classify(classifySquareMinusOne, step=0.001)
gridToImage(grid)

[Image: second-attempt-zoomed]

Gorgeous. If you try this at home you’ll notice, however, that this took a hell of a long time to run. Speeding up our programs is very possible, but it’s a long story for another time. For now we can just be patient.

Indeed, this image has a ton of interesting details! It looks almost circular in the middle, but if we zoom in we can see that it’s more like a rippling wave

[Image: second-attempt-zoomed2]

It’s pretty incredible, and a huge question is jumping out at me: what the heck is causing this pattern to occur? What secret does -1 know that +1 doesn’t that makes the resulting pattern so intricate?

But an even bigger question is this. We just discovered that some values of c make z \mapsto z^2 + c result in interesting patterns, and some do not! So the question is which ones make interesting patterns? Even if we just, say, fix the starting point to zero: what is the pattern in the complex numbers that would tell me when this transformation makes zero blow up, and when it keeps zero small?

Sounds like a job for another program. This time we’ll use a nice little Python feature called a closure, in which we define a function that saves the information that exists when it’s created for later use. It will let us write a function that takes in c and produces a function that transforms according to z \mapsto z^2+c.

def squarePlusC(c):
   def f(z):
      return z*z + c

   return f

And we can use the very same classification/graphing function from before to do this.

def classifySquarePlusC(c):
   return isSmallForever(0, squarePlusC(c))

grid = classify(classifySquarePlusC, xRange=(-2, 1), yRange=(-1, 1), step=0.005)
gridToImage(grid)

And the result:

[Image: mandelbrot]

Stunning. This wonderful pattern, which is still largely not understood today, is known as the Mandelbrot set. That is, the white points are the points in the Mandelbrot set, and the black points are not in it. The detail on the border of this thing is infinitely intricate. For example, we can change the window in our little program to zoom in on a particular region.

[Image: mandelbrot-zoomed]

And if you keep zooming in you keep getting more and more detail. This was true of the specific case of z^2 - 1, but somehow the patterns in the Mandelbrot set are much more varied and interesting. And if you keep going down eventually you’ll see patterns that look like the original Mandelbrot set. We can already kind of see that happening above. The name for this idea is a fractal, and the z^2 - 1 image has it too. Fractals are a fascinating and mysterious subject studied in a field called discrete dynamical systems. Many people dedicate their entire lives to studying these things, and it’s for good reason. There’s a lot to learn and even more that’s unknown!

So this is the end of our journey for now. I’ve posted all of the code we used in the making of this post so you can continue to play, but here are some interesting ideas.

  • The Mandelbrot set (and most fractals) are usually colored. The way they’re colored is as follows. Rather than just say true or false when zero blows up beyond 2 in length, you return the number of iterations k that happened, and then you pick a color based on how big k is (a sketch of this count appears after this list). There’s a link below that lets you play with this. In fact, adding colors shows that there is even more intricate detail happening outside the Mandelbrot set that’s too faint to see in our pictures above. Such as this.
  • Some very simple questions about fractals are very hard to answer. For example, is the Mandelbrot set connected? That is, is it possible to “walk” from every point in the Mandelbrot set to every other point without leaving the set? Despite the scattering of points in the zoomed in picture above that suggest the answer is no, the answer is actually yes! This is a really difficult thing to prove, however.
  • The patterns in many fractals are often used to generate realistic looking landscapes and generate pseudo randomness. So fractals are not just mathematical curiosities.
  • You should definitely be experimenting with this stuff! What happens if you change the length threshold from 2 to some bigger number? What about a smaller number? What if you do powers different than 2? There’s so much to explore!
  • The big picture thing to take away from this is that it’s not the numbers themselves that are particularly interesting, it’s the transformations of the numbers that generate these patterns! The interesting questions are what kinds of things are the same under these transformations, and what things are different. This is a very general idea in mathematics, and the more math you do the more you’ll find yourself wondering about useful and bizarre transformations.
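Here’s what the counting from the first bullet point looks like in code, as a small modification of isSmallForever (the function name is mine):

def escapeTime(z, f, maxSteps=250):
   # count the steps before the orbit of z under f leaves the radius-2 disk;
   # returning maxSteps means the point never escaped during the test
   for k in range(maxSteps):
      if abs(z) >= 2:
         return k
      z = f(z)
   return maxSteps

A coloring scheme then maps each count to a color, usually reserving one dedicated color for the points that never escape.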

For the chance to keep playing with the Mandelbrot set, check out this Mandelbrot grapher that works in your browser. It lets you drag rectangles to zoom further in on regions of interest. It’s really fun.

Until next time!

Simulating a Biased Coin with a Fair Coin

This is a guest post by my friend and colleague Adam Lelkes. Adam’s interests are in algebra and theoretical computer science. This gem came up because Adam gave a talk on probabilistic computation in which he discussed this technique.

Problem: simulate a biased coin using a fair coin.

Solution: (in Python)

def biasedCoin(binaryDigitStream, fairCoin):
   for d in binaryDigitStream:
      if fairCoin() != d:
         return d

Discussion: This function takes two arguments, an iterator representing the binary expansion of the intended probability of getting 1 (let us denote it as p) and another function that returns 1 or 0 with equal probability. At first glance this might seem like an overcomplicated way of solving this problem: why can’t the probability be a floating point number?

The point is that p can have infinite precision! Assuming that fairCoin() gives us a perfectly random stream of 1’s and 0’s (independently and with probability 1/2) and we can read each bit of the binary expansion of p, this function returns 1 with probability exactly p even if p is irrational or a fraction with infinite decimal expansion. If we used floating point arithmetic there would be a small chance we get unlucky and exhaust the precision available. We would only get an approximation of the true bias at best.

Now let us explain why this algorithm works. We keep tossing our fair coins to get a sequence of random bits, until one of our random bits is different from the corresponding bit in the binary expansion of p. If we stop after i steps, that means that the first i-1 bits in the two binary sequences were the same and the ith bits differed, which happens with probability \frac{1}{2^{i-1}} \cdot \frac{1}{2} = \frac{1}{2^i}. When this happens we return the ith bit of p; let us denote this bit by p_i. Then the probability of returning 1 is \sum_{i=1}^\infty \frac{p_i}{2^i}, which is exactly the binary expansion of p.

This algorithm is also efficient. By efficient here we mean that the expected running time is constant. Of course, to show this we need to make some assumption about the computational complexity of calculating the bits of p. If we assume that the bits of p are efficiently computable in the sense that the time required to compute p_i is bounded by a polynomial in i, then this algorithm does run in constant expected time.

Indeed, if computing p_i takes time at most i^n for some fixed n, the expected running time is bounded by a constant times \sum_{i=0}^\infty \frac{i^n}{2^i}. Showing that this sum is a constant is an easy calculus exercise: using the ratio test we get that

\displaystyle \limsup_{i \to \infty} \left | \frac{\frac{(i+1)^n}{2^{i+1}}}{\frac{i^n}{2^i}} \right | = \limsup_{i\to\infty} \frac{\left(\frac{i+1}{i}\right)^n}{2} = \frac{1}{2} < 1,

thus the series is convergent.

Now that we proved that our algorithm works, it’s time to try it! Let’s say that we want to simulate a coin which gives “heads” with probability 1/3.
We need to construct our binary digit stream. Since 1/3 is 0.010101… in binary, we could use the following simple generator:

def oneThird():
   while True:
      yield 0
      yield 1

However, we might want to have a more general generator that gives us the binary representation of any number. The following function, which takes a number between 0 and 1 as its argument, does the job:

def binaryDigits(fraction):
   while True:
      fraction *= 2
      yield int(fraction)
      fraction = fraction % 1

We also need a fair coin simulator. For this simulation, let’s just use Python’s built-in pseudo-random number generator:

import random

def fairCoin():
   return random.choice([0,1])

Let us toss our biased coin 10000 times and take the sum. We expect the sum to be around 3333. Indeed, when I tried

>>> sum(biasedCoin(oneThird(), fairCoin) for i in range(10000))
3330

It might be worth noting that oneThird() is approximately ten times faster than binaryDigits(fractions.Fraction(1, 3)), so when a large number of biased coins is needed, you can hardwire the binary representation of p into the program.
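For example, any purely periodic expansion can be hardwired with a tiny generic helper (a sketch using itertools):

import itertools

def periodicBits(*bits):
   # e.g. periodicBits(0, 1) yields 0, 1, 0, 1, ... which is 1/3 in binary
   return itertools.cycle(bits)

Then biasedCoin(periodicBits(0, 1), fairCoin) behaves just like the oneThird version.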