The Complexity of Communication

One of the most interesting questions posed in the last thirty years of computer science is to ask how much “information” must be communicated between two parties in order for them to jointly compute something. One can imagine these two parties living on distant planets, so that the cost of communicating any amount of information is very expensive, but each person has an integral component of the answer that the other does not.

Since this question was originally posed by Andrew Yao in 1979, it has led to a flurry of applications in many areas of mathematics and computer science. In particular it has become a standard tool for proving lower bounds in many settings such as circuit design and streaming algorithms. And if there’s anything theory folks love more than a problem that can be solved by an efficient algorithm, it’s a proof that a problem cannot be solved by any efficient algorithm (that’s what I mean by “lower bound”).

Despite its huge applicability, the basic results in this area are elementary. In this post we’ll cover those basics, but once you get past these basic ideas and their natural extensions you quickly approach the state of the art and open research problems. Attempts to tackle these problems in recent years have used sophisticated techniques in Fourier analysis, Ramsey theory, and geometry. This makes it a very fun and exciting field.

As a quick side note before we start, the question we’re asking is different from the one of determining the information content of a specific message. That is the domain of information theory, which was posed (and answered) decades earlier. Here we’re trying to determine the complexity of a problem, where more complex messages require more information about their inputs.

The Basic Two-Player Model

The most basic protocol is simple enough to describe over a dinner table. Alice and Bob each have one piece of information $x,y$, respectively, say they each have a number. And together they want to compute some operation that depends on both their inputs, for example whether $x > y$. But in the beginning Alice has access only to her number $x$, and knows nothing about $y$. So Alice sends Bob a few bits. Depending on the message Bob computes something and replies, and this repeats until they have computed an answer. The question is: what is the minimum number of bits they need to exchange in order for both of them to be able to compute the right answer?

There are a few things to clarify here: we’re assuming that Alice and Bob have agreed on a protocol for sending information before they ever saw their individual numbers. So imagine ten years earlier Alice and Bob were on the same planet, and they agreed on the rules they’d follow for sending/replying information once they got their numbers. In other words, we’re making a worst-case assumption on Alice and Bob’s inputs, and as usual it will be measured as a function of $n$, the lengths of their inputs. Then we take a minimum (asymptotically) over all possible protocols they could follow, and this value is the “communication complexity” of the problem. Computing the exact communication complexity of a given problem is no simple task, since there’s always the nagging question of whether there’s some cleverer protocol than the one you came up with. So most of the results are bounds on the communication complexity of a problem.

Indeed, we can give our first simple bound for the “$x$ greater than $y$” problem we posed above. Say the strings $x,y$ both have $n$ bits. What Alice does is send her entire string $x$ to Bob, and Bob then computes the answer and sends the answer bit back to Alice. This requires $n + 1$ bits of communication. This proves that the communication complexity of “$x > y$” is bounded from above by $n+1$. A much harder question is, can we do any better?

To make any progress on upper or lower bounds we need to be a bit more formal about the communication model. Basically, the useful analysis happens when the players alternate sending single bits, and this is only off by small constant factors from a more general model. This is the asymptotic analysis, that we only distinguish between things like linear complexity $O(n)$ versus sublinear options like $\log(n)$ or $\sqrt{n}$ or even constant complexity $O(1)$. Indeed, the protocol we described for $x > y$ is the stupidest possible protocol for the problem, and it’s actually valid for any problem. For this problem it happens to be optimal, but we’re just trying to emphasize that nontrivial bounds are all sub-linear in the size of the inputs.

On to the formal model.

Definition: player is a computationally unbounded Turing machine.

And we really mean unbounded. Our players have no time or space constraints, and if they want they can solve undecidable problems like the halting problem or computing Kolmogorov complexity. This is to emphasize that the critical resource is the amount of communication between players. Moreover, it gives us a hint that lower bounds in this model won’t come form computational intractability, but instead will be “information-theoretic.”

Definition: Let $\Sigma^*$ be the set of all binary strings. A communication protocol is a pair of functions $A,B: \Sigma^* \times \Sigma^* \to \{ 0,1 \}$.

The input to these functions $A(x, h)$ should be thought of as follows: $x$ is the player’s secret input and $h$ is the communication history so far. The output is the single bit that they will send in that round (which can be determined by the length of $h$ since only one bit is sent in each round). The protocol then runs by having Alice send $b_1 = A(x, \{ \})$ to Bob, then Bob replies with $b_2 = B(y, b_1)$, Alice continues with $b_3 = A(x, b_1b_2)$, and so on. We implicitly understand that the content of a communication protocol includes a termination condition, but we’ll omit this from the notation. We call the length of the protocol the number of rounds.

Definition: A communication protocol $A,B$ is said to be valid for a boolean function $f(x,y)$ if for all strings $x, y$, the protocol for $A, B$ terminates on some round $t$ with $b_t = 1$ if and only if $f(x,y) = 1$.

So to define the communication complexity, we let the function $L_{A,B}(n)$ be the maximum length of the protocol $A, B$ when run on strings of length $n$ (the worst-case for a given input size). Then the communication complexity of a function $f$ is the minimum of $L_{A,B}$ over all valid protocols $A, B$. In symbols,

$\displaystyle CC_f(n) = \min_{A,B \textup{ is valid for } f} L_{A,B}(n)$

We will often abuse the notation by writing the communication complexity of a function as $CC(f)$, understanding that it’s measured asymptotically as a function of $n$.

Matrices and Lower Bounds

Let’s prove a lower bound, that to compute the equality function you need to send a linear number of bits in the worst case. In doing this we’ll develop a general algebraic tool.

So let’s write out the function $f$ as a binary matrix $M(f)$ in the following way. Write all $2^n$ inputs of length $n$ in some fixed order along the rows and columns of the matrix, and let entry $i,j$ be the value of $f(i,j)$. For example, the 6-bit function $f$ which computes whether the majority of the two player’s bits are ones looks like this:

The key insight to remember is that if the matrix of a function has a nice structure, then one needs very little communication to compute it. Let’s see why.

Say in the first round the row player sends a bit $b$. This splits the matrix into two submatrices $A_0, A_1$ by picking the rows of $A_0$ to be those inputs for which the row player sends a $b=0$, and likewise for $A_1$ with $b=1$. If you’re willing to rearrange the rows of the matrix so that $A_0$ and $A_1$ stack on top of each other, then this splits the matrix into two rectangles. Now we can switch to the column player and see which bit he sends in reply to each of the possible choices for $b$ (say he sends back $b'$). This separately splits each of $A_0, A_1$ into two subrectangles corresponding to which inputs for the column player make him send the specific value of $b'$. Continuing in this fashion we recurse until we find a submatrix consisting entirely of ones or entirely of zeros, and then we can say with certainty what the value of the function $f$ is.

It’s difficult to visualize because every time we subdivide we move around the rows and columns within the submatrix corresponding to the inputs for each player. So the following would be a possible subdivision of an 8×8 matrix (with the values in the rectangles denoting which communicated bits got you there), but it’s sort of a strange one because we didn’t move the inputs around at all. It’s just a visual aid.

If we do this for $t$ steps we get $2^t$ subrectangles. A crucial fact is that any valid communication protocol for a function has to give a subdivision of the matrix where all the rectangles are constant. or else there would be two pairs of inputs $(x,y), (x', y')$, which are labeled identically by the communication protocol, but which have different values under $f$.

So naturally one expects the communication complexity of $f$ would require as many steps as there are steps in the best decomposition, that is, the decomposition with the fewest levels of subdivision. Indeed, we’ll prove this and introduce some notation to make the discourse less clumsy.

Definition: For an $m \times n$ matrix $M$, a rectangle is a submatrix $A \times B$ where $A \subset \{ 1, \dots m \}, B \subset \{ 1, \dots, n \}$. A rectangle is called monochromatic if all entires in the corresponding submatrix $\left.M\right|_{A \times B}$ are the same. A monochromatic tiling of $M$ is a partition of $M$ into disjoint monochromatic rectangles. Define $\chi(f)$ to be the minimum number of rectangles in any monochromatic tiling of $M(f)$.

As we said, if there are $t$ steps in a valid communication protocol for $f$, then there are $2^t$ rectangles in the corresponding monochromatic tiling of $M(f)$. Here is an easy consequence of this.

Proposition: If $f$ has communication complexity $CC(f)$, then there is a monochromatic tiling of $M(f)$ with at most $2^{CC(f)}$ rectangles. In particular, $\log(\chi(f)) \leq CC(f)$.

Proof. Pick any protocol that achieves the communication complexity of $f$, and apply the process we described above to subdivide $M(f)$. This will take exactly $CC(f)$, and produce no more than $2^{CC(f)}$ rectangles.

$\square$

This already gives us a bunch of theorems. Take the EQ function, for example. Its matrix is the identity matrix, and it’s not hard to see that every monochromatic tiling requires $2^n$ rectangles, one for each entry of the diagonal. I.e., $CC(EQ) \geq n$. But we already know that one player can just send all his bits, so actually $CC(EQ) = \Theta(n)$. Now it’s not always so easy to compute $\chi(f)$. The impressive thing to do is to use efficiently computable information about $M(f)$ to give bounds on $\chi(f)$ and hence on $CC(f)$. So can we come up with a better lower bound that depends on something we can compute? The answer is yes.

Theorem: For every function $f$, $\chi(f) \geq \textup{rank }M(f)$.

Proof. This just takes some basic linear algebra. One way to think of the rank of a matrix $A$ is as the smallest way to write $A$ as a linear combination of rank 1 matrices (smallest as in, the smallest number of terms needed to do this). The theorem is true no matter which field you use to compute the rank, although in this proof and in the rest of this post we’ll use the real numbers.

If you give me a monochromatic tiling by rectangles, I can view each rectangle as a matrix whose rank is at most one. If the entries are all zeros then the rank is zero, and if the entries are all ones then (using zero elsewhere) this is by itself a rank 1 matrix. So adding up these rectangles as separate components gives me an upper bound on the rank of $A$. So the minimum way to do this is also an upper bound on the rank of $A$.

$\square$

Now computing something like $CC(EQ)$ is even easier, because the rank of $M(EQ) = M(I_{2^n})$ is just $2^n$.

Upper Bounds

There are other techniques to show lower bounds that are stronger than the rank and tiling method (because they imply the rank and tiling method). See this survey for a ton of details. But I want to discuss upper bounds a bit, because the central open conjecture in communication complexity is an upper bound.

The Log-Rank Conjecture: There is a universal constant $c$, such that for all $f$, the communication complexity $CC(f) = O((\log \textup{rank }M(f))^c)$.

All known examples satisfy the conjecture, but unfortunately the farthest progress toward the conjecture is still exponentially worse than the conjecture’s statement. In 1997 the record was due to Andrei Kotlov who proved that $CC(f) \leq \log(4/3) \textup{rank }M(f)$. It was not until 2013 that any (unconditional) improvements were made to this, when Shachar Lovett proved that $CC(f) = O(\sqrt{\textup{rank }M(f)} \cdot \log \textup{rank }M(f))$.

The interested reader can check out this survey of Shachar Lovett from earlier this year (2014) for detailed proofs of these theorems and a discussion of the methods. I will just discuss one idea from this area that ties in nicely with our discussion: which is that finding an efficient communication protocol for a low-rank function $f$ reduces to finding a large monochromatic rectangle in $M(f)$.

Theorem [Nisan-Wigderson 94]: Let $c(r)$ be a function. Suppose that for any function $f: X \times Y \to \{ 0,1 \}$, we can find a monochromatic rectangle of size $R \geq 2^{-c(r)} \cdot | X \times Y |$ where $r = \textup{rank }M(f)$. Then any such $f$ is computable by a deterministic protocol with communication complexity.

$\displaystyle O \left ( \log^2(r) + \sum_{i=0}^{\log r} c(r/2^i) \right )$

Just to be concrete, this says that if $c(r)$ is polylogarithmic, then finding these big rectangles implies a protocol also with polylogarithmic complexity. Since the complexity of the protocol is a function of $r$ alone, the log-rank conjecture follows as a consequence. The best known results use the theorem for larger $c(r) = r^b$ for some $b < 1$, which gives communication complexity also $O(r^b)$.

The proof of the theorem is detailed, but mostly what you’d expect. You take your function, split it up into the big monochromatic rectangle and the other three parts. Then you argue that when you recurse to one of the other three parts, either the rank is cut in half, or the size of the matrix is much smaller. In either case, you can apply the theorem once again. Then you bound the number of leaves in the resulting protocol tree by looking at each level $i$ where the rank has dropped to $r/2^i$. For the full details, see page 4 of the Shachar survey.

Multiple Players and More

In the future we’ll cover some applications of communication complexity, many of which are related to computing in restricted models such as parallel computation and streaming computation. For example, in parallel computing you often have processors which get arbitrary chunks of data as input and need to jointly compute something. Lower bounds on the communication complexity can help you prove they require a certain amount of communication in order to do that.

But in these models there are many players. And the type of communication matters: it can be point-to-point or broadcast, or something more exotic like MapReduce. So before we can get to these applications we need to define and study the appropriate generalizations of communication complexity to multiple interacting parties.

Until then!

Learning to Love Complex Numbers

This post is intended for people with a little bit of programming experience and no prior mathematical background.

Numbers are curious things. On one hand, they represent one of the most natural things known to humans, which is quantity. It’s so natural to humans that even newborn babies are in tune with the difference between quantities of objects between 1 and 3, in that they notice when quantity changes much more vividly than other features like color or shape.

But our familiarity with quantity doesn’t change the fact that numbers themselves (as an idea) are a human invention. And they’re not like most human inventions, the kinds where you have to tinker with gears or circuits to get a machine that makes your cappuccino. No, these are mathematical inventions. These inventions exist only in our minds.

Numbers didn’t always exist. A long time ago, back when the Greeks philosophers were doing their philosophizing, negative numbers didn’t exist! In fact, it wasn’t until 1200 AD that the number zero was first considered in Europe. Zero, along with negative numbers and fractions and square roots and all the rest, were invented primarily to help people solve more problems than they could with the numbers they had available. That is, numbers were invented primarily as a way for people to describe their ideas in a useful way. People simply  wondered “is there a number whose square gives you 2?” And after a while they just decided there was and called it $\sqrt{2}$ because they didn’t have a better name for it.

But with these new solutions came a host of new problems. You see, although I said mathematical inventions only exist in our minds, once they’re invented they gain a life of their own. You start to notice patterns in your mathematical objects and you have to figure out why they do the things they do. And numbers are a perfectly good example of this: once I notice that I can multiply a number by itself, I can ask how often these “perfect squares” occur. That is, what’s the pattern in the numbers $1^2, 2^2, 3^2, 4^2, \dots$? If you think about it for a while, you’ll find that square numbers have a very special relationship with odd numbers.

Other times, however, the things you invent turn out to make no sense at all, and you can prove they never existed in the first place! It’s an odd state of affairs, but we’re going to approach the subject of complex numbers from this mindset. We’re going to come up with a simple idea, the idea that negative numbers can be perfect squares, and explore the world of patterns it opens up. Along the way we’ll do a little bit of programming to help explore, give some simple proofs to solidify our intuition, and by the end we’ll see how these ideas can cause wonderful patterns like this one:

The number i

Let’s bring the story back around to squares. One fact we all remember about numbers is that squaring a number gives you something non-negative. $7^2 = 49, (-2)^2 = 4, 0^2 = 0$, and so on. But it certainly doesn’t have to be this way. What if we got sick of that stupid fact and decided to invent a new number whose square was negative? Which negative, you ask? Well it doesn’t really matter, because I can always stretch it larger or smaller so that it’s square is -1.

Let’s see how: if you say that your made-up number $x$ makes $x^2 = -7$, then I can just use $\frac{x}{\sqrt{7}}$ to get a number whose square is -1. If you’re going to invent a number that’s supposed to interact with our usual numbers, then you have to be allowed to add, subtract, and multiply $x$ with regular old real numbers, and the usual properties would have to still work. So it would have to be true that $(x / \sqrt{7})^2 = x^2 / \sqrt{7}^2 = -7/7 = -1$.

So because it makes no difference (this is what mathematicians mean by, “without loss of generality”) we can assume that the number we’re inventing will have a square of negative one. Just to line up with history, let’s call the new number $i$. So there it is: $i$ exists and $i^2 = -1$. And now that we are “asserting” that $i$ plays nicely with real numbers, we get these natural rules for adding and subtracting and multiplying and dividing. For example

• $1 + i$ is a new number, which we’ll just call $1+i$. And if we added two of these together, $(1+ i) + (1+i)$, we can combine the real parts and the $i$ parts to get $2 + 2i$. Same goes for subtraction. In general a complex number looks like $a + bi$, because as we’ll see in the other points you can simplify every simple arithmetic expression down to just one “real number” part and one “real number times $i$” part.
• We can multiply $3 \cdot i$, and we’ll just call it $3i$, and we require that multiplication distributes across addition (that the FOIL rule works). So that, for example, $(2 - i)(1 + 3i) = (2 + 6i - i - 3i^2) = (2 + 3) + (6i - i) = (5 + 5i)$.
• Dividing is a significantly more annoying. Say we want to figure out what $1 / (1+i)$ is (in fact, it’s not even obvious that this should look like a regular number! But it does). The $1 / a$ notation just means we’re looking for a number which, when we multiply by the denominator $a$, we get back to 1. So we’re looking to find out when $(a + bi)(1 + i) = 1 + 0i$ where $a$ and $b$ are variables we’re trying to solve for. If we multiply it out we get $(a-b) + (a + b)i = 1 + 0i$, and since the real part and the $i$ part have to match up, we know that $a - b = 1$ and $a + b = 0$. If we solve these two equations, we find that $a = 1/2, b = -1/2$ works great. If we want to figure out something like $(2 + 3i) / (1 - i)$, we just find out what $1 / (1- i)$ is first, and then multiply the result by $(2+3i)$.

So that was tedious and extremely boring, and we imagine you didn’t even read it (that’s okay, it really is boring!). All we’re doing is establishing ground rules for the game, so if you come across some arithmetic that doesn’t make sense, you can refer back to this list to see what’s going on. And once again, for the purpose of this post, we’re asserting that all these laws hold. Maybe some laws follow from others, but as long as we don’t come up with any nasty self-contradictions we’ll be fine.

And now we turn to the real questions: is $i$ the only square root of -1? Does $i$ itself have a square root? If it didn’t, we’d be back to where we started, with some numbers (the non-$i$ numbers) having square roots while others don’t. And so we’d feel the need to make all the $i$ numbers happy by making up more numbers to be their square roots, and then worrying what if these new numbers don’t have square roots and…gah!

I’ll just let you in on the secret to save us from this crisis. It turns out that $i$ does have a square root in terms of other $i$ numbers, but in order to find it we’ll need to understand $i$ from a different angle, and that angle turns out to be geometry.

Geometry? How is geometry going to help me understand numbers!?

It’s a valid question and part of why complex numbers are so fascinating. And I don’t mean geometry like triangles and circles and parallel lines (though there will be much talk of angles), I mean transformations in the sense that we’ll be “stretching,” “squishing,” and “rotating” numbers. Maybe another time I can tell you why for me “geometry” means stretching and rotating; it’s a long but very fun story.

The clever insight is that you can represent complex numbers as geometric objects in the first place. To do it, you just think of $a + bi$ as a pair of numbers $(a,b)$, (the pair of real part and $i$ part), and then plot that point on a plane. For us, the $x$-axis will be the “real” axis, and the $y$-axis will be the $i$-axis. So the number $(3 - 4i)$ is plotted 3 units in the positive $x$ direction and 4 units in the negative $y$ direction. Like this:

The “j” instead of “i” is not a typo, but a disappointing fact about the programming language we used to make this image. We’ll talk more about why later.

We draw it as an arrow for a good reason. Stretching, squishing, rotating, and reflecting will all be applied to the arrow, keeping its tail fixed at the center of the axes. Sometimes the arrow is called a “vector,” but we won’t use that word because here it’s synonymous with “complex number.”

So let’s get started squishing stuff.

Stretching, Squishing, Rotating

Before we continue I should clear up some names. We call a number that has an $i$ in it a complex number, and we call the part without the $i$ the real part (like 2 in $2-i$) and the part with $i$ the complex part.

Python is going to be a great asset for us in exploring complex numbers, so let’s jump right into it. It turns out that Python natively supports complex numbers, and I wrote a program for drawing complex numbers. I used it to make the plot above. The program depends on a library I hate called matplotlib, and so the point of the program is to shield you from as much pain as possible and focus on complex numbers. You can use the program by downloading it from this blog’s Github page, along with everything else I made in writing this post. All you need to know how to do is call a function, and I’ve done a bit of window dressing removal to simplify things (I really hate matplotlib).

# plotComplexNumbers : [complex] -&gt; None
# display a plot of the given list of complex numbers
def plotComplexNumbers(numbers):
...


Before we show some examples of how to use it, we have to understand how to use complex numbers in Python. It’s pretty simple, except that Python was written by people who hate math, and so they decided the complex number would be represented by $j$ instead of $i$ (people who hate math are sometimes called “engineers,” and they use $j$ out of spite. Not really, though).

So in Python it’s just like any other computation. For example:

>>> (1 + 1j)*(4 - 2j) == (6+2j)
True
>>> 1 / (1+1j)
(0.5-0.5j)

And so calling the plotting function with a given list of complex numbers is as simple as importing the module and calling the function

from plotcomplex import plot
plot.plotComplexNumbers([(-1+1j), (1+2j), (-1.5 - 0.5j), (.6 - 1.8j)])


Here’s the result

So let’s use plots like this one to explore what “multiplication by $i$” does to a complex number. It might not seem exciting at first, but I promise there’s a neat punchline.

Even without plotting it’s pretty easy to tell what multiplying by $i$ does to some numbers. It takes 1 to $i$, moves $i$ to $i^2 = -1$, it takes -1 to $-i$, and $-i$ to $-i \cdot i = 1$.

What’s the pattern in these? well if we plot all these numbers, they’re all at right angles in counter-clockwise order. So this might suggest that multiplication by $i$ does some kind of rotation. Is that always the case? Well lets try it with some other more complicated numbers. Click the plots below to enlarge.

Well, it looks close but it’s hard to tell. Some of the axes are squished and stretched, so it might be that our images don’t accurately represent the numbers (the real world can be such a pain). Well when visual techniques fail, we can attempt to prove it.

Clearly multiplying by $i$ does some kind of rotation, maybe with other stuff too, and it shouldn’t be so hard to see that multiplying by $i$ does the same thing no matter which number you use (okay, the skeptical readers will say that’s totally hard to see, but we’ll prove it super rigorously in a minute). So if we take any number and multiply it by $i$ once, then twice, then three times, then four, and if we only get back to where we started at four multiplications, then each rotation had to be a quarter turn.

Indeed,

$\displaystyle (a + bi) i^4 = (ai - b) i^3 = (-a - bi) i^2 = (-ai + b) i = a + bi$

This still isn’t all that convincing, and we want to be 100% sure we’re right. What we really need is a way to arithmetically compute the angle between two complex numbers in their plotted forms. What we’ll do is find a way to measure the angle of one complex number with the $x$-axis, and then by subtraction we can get angles between arbitrary points. For example, in the figure below $\theta = \theta_1 - \theta_2$.

One way to do this is with trigonometry: the geometric drawing of $a + bi$ is the hypotenuse of a right triangle with the $x$-axis.

And so if $r$ is the length of the arrow, then by the definition of sine and cosine, $\cos(\theta) = a/r, \sin(\theta) = b/r$. If we have $r, \theta$, and $r > 0$, we can solve for a unique $a$ and $b$, so instead of representing a complex number in terms of the pair of numbers $(a,b)$, we can represent it with the pair of numbers $(r, \theta)$. And the conversion between the two is just

$a + bi = r \cos(\theta) + (r \sin(\theta)) i$

The $(r, \theta)$ representation is called the polar representation, while the $(a,b)$ representation is called the rectangular representation or the Cartesian representation. Converting between polar and Cartesian coordinates fills the pages of many awful pre-calculus textbooks (despite the fact that complex numbers don’t exist in classical calculus). Luckily for us Python has built-in functions to convert between the two representations for us.

&gt;&gt;&gt; import cmath
&gt;&gt;&gt; cmath.polar(1 + 1j)
(1.4142135623730951, 0.7853981633974483)
&gt;&gt;&gt; z = cmath.polar(1 + 1j)
&gt;&gt;&gt; cmath.rect(z[0], z[1])
(1.0000000000000002+1j)


It’s a little bit inaccurate on the rounding, but it’s fine for our purposes.

So how do we compute the angle between two complex numbers? Just convert each to the polar form, and subtract the second coordinates. So if we get back to our true goal, to figure out what multiplication by $i$ does, we can just do everything in polar form. Here’s a program that computes the angle between two complex numbers.

def angleBetween(z, w):
zPolar, wPolar = cmath.polar(z), cmath.polar(w)
return wPolar[1] - zPolar[1]

print(angleBetween(1 + 1j, (1 + 1j) * 1j))
print(angleBetween(2 - 3j, (2 - 3j) * 1j))
print(angleBetween(-0.5 + 7j, (-0.5 + 7j) * 1j))


Running it gives

1.5707963267948966
1.5707963267948966
-4.71238898038469


Note that the decimal form of $\pi/2$ is 1.57079…, and that the negative angle is equivalent to $\pi/2$ if you add a full turn of $2\pi$ to it. So programmatically we can see that for every input we try multiplying by $i$ rotates 90 degrees.

But we still haven’t proved it works. So let’s do that now. To say what the angle is between $r \cos (\theta) + ri \sin (\theta)$ and $i \cdot [r \cos (\theta) + ri \sin(\theta)] = -r \sin (\theta) + ri \cos(\theta)$, we need to transform the second number into the usual polar form (where the $i$ is on the sine part and not the cosine part). But we know, or I’m telling you now, this nice fact about sine and cosine:

$\displaystyle \sin(\theta + \pi/2) = cos(\theta)$
$\displaystyle \cos(\theta + \pi / 2) = -\sin(\theta)$

This fact is maybe awkward to write out algebraically, but it’s just saying that if you shift the whole sine curve a little bit you get the cosine curve, and if you keep shifting it you get the opposite of the sine curve (and if you kept shifting it even more you’d eventually get back to the sine curve; they’re called periodic for this reason).

So immediately we can rewrite the second number as $r \cos(\theta + \pi/2) + i r \sin (\theta + \pi/2)$. The angle is the same as the original angle plus a right angle of $\pi/2$. Neat!

Applying this same idea to $(a + bi) \cdot (c + di)$, it’s not much harder to prove that multiplying two complex numbers in general multiplies their lengths and adds their angles. So if a complex number $z$ has its magnitude $r$ smaller than 1, multiplying by $z$ squishes and rotates whatever is being multiplied. And if the magnitude is greater than 1, it stretches and rotates. So we have a super simple geometric understanding of how arithmetic with complex numbers works. And as we’re about to see, all this stretching and rotating results in some really weird (and beautifully mysterious!) mathematics and programs.

But before we do that we still have one question to address, the question that started this whole geometric train of thought: does $i$ have a square root? Indeed, I’m just looking for a number such that, when I square its length and double its angle, I get $i = \cos(\pi/2) + i \sin(\pi/2)$. Indeed, the angle we want is $\pi/4$, and the length we want is $r = 1$, which means $\sqrt{i} = \cos(\pi/4) + i \sin(\pi/4)$. Sweet! There is another root if you play with the signs, see if you can figure it out.

In fact it’s a very deeper and more beautiful theorem (“theorem” means “really important fact”) called the fundamental theorem of algebra. And essentially it says that the complex numbers are complete. That is, we can always find square roots, cube roots, or anything roots of numbers involving $i$. It actually says a lot more, but it’s easier to appreciate the rest of it after you do more math than we’re going to do in this post.

On to pretty patterns!

The Fractal

So here’s a little experiment. Since every point in the plane is the end of some arrow representing a complex number, we can imagine transforming the entire complex plane by transforming each number by the same rule. The most interesting simple rule we can think of: squaring! So though it might strain your capacity for imagination, try to visualize the idea like this. Squaring a complex number is the same as squaring it’s length and doubling its angle. So imagine: any numbers whose arrows are longer than 1 will grow much bigger, arrows shorter than 1 will shrink, and arrows of length exactly one will stay the same length (arrows close to length 1 will grow/shrink much more slowly than those far away from 1). And complex numbers with small positive angles will increase their angle, but only a bit, while larger angles will grow faster.

Here’s an animation made by Douglas Arnold showing what happens to the set of complex numbers $a + bi$ with $0 \leq a, b \leq 1$ or $-1 < a,b < 0$. Again, imagine every point is the end of a different arrow for the corresponding complex number. The animation is for a single squaring, and the points move along the arc they would travel if one rotated/stretched them smoothly.

So that’s pretty, but this is by all accounts a well-behaved transformation. It’s “predictable,” because for example we can always tell which complex numbers will get bigger and bigger (in length) and which will get smaller.

What if, just for the sake of tinkering, we changed the transformation a little bit? That is, instead of sending $z = a+bi$ to $z^2$ (I’ll often write this $z \mapsto z^2$), what if we sent

$\displaystyle z \mapsto z^2 + 1$

Now it’s not so obvious: which vectors will grow and which will shrink? Notice that it’s odd because adding 1 only changes the real part of the number. So a number whose length is greater than 1 can become small under this transformation. For example, $i$ is sent to $0$, so something slightly larger would also be close to zero. Indeed, $5i/4 \mapsto -9/16$.

So here’s an interesting question: are there any complex numbers that will stay small even if I keep transforming like this forever? Specifically, if I call $f(z) = z^2$, and I call $f^2(z) = f(f(z))$, and likewise call $f^k(z)$ for $k$ repeated transformations of $z$, is there a number $z$ so that for every $k$, the value $f^k(z) < 2$? “Obvious” choices like $z=0$ don’t work, and neither do random guesses like $z=i$ or $z=1$. So should we guess the answer is no?

Before we jump to conclusions let’s write a program to see what happens for more than our random guesses. The program is simple: we’ll define the “square plus one” function, and then repeatedly apply that function to a number for some long number of times (say, 250 times). If the length of the number stays under 2 after so many tries, we’ll call it “small forever,” and otherwise we’ll call it “not small forever.”

def squarePlusOne(z):
return z*z + 1

def isSmallForever(z, f):
k = 0

while abs(z) &lt; 2:
z = f(z)
k += 1

if k &gt; 250:
return True

return False


This isSmallForever function is generic: you can give it any function $f$ and it will repeatedly call $f$ on $z$ until the result grows bigger than 2 in length. Note that the abs function is a built-in Python function for computing the length of a complex number.

Then I wrote a classify function, which you can give a window and a small increment, and it will produce a grid of zeros and ones marking the results of isSmallForever. The details of the function are not that important. I also wrote a function that turns the grid into a picture. So here’s an example of how we’d use it:

from plotcomplex.plot import gridToImage

def classifySquarePlusOne(z):
return isSmallForever(z, squarePlusOne)

grid = classify(classifySquarePlusOne) # the other arguments are defaulted to [-2,2], [-2,2], 0.1
gridToImage(grid)


And here’s the result. Points colored black grow beyond 2, and white points stay small for the whole test.

Looks like they’ll always grow big.

So it looks like repeated squaring plus one will always make complex numbers grow big. That’s not too exciting, but we can always make it more exciting. What happens if we replace the 1 in $z^2 + 1$ with a different complex number? For example, if we do $z^2 - 1$ then will things always grow big?

You can randomly guess and see that 0 will never grow big, because $0^2 - 1 = -1$ and $(-1)^2 - 1 = 0$. It will just oscillate forever. So with -1 some numbers will grow and some will not! Let’s use the same routine above to see which:

def classifySquareMinusOne(z):
return isSmallForever(z, squareMinusOne)

grid = classify(classifySquareMinusOne)
gridToImage(grid)


And the result:

Now that’s a more interesting picture! Let’s ramp up the resolution

grid = classify(classifySquareMinusOne, step=0.001)
gridToImage(grid)


Gorgeous. If you try this at home you’ll notice, however, that this took a hell of a long time to run. Speeding up our programs is very possible, but it’s a long story for another time. For now we can just be patient.

Indeed, this image has a ton of interesting details! It looks almost circular in the middle, but if we zoom in we can see that it’s more like a rippling wave

It’s pretty incredible, and a huge question is jumping out at me: what the heck is causing this pattern to occur? What secret does -1 know that +1 doesn’t that makes the resulting pattern so intricate?

But an even bigger question is this. We just discovered that some values of $c$ make $z \mapsto z^2 + c$ result in interesting patterns, and some do not! So the question is which ones make interesting patterns? Even if we just, say, fix the starting point to zero: what is the pattern in the complex numbers that would tell me when this transformation makes zero blow up, and when it keeps zero small?

Sounds like a job for another program. This time we’ll use a nice little Python feature called a closure, which we define a function that saves the information that exists when it’s created for later. It will let us write a function that takes in $c$ and produces a function that transforms according to $z \mapsto z^2+c$.

def squarePlusC(c):
def f(z):
return z*z + c

return f


And we can use the very same classification/graphing function from before to do this.

def classifySquarePlusC(c):
return isSmallForever(0, squarePlusC(c))

grid = classify(classifySquarePlusC, xRange=(-2, 1), yRange=(-1, 1), step=0.005)
gridToImage(grid)


And the result:

Stunning. This wonderful pattern, which is still largely not understood today, is known as the Mandelbrot set. That is, the white points are the points in the Mandlebrot set, and the black points are not in it. The detail on the border of this thing is infinitely intricate. For example, we can change the window in our little program to zoom in on a particular region.

And if you keep zooming in you keep getting more and more detail. This was true of the specific case of $z^2 - 1$, but somehow the patterns in the Mandelbrot set are much more varied and interesting. And if you keep going down eventually you’ll see patterns that look like the original Mandelbrot set. We can already kind of see that happening above. The name for this idea is a fractal, and the $z^2 - 1$ image has it too. Fractals are a fascinating and mysterious subject studied in a field called discrete dynamical systems. Many people dedicate their entire lives to studying these things, and it’s for good reason. There’s a lot to learn and even more that’s unknown!

So this is the end of our journey for now. I’ve posted all of the code we used in the making of this post so you can continue to play, but here are some interesting ideas.

• The Mandelbrot set (and most fractals) are usually colored. The way they’re colored is as follows. Rather than just say true or false when zero blows up beyond 2 in length, you return the number of iterations $k$ that happened. Then you pick a color based on how big $k$ is. There’s a link below that lets you play with this. In fact, adding colors shows that there is even more intricate detail happening outside the Mandelbrot set that’s too faint to see in our pictures above. Such as this.
• Some very simple questions about fractals are very hard to answer. For example, is the Mandelbrot set connected? That is, is it possible to “walk” from every point in the Mandelbrot set to every other point without leaving the set? Despite the scattering of points in the zoomed in picture above that suggest the answer is no, the answer is actually yes! This is a really difficult thing to prove, however.
• The patterns in many fractals are often used to generate realistic looking landscapes and generate pseudo randomness. So fractals are not just mathematical curiosities.
• You should definitely be experimenting with this stuff! What happens if you change the length threshold from 2 to some bigger number? What about a smaller number? What if you do powers different than $2$? There’s so much to explore!
• The big picture thing to take away from this is that it’s not the numbers themselves that are particularly interesting, it’s the transformations of the numbers that generate these patterns! The interesting questions are what kinds of things are the same under these transformations, and what things are different. This is a very general idea in mathematics, and the more math you do the more you’ll find yourself wondering about useful and bizarre transformations.

For the chance to keep playing with the Mandelbrot set, check out this Mandelbrot grapher that works in your browser. It lets you drag rectangles to zoom further in on regions of interest. It’s really fun.

Until next time!

A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $\wedge$ denotes a logical AND and the vee $\vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $x_i$ or the negation $\overline{x_i}$, and connectives $\wedge$ and $\vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $\vee$‘s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $G$, we’ll construct a set $S_G$ of labeled truth assignments, and we’ll define the distribution $D$ to be the uniform distribution over those truth assignments used in $S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $S_G$ exactly how we labeled them, and we set the allowed error $\varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $G$ with $n$ nodes $v_1, \dots, v_n$ and a set of $m$ undirected edges $E$, we construct a set of examples with positive labels $S^+$ and one with negative examples $S^-$. The examples are truth assignments to $n$ variables, which we label $x_1, \dots, x_n$, and we identify a truth assignment to the $\left \{ 0,1 \right \}$-valued vector $(x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $S^+$ are simple: for each $v_i$ add a truth assignment $x_i = T, x_j = F$ for $j \neq i$. I.e., the binary vector is $(1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $i$-th position.

The negative examples $S^-$ come from the edges. For each edge $(v_i, v_j) \in E$, we add the example with a zero in the $i$-th and $j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

Claim: $G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $\varphi$.

Again, consistent just means that $\varphi$ is satisfied by every truth assignment in $S^+$ and unsatisfied by every example in $S^-$. Since we chose our distribution to be uniform over $S^+ \cup S^-$, we don’t care what $\varphi$ does elsewhere.

Indeed, if $G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $T_R$ be the AND of all the literals $x_i$ for which vertex $v_i$ is not red. For each such $i$, the corresponding example in $S^+$ will satisfy $T_R$, because we put a zero in the $i$-th position and ones everywhere else. Similarly, no example in $S^-$ will make $T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $(v_1,v_2)$. Then the corresponding negative example is $(0,0,1)$. Unless both $v_1$ and $v_2$ are colored red, one of $x_1, x_2$ will have to be ANDed as part of $T_R$. But the example has a zero for both $x_1$ and $x_2$, so $T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $\varphi$. We need to construct a three coloring of $G$. It goes in largely the same way: label the clauses $\varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $v_i$ the color of the clause that is satisfied by the corresponding example in $S^+$. There must be some clause that does this because $\varphi$ is consistent with $S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $(v_i, v_j)$, and both $v_i$ and $v_j$ are colored, say, blue. Look at the clause $T_B$: since $v_i$ and $v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $T_B$ true. Since those two positive examples differ in both their $i$-th and $j$-th positions, $T_B$ can’t have any of the literals $x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $T_B$ because it has 1’s everywhere except $i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $\delta < 1/2$ and $\varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $1 / |S^+ \cup S^-|$, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $\mathsf{C}$ over a set $X$ is efficiently PAC-learnable using the hypothesis class $\mathsf{H}$ if there exists an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all $0 < \delta , \varepsilon < 1/2$, the probability that $A$ produces a hypothesis $h \in \mathsf{H}$ with error at most $\varepsilon$ is at least $1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability $p$. If $X$ denotes the number of trials needed to get the first success, then clearly $\Pr(X = k) = (1-p)^{k-1} p$ (since first we need $k-1$ failures which occur independently with probability $1-p$, then we need one success which happens with probability $p$). Thus the expected value of $X$ is

$\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p$

by basic calculus. In particular, if success is defined as getting a six, then $p=1/6$ thus the expected time is $1/p=6$.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is $\frac{1}{26}$, thus the probability of a random eleven-letter word being ABRACADABRA is exactly $\left(\frac{1}{26}\right)^{11}$. So if typing 11 letters is one trial, the expected number of trials is

$\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}$

which means $11\cdot 26^{11}$ keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that $11\cdot 26^{11}$ is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses$1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive$26 if he wins, since the probability of getting the next letter right is exactly $\frac{1}{26}$ (thus the expected value of the change in the gambler’s fortune is $\frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0$.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as $T$. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly $T$ dollars. How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins$26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins $\26^4$. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be $\26^{11}$. Thus our casino will have to give out $26^{11}+26^4+26$ dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is $T$ dollars, the expected value of our expenses is $26^{11}+26^4+26$ dollars, thus $E(T)=26^{11}+26^4+26$. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. $X_0$ will denote the gambler’s fortune before the game starts, $X_1$ the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of $X_n$, given $X_1, X_2, \ldots, X_{n-1}$ is the same as $X_{n-1}$. This can be written as $E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1}$ or, equivalently, $E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0$.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let $X'_n = X_n$ if $n < T$ and $X'_n = X_T$ if $n \ge T$ where $T$ denotes the stopping time, i.e. the time at which the desired event occurs. Notice that $T$ itself is a random variable.

We require our stopping time $T$ to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let $X_n$ be a martingale stopped at step $T$, and suppose one of the following three conditions hold:

1. The stopping time $T$ is almost surely bounded by some constant;
2. The stopping time $T$ is almost surely finite and every step of the stopped martingale $X_n$ is almost surely bounded by some constant; or
3. The expected stopping time $E(T)$ is finite and the absolute value of the martingale increments $|X_n-X_{n-1}|$ are almost surely bounded by a constant.

Then $E(X_T) = E(X_0).$

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than $26^{12}$) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by $26^{11}+26^4+26$. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has $a$ dollars, the second player has $b$ dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let $X_n$ denote the change in the second player’s fortune, and set $X_0 = 0$. Let $T_k$ denote the first time $s$ when $X_s = k$. Then our first question can be formalized as trying to determine $\Pr(T_{-b} < T_a)$. Let $t = \min \{ T_{-b}, T_a\}$. Clearly $t$ is a stopping time. By the optional stopping theorem we have that

$\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))$

thus $\Pr(T_{-b} < T_a)=\frac{a}{a+b}$.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: $X_n^2-n$ is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with $k$-SAT for some $k \ne 3$ instead? Clearly the hardness of the problem is monotone increasing in $k$ since $k$-SAT is a special case of $(k+1)$-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as $k$-SAT for any $k>3$. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals $x$ and $y$ are connected by a directed edge if there is a clause $(\bar x \lor y)$. Recall that $\bar x \lor y$ is equivalent to $x \implies y$, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths $x \to \bar x$ and $\bar x \to x$ (since $x \Leftrightarrow \bar x$ is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after $O(n^2)$ rounds where $n$ denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in $O(n^2)$ steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by $\pm 1$ in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least $\frac{1}{2}$. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond $n$. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number $n$, on the other hand, is called a reflecting barrier: we cannot reach $n+1$, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in $O(n^2)$ steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .