Community Detection in Graphs — a Casual Tour

Graphs are among the most interesting and useful objects in mathematics. Any situation or idea that can be described by objects with connections is a graph, and one of the most prominent examples of a real-world graph that one can come up with is a social network.

Recall, if you aren’t already familiar with this blog’s gentle introduction to graphs, that a graph G is defined by a set of vertices V, and a set of edges E, each of which connects two vertices. For this post the edges will be undirected, meaning connections between vertices are symmetric.

One of the most common topics to talk about for graphs is the notion of a community. But what does one actually mean by that word? It’s easy to give an informal definition: a subset of vertices C such that there are many more edges between vertices in C than from vertices in C to vertices in V - C (the complement of C). Try to make this notion precise, however, and you open a door to a world of difficult problems and open research questions. Indeed, nobody has yet come to a conclusive and useful definition of what it means to be a community. In this post we’ll see why this is such a hard problem, and we’ll see that it mostly has to do with the word “useful.” In future posts we plan to cover some techniques that have found widespread success in practice, but this post is intended to impress upon the reader how difficult the problem is.

The simplest idea

The simplest thing to do is to say a community is a subset of vertices which are completely connected to each other. In the technical parlance, a community is a subgraph which forms a clique. Sometimes an n-clique is also called a complete graph on n vertices, denoted K_n. Here’s an example of a 5-clique in a larger graph:

"Where's Waldo" for graph theorists: a clique hidden in a larger graph.

“Where’s Waldo” for graph theorists: a clique hidden in a larger graph.

Indeed, it seems reasonable that if we can reliably find communities at all, then we should be able to find cliques. But as fate should have it, this problem is known to be computationally intractable. In more detail, the problem of finding the largest clique in a graph is NP-hard. That essentially means we don’t have any better algorithms to find cliques in general graphs than to try all possible subsets of the vertices and check to see which, if any, form cliques. In fact it’s much worse, this problem is known to be hard to approximate to any reasonable factor in the worst case (the error of the approximation grows polynomially with the size of the graph!). So we can’t even hope to find a clique half the size of the biggest, or a thousandth the size!

But we have to take these impossibility results with a grain of salt: they only say things about the worst case graphs. And when we’re looking for communities in the real world, the worst case will never show up. Really, it won’t! In these proofs, “worst case” means that they encode some arbitrarily convoluted logic problem into a graph, so that finding the clique means solving the logic problem. To think that someone could engineer their social network to encode difficult logic problems is ridiculous.

So what about an “average case” graph? To formulate this typically means we need to consider graphs randomly drawn from a distribution.

Random graphs

The simplest kind of “randomized” graph you could have is the following. You fix some set of vertices, and then run an experiment: for each pair of vertices you flip a coin, and if the coin is heads you place an edge and otherwise you don’t. This defines a distribution on graphs called G(n, 1/2), which we can generalize to G(n, p) for a coin with bias p. With a slight abuse of notation, we call G(n, p) the Erdős–Rényi random graph (it’s not a graph but a distribution on graphs). We explored this topic form a more mathematical perspective earlier on this blog.

So we can sample from this distribution and ask questions like: what’s the probability of the largest clique being size at least 20? Indeed, cliques in Erdős–Rényi random graphs are so well understood that we know exactly how they work. For example, if p=1/2 then the size of the largest clique is guaranteed (with overwhelming probability as n grows) to have size k(n) or k(n)+1, where k(n) is about 2 \log n. Just as much is known about other values of p as well as other properties of G(n,p), see Wikipedia for a short list.

In other words, if we wanted to find the largest clique in an Erdős–Rényi random graph, we could check all subsets of size roughly 2\log(n), which would take about (n / \log(n))^{\log(n)} time. This is pretty terrible, and I’ve never heard of an algorithm that does better (contrary to the original statement in this paragraph that showed I can’t count). In any case, it turns out that the Erdős–Rényi random graph, and using cliques to represent communities, is far from realistic. There are many reasons why this is the case, but here’s one example that fits with the topic at hand. If I thought the world’s social network was distributed according to G(n, 1/2) and communities were cliques, then I would be claiming that the largest community is of size 65 or 66. Estimated world population: 7 billion, 2 \log(7 \cdot 10^9) \sim 65. Clearly this is ridiculous: there are groups of larger than 66 people that we would want to call “communities,” and there are plenty of communities that don’t form bona-fide cliques.

Another avenue shows that things are still not as easy as they seem in Erdős–Rényi land. This is the so-called planted clique problem. That is, you draw a graph G from G(n, 1/2). You give G to me and I pick a random but secret subset of r vertices and I add enough edges to make those vertices form an r-clique. Then I ask you to find the r-clique. Clearly it doesn’t make sense when r < 2 \log (n) because you won’t be able to tell it apart from the guaranteed cliques in G. But even worse, nobody knows how to find the planted clique when r is even a little bit smaller than \sqrt{n} (like, r = n^{9/20} even). Just to solidify this with some numbers, we don’t know how to reliably find a planted clique of size 60 in a random graph on ten thousand vertices, but we do when the size of the clique goes up to 100. The best algorithms we know rely on some sophisticated tools in spectral graph theory, and their details are beyond the scope of this post.

So Erdős–Rényi graphs seem to have no hope. What’s next? There are a couple of routes we can take from here. We can try to change our random graph model to be more realistic. We can relax our notion of communities from cliques to something else. We can do both, or we can do something completely different.

Other kinds of random graphs

There is an interesting model of Barabási and Albert, often called the “preferential attachment” model, that has been described as a good model of large, quickly growing networks like the internet. Here’s the idea: you start off with a two-clique G = K_2, and at each time step t you add a new vertex v to G, and new edges so that the probability that the edge (v,w) is added to G is proportional to the degree of w (as a fraction of the total number of edges in G). Here’s an animation of this process:

Image source: Wikipedia

The significance of this random model is that it creates graphs with a small number of hubs, and a large number of low-degree vertices. In other words, the preferential attachment model tends to “make the rich richer.” Another perspective is that the degree distribution of such a graph is guaranteed to fit a so-called power-law distribution. Informally, this means that the overall fraction of small-degree vertices gives a significant contribution to the total number of edges. This is sometimes called a “fat-tailed” distribution. Since power-law distributions are observed in a wide variety of natural settings, some have used this as justification for working in the preferential attachment setting. On the other hand, this model is known to have no significant community structure (by any reasonable definition, certainly not having cliques of nontrivial size), and this has been used as evidence against the model. I am not aware of any work done on planting dense subgraphs in graphs drawn from a preferential attachment model, but I think it’s likely to be trivial and uninteresting. On the other hand, Bubeck et al. have looked at changing the initial graph (the “seed”) from a 2-clique to something else, and seeing how that affects the overall limiting distribution.

Another model that often shows up is a model that allows one to make a random graph starting with any fixed degree distribution, not just a power law. There are a number of models that do this to some fashion, and you’ll hear a lot of hyphenated names thrown around like Chung-Lu and Molloy-Reed and Newman-Strogatz-Watts. The one we’ll describe is quite simple. Say you start with a set of vertices V, and a number d_v for each vertex v, such that the sum of all the d_v is even. This condition is required because in any graph the sum of the degrees of a vertex is twice the number of edges. Then you imagine each vertex v having d_v “edge-stubs.” The name suggests a picture like the one below:

Each node has a prescribed number of "edge stubs," which are randomly connected to form a graph.

Each node has a prescribed number of “edge stubs,” which are randomly connected to form a graph.

Now you pick two edge stubs at random and connect them. One usually allows self-loops and multiple edges between vertices, so that it’s okay to pick two edge stubs from the same vertex. You keep doing this until all the edge stubs are accounted for, and this is your random graph. The degrees were fixed at the beginning, so the only randomization is in which vertices are adjacent. The same obvious biases apply, that any given vertex is more likely to be adjacent to high-degree vertices, but now we get to control the biases with much more precision.

The reason such a model is useful is that when you’re working with graphs in the real world, you usually have statistical information available. It’s simple to compute the degree of each vertex, and so you can use this random graph as a sort of “prior” distribution and look for anomalies. In particular, this is precisely how one of the leading measures of community structure works: the measure of modularity. We’ll talk about this in the next section.

Other kinds of communities

Here’s one easy way to relax our notion of communities. Rather than finding complete subgraphs, we could ask about finding very dense subgraphs (ignoring what happens outside the subgraph). We compute density as the average degree of vertices in the subgraph.

If we impose no bound on the size of the subgraph an algorithm is allowed to output, then there is an efficient algorithm for finding the densest subgraph in a given graph. The general exact solution involves solving a linear programming problem and a little extra work, but luckily there is a greedy algorithm that can get within half of the optimal density. You start with all the vertices S_n = V, and remove any vertex of minimal degree to get S_{n-1}. Continue until S_0, and then compute the density of all the S_i. The best one is guaranteed to be at least half of the optimal density. See this paper of Moses Charikar for a more formal analysis.

One problem with this is that the size of the densest subgraph might be too big. Unfortunately, if you fix the size of the dense subgraph you’re looking for (say, you want to find the densest subgraph of size at most k where k is an input), then the problem once again becomes NP-hard and suffers from the same sort of inapproximability theorems as finding the largest clique.

A more important issue with this is that a dense subgraph isn’t necessarily a community. In particular, we want communities to be dense on the inside and sparse on the outside. The densest subgraph analysis, however, might rate the following graph as one big dense subgraph instead of two separately dense communities with some modest (but not too modest) amount of connections between them.

What should be identified as communities?

What are the correct communities here?

Indeed, we want a quantifiable a notion of “dense on the inside and sparse on the outside.” One such formalization is called modularity. Modularity works as follows. If you give me some partition of the vertices of G into two sets, modularity measures how well this partition reflects two separate communities. It’s the definition of “community” here that makes it interesting. Rather than ask about densities exactly, you can compare the densities to the expected densities in a given random graph model.

In particular, we can use the fixed-degree distribution model from the last section. If we know the degrees of all the vertices ahead of time, we can compute the probability that we see some number of edges going between the two pieces of the partition relative to what we would see at random. If the difference is large (and largely biased toward fewer edges across the partition and more edges within the two subsets), then we say it has high modularity. This involves a lot of computations  — the whole measure can be written as a quadratic form via one big matrix — but the idea is simple enough. We intend to write more about modularity and implement the algorithm on this blog, but the excited reader can see the original paper of M.E.J. Newman.

Now modularity is very popular but it too has shortcomings. First, even though you can compute the modularity of a given partition, there’s still the problem of finding the partition that globally maximizes modularity. Sadly, this is known to be NP-hard. Mover, it’s known to be NP-hard even if you’re just trying to find a partition into two pieces that maximizes modularity, and even still when the graph is regular (every vertex has the same degree).

Still worse, while there are some readily accepted heuristics that often “do well enough” in practice, we don’t even know how to approximate modularity very well. Bhaskar DasGupta has a line of work studying approximations of maximum modularity, and he has proved that for dense graphs you can’t even approximate modularity to within any constant factor. That is, the best you can do is have an approximation that gets worse as the size of the graph grows. It’s similar to the bad news we had for finding the largest clique, but not as bad. For example, when the graph is sparse it’s known that one can approximate modularity to within a \log(n) factor of the optimum, where n is the number of vertices of the graph (for cliques the factor was like n^c for some c, and this is drastically worse).

Another empirical issue is that modularity seems to fail to find small communities. That is, if your graph has some large communities and some small communities, strictly maximizing the modularity is not the right thing to do. So we’ve seen that even the leading method in the field has some issues.

Something completely different

The last method I want to sketch is in the realm of “something completely different.” The notion is that if we’re given a graph, we can run some experiment on the graph, and the results of that experiment can give us insight into where the communities are.

The experiment I’m going to talk about is the random walk. That is, say you have a vertex v in a graph G and you want to find some vertices that are “closest” to v. That is, those that are most likely to be in the same community as v. What you can do is run a random walk starting at v. By a “random walk” I mean you start at v, you pick a neighbor at random and move to it, then repeat. You can compute statistics about the vertices you visit in a sample of such walks, and the vertices that you visit most often are those you say are “in the same community as v. One important parameter is how long the walk is, but it’s generally believed to be best if you keep it between 3-6 steps.

Of course, this is not a partition of the vertices, so it’s not a community detection algorithm, but you can turn it into one. Run this process for each vertex, and use it to compute a “distance” between all the pairs of vertices. Then you compute a tree of partitions by lumping the closest pairs of vertices into the same community, one at a time, until you’ve got every vertex. At each step of the way, you compute the modularity of the partition, and when you’re done you choose the partition that maximizes modularity. This algorithm as a whole is called the walktrap clustering algorithm, and was introduced by Pons and Latapy in 2005.

This sounds like a really great idea, because it’s intuitive: there’s a relatively high chance that the friends of your friends are also your friends. It’s also really great because there is an easily measurable tradeoff between runtime and quality: you can tune down the length of the random walk, and the number of samples you take for each vertex, to speed up the runtime but lower the quality of your statistical estimates. So if you’re working on huge graphs, you get a lot of control and a clear idea of exactly what’s going on inside the algorithm (something which is not immediately clear in a lot of these papers).

Unfortunately, I’m not aware of any concrete theoretical guarantees for walktrap clustering. The one bit of theoretical justification I’ve read over the last year is that you can relate the expected distances you get to certain spectral properties of the graph that are known to be related to community structure, but the lower bounds on maximizing modularity already suggest (though they do not imply) that walktrap won’t do that well in the worst case.

So many algorithms, so little time!

I have only brushed the surface of the literature on community detection, and the things I have discussed are heavily biased toward what I’ve read about and used in my own research. There are methods based on information theory, label propagation, and obscure physics processes like “spin glass” (whatever that is, it sounds frustrating).

And we have only been talking about perfect community structure. What if you want to allow people to be in multiple communities, or have communities at varying levels of granularity (e.g. a sports club within a school versus the whole student body of that school)? What if we want to allow people to be “members” of a community at varying degrees of intensity? How do we deal with noisy signals in our graphs? For example, if we get our data from observing people talk, are two people who have heated arguments considered to be in the same community? Since a lot social network data comes from sources like Twitter and Facebook where arguments are rampant, how do we distinguish between useful and useless data? More subtly, how do we determine useful information if a group within the social network are trying to mask their discovery? That is, how do we deal with adversarial noise in a graph?

And all of this is just on static graphs! What about graphs that change over time? You can keep making the problem more and more complicated as it gets more realistic.

With the huge wealth of research that has already been done just on the simplest case, and the difficult problems and known barriers to success even for the simple problems, it seems almost intimidating to even begin to try to answer these questions. But maybe that’s what makes them fascinating, not to mention that governments and big businesses pour many millions of dollars into this kind of research.

In the future of this blog we plan to derive and implement some of the basic methods of community detection. This includes, as a first outline, the modularity measure and the walktrap clustering algorithm. Considering that I’m also going to spend a large part of the summer thinking about these problems (indeed, with some of the leading researchers and upcoming stars under the sponsorship of the American Mathematical Society), it’s unlikely to end there.

Until next time!

Programming with Finite Fields

Back when I was first exposed to programming language design, I decided it would be really cool if there were a language that let you define your own number types and then do all your programming within those number types. And since I get excited about math, I think of really exotic number types (Boolean rings, Gaussian integers, Octonions, oh my!). I imagined it would be a language feature, so I could do something like this:

use gaussianintegers as numbers

x = 1 + i
y = 2 - 3i
print(x*y)

z = 2 + 3.5i     # error

I’m not sure why I thought this would be so cool. Perhaps I felt like I would be teaching a computer math. Or maybe the next level of abstraction in playing god by writing programs is to play god by designing languages (and I secretly satisfy a massive god complex by dictating the actions of my computer).

But despite not writing a language of my own, programming with weird number systems still has a special place in my heart. It just so happens that we’re in the middle of a long series on elliptic curves, and in the next post we’ll actually implement elliptic curve arithmetic over a special kind of number type (the finite field). In this post, we’ll lay the groundwork by implementing number types in Python that allow us to work over any finite field. This is actually a pretty convoluted journey, and to be totally rigorous we’d need to prove a bunch of lemmas, develop a bunch of ring theory, and prove the correctness of a few algorithms involving polynomials.

Instead of taking the long and winding road, we’ll just state the important facts with links to proofs, prove the easy stuff, and focus more heavily than usual on the particular Python implementation details. As usual, all of the code used in this post is available on this blog’s Github page.

Integers Modulo Primes

The simples kind of finite field is the set of integers modulo a prime. We’ve dealt with this number field extensively on this blog (in groups, rings, fields, with RSA, etc.), but let’s recall what it is. The modulo operator \mod (in programming it’s often denoted %) is a binary operation on integers such that x \mod y is the unique positive remainder of x when divided by y.

Definition: Let p be a prime number. The set \mathbb{Z}/p consists of the numbers \left \{ 0, 1, \dots, p-1 \right \}. If you endow it with the operations of addition (mod p) and multiplication (mod p), it forms a field.

To say it’s a field is just to say that arithmetic more or less behaves the way we expect it to, and in particular that every nonzero element has a (unique) multiplicative inverse. Making a number type for \mathbb{Z}/p in Python is quite simple.

def IntegersModP(p):
   class IntegerModP(FieldElement):
      def __init__(self, n):
         self.n = n % p
         self.field = IntegerModP

      def __add__(self, other): return IntegerModP(self.n + other.n)
      def __sub__(self, other): return IntegerModP(self.n - other.n)
      def __mul__(self, other): return IntegerModP(self.n * other.n)
      def __truediv__(self, other): return self * other.inverse()
      def __div__(self, other): return self * other.inverse()
      def __neg__(self): return IntegerModP(-self.n)
      def __eq__(self, other): return isinstance(other, IntegerModP) and self.n == other.n
      def __abs__(self): return abs(self.n)
      def __str__(self): return str(self.n)
      def __repr__(self): return '%d (mod %d)' % (self.n, self.p)

      def __divmod__(self, divisor):
         q,r = divmod(self.n, divisor.n)
         return (IntegerModP(q), IntegerModP(r))

      def inverse(self):
         ...?

   IntegerModP.p = p
   IntegerModP.__name__ = 'Z/%d' % (p)
   return IntegerModP

We’ve done a couple of things worth note here. First, all of the double-underscore methods are operator overloads, so they are called when one tries to, e.g., add two instances of this class together. We’ve also implemented a division algorithm via __divmod__ which computes a (quotient, remainder) pair for the appropriate division. The built in Python function divmod function does this for integers, and we can overload it for a custom type. We’ll write a more complicated division algorithm later in this post. Finally, we’re dynamically creating our class so that different primes will correspond to different types. We’ll come back to why this encapsulation is a good idea later, but it’s crucial to make our next few functions reusable and elegant.

Here’s an example of the class in use:

>>> mod7 = IntegersModP(7)
>>> mod7(3) + mod7(6)
2 (mod 7)

The last (undefined) function in the IntegersModP class, the inverse function, is our only mathematical hurdle. Luckily, we can compute inverses in a generic way, using an algorithm called the extended Euclidean algorithm. Here’s the mathematics.

Definition: An element d is called a greatest common divisor (gcd) of a,b if it divides both a and b, and for every other z dividing both a and b, z divides d. For \mathbb{Z}/p gcd’s and we denote it as \gcd(a,b). [1]

Note that we called it ‘a’ greatest common divisor. In general gcd’s need not be unique, though for integers one often picks the positive gcd. We’ll actually see this cause a tiny programmatic bug later in this post, but let’s push on for now.

Theorem: For any two integers a,b \in \mathbb{Z} there exist unique x,y \in \mathbb{Z} such that ax + by = \gcd(a,b).

We could beat around the bush and try to prove these things in various ways, but when it comes down to it there’s one algorithm of central importance that both computes the gcd and produces the needed linear combination x,y. The algorithm is called the Euclidean algorithm. Here is a simple version that just gives the gcd.

def gcd(a, b):
   if abs(a) < abs(b):
      return gcd(b, a)

   while abs(b) > 0:
      q,r = divmod(a,b)
      a,b = b,r

   return a

This works by the simple observation that \gcd(a, aq+r) = \gcd(a,r) (this is an easy exercise to prove directly). So the Euclidean algorithm just keeps applying this rule over and over again: take the remainder when dividing the bigger argument by the smaller argument until the remainder becomes zero. Then the \gcd(x,0) = x because everything divides zero.

Now the so-called ‘extended’ Euclidean algorithm just keeps track of some additional data as it goes (the partial quotients and remainders). Here’s the algorithm.

def extendedEuclideanAlgorithm(a, b):
   if abs(b) > abs(a):
      (x,y,d) = extendedEuclideanAlgorithm(b, a)
      return (y,x,d)

   if abs(b) == 0:
      return (1, 0, a)

   x1, x2, y1, y2 = 0, 1, 1, 0
   while abs(b) > 0:
      q, r = divmod(a,b)
      x = x2 - q*x1
      y = y2 - q*y1
      a, b, x2, x1, y2, y1 = b, r, x1, x, y1, y

   return (x2, y2, a)

Indeed, the reader who hasn’t seen this stuff before is encouraged to trace out a run for the numbers 4864, 3458. Their gcd is 38 and the two integers are 32 and -45, respectively.

How does this help us compute inverses? Well, if we want to find the inverse of a modulo p, we know that their gcd is 1. So compute the x,y such that ax + py = 1, and then reduce both sides mod p. You get ax + 0 = 1 \mod p, which means that x \mod p is the inverse of a. So once we have the extended Euclidean algorithm our inverse function is trivial to write!

def inverse(self):
   x,y,d = extendedEuclideanAlgorithm(self.n, self.p)
   return IntegerModP(x)

And indeed it works as expected:

>>> mod23 = IntegersModP(23)
>>> mod23(7).inverse()
10 (mod 23)
>>> mod23(7).inverse() * mod23(7)
1 (mod 23)

Now one very cool thing, and something that makes some basic ring theory worth understanding, is that we can compute the gcd of any number type using the exact same code for the Euclidean algorithm, provided we implement an abs function and a division algorithm. Via a chain of relatively easy-to-prove lemmas, if your number type has enough structure (in particular, if it has a division algorithm that satisfies some properties), then greatest common divisors are well-defined, and the Euclidean algorithm gives us that special linear combination. And using the same trick above in finite fields, we can use the Euclidean algorithm to compute inverses.

But in order to make things work programmatically we need to be able to deal with the literal ints 0 and 1 in the algorithm. That is, we need to be able to silently typecast integers up to whatever number type we’re working with. This makes sense because all rings have 0 and 1, but it requires a bit of scaffolding to implement. In particular, typecasting things sensibly is really difficult if you aren’t careful. And the problems are compounded in a language like Python that blatantly ignores types whenever possible. [2]

So let’s take a quick break to implement a tiny type system with implicit typecasting.

[1] The reader familiar with our series on category theory will recognize this as the product of two integers in a category whose arrows represent divisibility. So by abstract nonsense, this proves that gcd’s are unique up to multiplication by a unit in any ring.
[2] In the process of writing the code for this post, I was sorely missing the stronger type systems of Java and Haskell. Never thought I’d say that, but it’s true.

A Tiny Generic Type System

The main driving force behind our type system will be a decorator called @typecheck. We covered decorators toward the end of our primer on dynamic programming, but in short a decorator is a Python syntax shortcut that allows some pre- or post-processing to happen to a function in a reusable way. All you need to do to apply the pre/post-processing is prefix the function definition with the name of the decorator.

Our decorator will be called typecheck, and it will decorate binary operations on our number types. In its basic form, our type checker will work as follows: if the types of the two operands are the same, then the decorator will just pass them on through to the operator. Otherwise, it will try to do some typecasting, and if that fails it will raise exceptions with reckless abandon.

def typecheck(f):
   def newF(self, other):
      if type(self) is not type(other):
         try:
            other = self.__class__(other)
         except TypeError:
            message = 'Not able to typecast %s of type %s to type %s in function %s'
            raise TypeError(message % (other, type(other).__name__, type(self).__name__, f.__name__))
         except Exception as e:
            message = 'Type error on arguments %r, %r for functon %s. Reason:%s'
            raise TypeError(message % (self, other, f.__name__, type(other).__name__, type(self).__name__, e))

      return f(self, other)

   return newF

So this is great, but there are two issues. The first is that this will only silently typecast if the thing we’re casting is on the right-hand side of the expression. In other words, the following will raise an exception complaining that you can’t add ints to Mod7 integers.

>>> x = IntegersModP(7)(1)
>>> 1 + x

What we need are the right-hand versions of all the operator overloads. They are the same as the usual operator overloads, but Python gives preference to the left-hand operator overloads. Anticipating that we will need to rewrite these silly right-hand overloads for every number type, and they’ll all be the same, we make two common base classes.

class DomainElement(object):
   def __radd__(self, other): return self + other
   def __rsub__(self, other): return -self + other
   def __rmul__(self, other): return self * other

class FieldElement(DomainElement):
   def __truediv__(self, other): return self * other.inverse()
   def __rtruediv__(self, other): return self.inverse() * other
   def __div__(self, other): return self.__truediv__(other)
   def __rdiv__(self, other): return self.__rtruediv__(other)

And we can go ahead and make our IntegersModP a subclass of FieldElement. [3]

But now we’re wading into very deep waters. In particular, we know ahead of time that our next number type will be for Polynomials (over the integers, or fractions, or \mathbb{Z}/p, or whatever). And we’ll want to do silent typecasting from ints and IntegersModP to Polynomials! The astute reader will notice the discrepancy. What will happen if I try to do this?

>>> MyInteger() + MyPolynomial()

Let’s take this slowly: by our assumption both MyInteger and MyPolynomial have the __add__ and __radd__ functions defined on them, and each tries to typecast the other the appropriate type. But which is called? According to Python’s documentation if the left-hand side has an __add__ function that’s called first, and the right-hand sides’s __radd__ function is only sought if no __add__ function is found for the left operand.

Well that’s a problem, and we’ll deal with it in a half awkward and half elegant way. What we’ll do is endow our number types with an “operatorPrecedence” constant. And then inside our type checker function we’ll see if the right-hand operand is an object of higher precedence. If it is, we return the global constant NotImplemented, which Python takes to mean that no __add__ function was found, and it proceeds to look for __radd__. And so with this modification our typechecker is done. [4]

def typecheck(f):
   def newF(self, other):
      if (hasattr(other.__class__, 'operatorPrecedence') and
            other.__class__.operatorPrecedence > self.__class__.operatorPrecedence):
         return NotImplemented

      if type(self) is not type(other):
         try:
            other = self.__class__(other)
         except TypeError:
            message = 'Not able to typecast %s of type %s to type %s in function %s'
            raise TypeError(message % (other, type(other).__name__, type(self).__name__, f.__name__))
         except Exception as e:
            message = 'Type error on arguments %r, %r for functon %s. Reason:%s'
            raise TypeError(message % (self, other, f.__name__, type(other).__name__, type(self).__name__, e))

      return f(self, other)

   return newF

We add a default operatorPrecedence of 1 to the DomainElement base class. Now this function answers our earlier question of why we want to encapsulate the prime modulus into the IntegersModP class. If this typechecker is really going to be generic, we need to be able to typecast an int by passing the single int argument to the type constructor with no additional information! Indeed, this will be the same pattern for our polynomial class and the finite field class to follow.

Now there is still one subtle problem. If we try to generate two copies of the same number type from our number-type generator (in other words, the following code snippet), we’ll get a nasty exception.

>>> mod7 = IntegersModP(7)
>>> mod7Copy = IntegersModP(7)
>>> mod7(1) + mod7Copy(2)
... fat error ...

The reason for this is that in the type-checker we’re using the Python built-in ‘is’ which checks for identity, not semantic equality. To fix this, we simply need to memoize the IntegersModP function (and all the other functions we’ll use to generate number types) so that there is only ever one copy of a number type in existence at a time.

So enough Python hacking: let’s get on with implementing finite fields!

[3] This also compels us to make some slight modifications to the constructor for IntegersModP, but they’re not significant enough to display here. Check out the Github repo if you want to see.
[4] This is truly a hack, and we’ve considered submitting a feature request to the Python devs. It is conceivably useful for the operator-overloading aficionado. I’d be interested to hear your thoughts in the comments as to whether this is a reasonable feature to add to Python.

Polynomial Arithmetic

Recall from our finite field primer that every finite field can be constructed as a quotient of a polynomial ring with coefficients in \mathbb{Z}/p by some prime ideal. We spelled out exactly what this means in fine detail in the primer, so check that out before reading on.

Indeed, to construct a finite field we need to find some irreducible monic polynomial f with coefficients in \mathbb{Z}/p, and then the elements of our field will be remainders of arbitrary polynomials when divided by f. In order to determine if they’re irreducible we’ll need to compute a gcd. So let’s build a generic polynomial type with a polynomial division algorithm, and hook it into our gcd framework.

We start off in much the same way as with the IntegersModP:

# create a polynomial with coefficients in a field; coefficients are in
# increasing order of monomial degree so that, for example, [1,2,3]
# corresponds to 1 + 2x + 3x^2
@memoize
def polynomialsOver(field=fractions.Fraction):

   class Polynomial(DomainElement):
      operatorPrecedence = 2
      factory = lambda L: Polynomial([field(x) for x in L])

      def __init__(self, c):
         if type(c) is Polynomial:
            self.coefficients = c.coefficients
         elif isinstance(c, field):
            self.coefficients = [c]
         elif not hasattr(c, '__iter__') and not hasattr(c, 'iter'):
            self.coefficients = [field(c)]
         else:
            self.coefficients = c

         self.coefficients = strip(self.coefficients, field(0))

      def isZero(self): return self.coefficients == []

      def __repr__(self):
         if self.isZero():
            return '0'

         return ' + '.join(['%s x^%d' % (a,i) if i > 0 else '%s'%a
                              for i,a in enumerate(self.coefficients)])

      def __abs__(self): return len(self.coefficients)
      def __len__(self): return len(self.coefficients)
      def __sub__(self, other): return self + (-other)
      def __iter__(self): return iter(self.coefficients)
      def __neg__(self): return Polynomial([-a for a in self])

      def iter(self): return self.__iter__()
      def leadingCoefficient(self): return self.coefficients[-1]
      def degree(self): return abs(self) - 1

All of this code just lays down conventions. A polynomial is a list of coefficients (in increasing order of their monomial degree), the zero polynomial is the empty list of coefficients, and the abs() of a polynomial is one plus its degree. [5] Finally, instead of closing over a prime modulus, as with IntegersModP, we’re closing over the field of coefficients. In general you don’t have to have polynomials with coefficients in a field, but if they do come from a field then you’re guaranteed to get a sensible Euclidean algorithm. In the formal parlance, if k is a field then k[x] is a Euclidean domain. And for our goal of defining finite fields, we will always have coefficients from \mathbb{Z}/p, so there’s no problem.

Now we can define things like addition, multiplication, and equality using our typechecker to silently cast and watch for errors.

      @typecheck
      def __eq__(self, other):
         return self.degree() == other.degree() and all([x==y for (x,y) in zip(self, other)])

      @typecheck
      def __add__(self, other):
         newCoefficients = [sum(x) for x in itertools.zip_longest(self, other, fillvalue=self.field(0))]
         return Polynomial(newCoefficients)

      @typecheck
      def __mul__(self, other):
         if self.isZero() or other.isZero():
            return Zero()

         newCoeffs = [self.field(0) for _ in range(len(self) + len(other) - 1)]

         for i,a in enumerate(self):
            for j,b in enumerate(other):
               newCoeffs[i+j] += a*b

         return Polynomial(newCoeffs)

Notice that, if the underlying field of coefficients correctly implements the operator overloads, none of this depends on the coefficients. Reusability, baby!

And we can finish off with the division algorithm for polynomials.

      @typecheck
      def __divmod__(self, divisor):
         quotient, remainder = Zero(), self
         divisorDeg = divisor.degree()
         divisorLC = divisor.leadingCoefficient()

         while remainder.degree() >= divisorDeg:
            monomialExponent = remainder.degree() - divisorDeg
            monomialZeros = [self.field(0) for _ in range(monomialExponent)]
            monomialDivisor = Polynomial(monomialZeros + [remainder.leadingCoefficient() / divisorLC])

            quotient += monomialDivisor
            remainder -= monomialDivisor * divisor

         return quotient, remainder

Indeed, we’re doing nothing here but formalizing the grade-school algorithm for doing polynomial long division [6]. And we can finish off the function for generating this class by assigning the field member variable along with a class name. And we give it a higher operator precedence than the underlying field of coefficients so that an isolated coefficient is cast up to a constant polynomial.

@memoize
def polynomialsOver(field=fractions.Fraction):

   class Polynomial(DomainElement):
      operatorPrecedence = 2

      [... methods defined above ...]

   def Zero():
      return Polynomial([])

   Polynomial.field = field
   Polynomial.__name__ = '(%s)[x]' % field.__name__
   return Polynomial

We provide a modest test suite in the Github repository for this post, but here’s a sample test:

>>> Mod5 = IntegersModP(5)
>>> Mod11 = IntegersModP(11)
>>> polysOverQ = polynomialsOver(Fraction).factory
>>> polysMod5 = polynomialsOver(Mod5).factory
>>> polysMod11 = polynomialsOver(Mod11).factory
>>> polysOverQ([1,7,49]) / polysOverQ([7])
1/7 + 1 x^1 + 7 x^2
>>> polysMod5([1,7,49]) / polysMod5([7])
3 + 1 x^1 + 2 x^2
>>> polysMod11([1,7,49]) / polysMod11([7])
8 + 1 x^1 + 7 x^2

And indeed, the extended Euclidean algorithm works without modification, so we know our typecasting is doing what’s expected:

>>> p = polynomialsOver(Mod2).factory
>>> f = p([1,0,0,0,1,1,1,0,1,1,1]) # x^10 + x^9 + x^8 + x^6 + x^5 + x^4 + 1
>>> g = p([1,0,1,1,0,1,1,0,0,1])   # x^9 + x^6 + x^5 + x^3 + x^1 + 1
>>> theGcd = p([1,1,0,1]) # x^3 + x + 1
>>> x = p([0,0,0,0,1]) # x^4
>>> y = p([1,1,1,1,1,1]) # x^5 + x^4 + x^3 + x^2 + x + 1
>>> (x,y,theGcd) == extendedEuclideanAlgorithm(f, g)
True

[5] The mathematical name for the abs() function that we’re using is a valuation.
[6] One day we will talk a lot more about polynomial long division on this blog. You can do a lot of cool algebraic geometry with it, and the ideas there lead you to awesome applications like robot motion planning and automated geometry theorem proving.

Generating Irreducible Polynomials

Now that we’ve gotten Polynomials out of the way, we need to be able to generate irreducible polynomials over \mathbb{Z}/p of any degree we want. It might be surprising that irreducible polynomials of any degree exist [7], but in fact we know a lot more.

Theorem: The product of all irreducible monic polynomials of degree dividing m is equal to x^{p^m} - x.

This is an important theorem, but it takes a little bit more field theory than we have under our belts right now. You could summarize the proof by saying there is a one-to-one correspondence between elements of a field and monic irreducible polynomials, and then you say some things about splitting fields. You can see a more detailed proof outline here, but it assumes you’re familiar with the notion of a minimal polynomial. We will probably cover this in a future primer.

But just using the theorem we can get a really nice algorithm for determining if a polynomial f(x) of degree m is irreducible: we just look at its gcd with all the x^{p^k} - x for k smaller than m. If all the gcds are constants, then we know it’s irreducible, and if even one is a non-constant polynomial then it has to be irreducible. Why’s that? Because if you have some nontrivial gcd d(x) = \gcd(f(x), x^{p^k} - x) for k < m, then it’s a factor of f(x) by definition. And since we know all irreducible monic polynomials are factors of that this collection of polynomials, if the gcd is always 1 then there are no other possible factors to be divisors. (If there is any divisor then there will be a monic irreducible one!) So the candidate polynomial must be irreducible. In fact, with a moment of thought it’s clear that we can stop at k= m/2, as any factor of large degree will necessarily require corresponding factors of small degree. So the algorithm to check for irreducibility is just this simple loop:

def isIrreducible(polynomial, p):
   ZmodP = IntegersModP(p)
   poly = polynomialsOver(ZmodP).factory
   x = poly([0,1])
   powerTerm = x
   isUnit = lambda p: p.degree() == 0

   for _ in range(int(polynomial.degree() / 2)):
      powerTerm = powerTerm.powmod(p, polynomial)
      gcdOverZmodp = gcd(polynomial, powerTerm - x)
      if not isUnit(gcdOverZmodp):
         return False

   return True

We’re just computing the powers iteratively as x^p, (x^p)^p = x^{p^2}, \dots, x^{p^j} and in each step of the loop subtracting x and computing the relevant gcd. The powmod function is just there so that we can reduce the power mod our irreducible polynomial after each multiplication, keeping the degree of the polynomial small and efficient to work with.

Now generating an irreducible polynomial is a little bit harder than testing for one. With a lot of hard work, however, field theorists discovered that irreducible polynomials are quite common. In fact, if you just generate the coefficients of your degree n monic polynomial at random, the chance that you’ll get something irreducible is at least 1/n. So this suggests an obvious randomized algorithm: keep guessing until you find one.

def generateIrreduciblePolynomial(modulus, degree):
   Zp = IntegersModP(modulus)
   Polynomial = polynomialsOver(Zp)

   while True:
      coefficients = [Zp(random.randint(0, modulus-1)) for _ in range(degree)]
      randomMonicPolynomial = Polynomial(coefficients + [Zp(1)])

      if isIrreducible(randomMonicPolynomial, modulus):
         return randomMonicPolynomial

Since the probability of getting an irreducible polynomial is close to 1/n, we expect to require n trials before we find one. Moreover we could give a pretty tight bound on how likely it is to deviate from the expected number of trials. So now we can generate some irreducible polynomials!

>>> F23 = FiniteField(2,3)
>>> generateIrreduciblePolynomial(23, 3)
21 + 12 x^1 + 11 x^2 + 1 x^3

And so now we are well-equipped to generate any finite field we want! It’s just a matter of generating the polynomial and taking a modulus after every operation.

@memoize
def FiniteField(p, m, polynomialModulus=None):
   Zp = IntegersModP(p)
   if m == 1:
      return Zp

   Polynomial = polynomialsOver(Zp)
   if polynomialModulus is None:
      polynomialModulus = generateIrreduciblePolynomial(modulus=p, degree=m)

   class Fq(FieldElement):
      fieldSize = int(p ** m)
      primeSubfield = Zp
      idealGenerator = polynomialModulus
      operatorPrecedence = 3

      def __init__(self, poly):
         if type(poly) is Fq:
            self.poly = poly.poly
         elif type(poly) is int or type(poly) is Zp:
            self.poly = Polynomial([Zp(poly)])
         elif isinstance(poly, Polynomial):
            self.poly = poly % polynomialModulus
         else:
            self.poly = Polynomial([Zp(x) for x in poly]) % polynomialModulus

         self.field = Fq

      @typecheck
      def __add__(self, other): return Fq(self.poly + other.poly)
      @typecheck
      def __sub__(self, other): return Fq(self.poly - other.poly)
      @typecheck
      def __mul__(self, other): return Fq(self.poly * other.poly)
      @typecheck
      def __eq__(self, other): return isinstance(other, Fq) and self.poly == other.poly

      def __pow__(self, n): return Fq(pow(self.poly, n))
      def __neg__(self): return Fq(-self.poly)
      def __abs__(self): return abs(self.poly)
      def __repr__(self): return repr(self.poly) + ' \u2208 ' + self.__class__.__name__

      @typecheck
      def __divmod__(self, divisor):
         q,r = divmod(self.poly, divisor.poly)
         return (Fq(q), Fq(r))

      def inverse(self):
         if self == Fq(0):
            raise ZeroDivisionError

         x,y,d = extendedEuclideanAlgorithm(self.poly, self.idealGenerator)
         return Fq(x) * Fq(d.coefficients[0].inverse())

   Fq.__name__ = 'F_{%d^%d}' % (p,m)
   return Fq

And some examples of using it:

>>> F23 = FiniteField(2,3)
>>> x = F23([1,1])
>>> x
1 + 1 x^1 ∈ F_{2^3}
>>> x*x
1 + 0 x^1 + 1 x^2 ∈ F_{2^3}
>>> x**10
0 + 0 x^1 + 1 x^2 ∈ F_{2^3}
>>> 1 / x
0 + 1 x^1 + 1 x^2 ∈ F_{2^3}
>>> x * (1 / x)
1 ∈ F_{2^3}
>>> k = FiniteField(23, 4)
>>> k.fieldSize
279841
>>> k.idealGenerator
6 + 8 x^1 + 10 x^2 + 10 x^3 + 1 x^4
>>> y
9 + 21 x^1 + 14 x^2 + 12 x^3 ∈ F_{23^4}
>>> y*y
13 + 19 x^1 + 7 x^2 + 14 x^3 ∈ F_{23^4}
>>> y**5 - y
15 + 22 x^1 + 15 x^2 + 5 x^3 ∈ F_{23^4}

And that’s it! Now we can do arithmetic over any finite field we want.

[7] Especially considering that other wacky things happen like this: x^4 +1 is reducible over every finite field!

Some Words on Efficiency

There are a few things that go without stating about this program, but I’ll state them anyway.

The first is that we pay a big efficiency price for being so generic. All the typecasting we’re doing isn’t cheap, and in general cryptography needs to be efficient. For example, if I try to create a finite field of order 104729^{20}, it takes about ten to fifteen seconds to complete. This might not seem so bad for a one-time initialization, but it’s clear that our representation is somewhat bloated. We would display a graph of the expected time to perform various operations in various finite fields, but this post is already way too long.

In general, the larger and more complicated the polynomial you use to define your finite field, the longer operations will take (since dividing by a complicated polynomial is more expensive than dividing by a simple polynomial). For this reason and a few other reasons, a lot of research has gone into efficiently finding irreducible polynomials with, say, only three nonzero terms. Moreover, if we know in advance that we’ll only work over fields of characteristic two we can make a whole lot of additional optimizations. Essentially, all of the arithmetic reduces to really fast bitwise operations, and things like exponentiation can easily be implemented in hardware. But it also seems that the expense coming with large field characteristics corresponds to higher security, so some researchers have tried to meet in the middle an get efficient representations of other field characteristics.

In any case, the real purpose of our implementation in this post is not for efficiency. We care first and foremost about understanding the mathematics, and to have a concrete object to play with and use in the future for other projects. And we have accomplished just that.

Until next time!

Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability p. If X denotes the number of trials needed to get the first success, then clearly \Pr(X = k) = (1-p)^{k-1} p (since first we need k-1 failures which occur independently with probability 1-p, then we need one success which happens with probability p). Thus the expected value of X is

\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p

by basic calculus. In particular, if success is defined as getting a six, then p=1/6 thus the expected time is 1/p=6.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is \frac{1}{26}, thus the probability of a random eleven-letter word being ABRACADABRA is exactly \left(\frac{1}{26}\right)^{11}. So if typing 11 letters is one trial, the expected number of trials is

\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}

which means 11\cdot 26^{11} keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that 11\cdot 26^{11} is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly \frac{1}{26} (thus the expected value of the change in the gambler’s fortune is \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as T. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly T dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins \$26^4. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be \$26^{11}. Thus our casino will have to give out 26^{11}+26^4+26 dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is T dollars, the expected value of our expenses is 26^{11}+26^4+26 dollars, thus E(T)=26^{11}+26^4+26. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. X_0 will denote the gambler’s fortune before the game starts, X_1 the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of X_n, given X_1, X_2, \ldots, X_{n-1} is the same as X_{n-1}. This can be written as E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1} or, equivalently, E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let X'_n = X_n if n < T and X'_n = X_T if n \ge T where T denotes the stopping time, i.e. the time at which the desired event occurs. Notice that T itself is a random variable.

We require our stopping time T to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let X_n be a martingale stopped at step T, and suppose one of the following three conditions hold:

  1. The stopping time T is almost surely bounded by some constant;
  2. The stopping time T is almost surely finite and every step of the stopped martingale X_n is almost surely bounded by some constant; or
  3. The expected stopping time E(T) is finite and the absolute value of the martingale increments |X_n-X_{n-1}| are almost surely bounded by a constant.

Then E(X_T) = E(X_0).

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than 26^{12}) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by 26^{11}+26^4+26. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has a dollars, the second player has b dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let X_n denote the change in the second player’s fortune, and set X_0 = 0. Let T_k denote the first time s when X_s = k. Then our first question can be formalized as trying to determine \Pr(T_{-b} < T_a). Let t = \min \{ T_{-b}, T_a\}. Clearly t is a stopping time. By the optional stopping theorem we have that

\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))

thus \Pr(T_{-b} < T_a)=\frac{a}{a+b}.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: X_n^2-n is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with k-SAT for some k \ne 3 instead? Clearly the hardness of the problem is monotone increasing in k since k-SAT is a special case of (k+1)-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as k-SAT for any k>3. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals x and y are connected by a directed edge if there is a clause (\bar x \lor y). Recall that \bar x \lor y is equivalent to x \implies y, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths x \to \bar x and \bar x \to x (since x \Leftrightarrow \bar x is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after O(n^2) rounds where n denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in O(n^2) steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by \pm 1 in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least \frac{1}{2}. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond n. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number n, on the other hand, is called a reflecting barrier: we cannot reach n+1, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in O(n^2) steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

Optimism in the Face of Uncertainty: the UCB1 Algorithm

startups

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of K actions labeled by the integers \left \{ 1, 2, \dots, K \right \}. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

Exploit away, you lucky ladies.

Exploit away, you lucky ladies.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action i be drawn from a fixed distribution D_i (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm A and a set of actions \left \{1, 2, \dots, K \right \}, the cumulative regret of A in rounds 1, \dots, T is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of A for the first T rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers \left \{ 1, \dots, K \right \}. The reward of action i is a [0,1]-valued random variable X_i distributed according to an unknown distribution and possessing an unknown expected value \mu_i. The game progresses in rounds t = 1, 2, \dots so that in each round we have different random variables X_{i,t} for the reward of action i in round t (in particular, X_{i,t} and X_{i,s} are identically distributed). The X_{i,t} are independent as both t and i vary, although when i varies the distribution changes.

So if we were to play action 2 over and over for T rounds, then the total payoff would be the random variable G_2(T) = \sum_{t=1}^T X_{2,t}. But by independence across rounds and the linearity of expectation, the expected payoff is just \mu_2 T. So we can describe the best action as the action with the highest expected payoff. Define

\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i

We call the action which achieves the maximum i^*.

A policy is a randomized algorithm A which picks an action in each round based on the history of chosen actions and observed rewards so far. Define I_t to be the action played by A in round t and P_i(n) to be the number of times we’ve played action i in rounds 1 \leq t \leq n. These are both random variables. Then the cumulative payoff for the algorithm A over the first T rounds, denoted G_A(T), is just

\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}

and its expected value is simply

\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T)).

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy A after the first T steps, denoted R_A(T) can be written as

\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T)).

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than \Omega(\sqrt{KT}). It is also known that there are algorithms that guarantee no worse than O(\sqrt{KT}) expected regret. The algorithm we’ll see in the next section, however, only guarantees O(\sqrt{KT \log T}). We present it on this blog because of its simplicity and ubiquity in the field.

The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action \mu_i is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose Y_1, \dots, Y_n are independent random variables whose values lie in [0,1] and whose expected values are \mu_i. Call Y = \frac{1}{n}\sum_{i}Y_i and \mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of Y deviates from its mean. Specifically,

\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}

For us, the Y_i will be the payoff variables for a single action j in the rounds for which we choose action j. Then the variable Y is just the empirical average payoff for action j over all the times we’ve tried it. Moreover, a is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for a to find an upper bound big enough to be confident that we’re within a of the true mean.

Indeed, if we call n_j the number of times we played action j thus far, then n = n_j in the equation above, and using a = a(j,T) = \sqrt{2 \log(T) / n_j} we get that \textup{P}(Y > \mu + a) \leq T^{-4}, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action j, its upper bound a grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound a(j, T) is getting bigger for actions we haven’t played; any round in which we play an action j, it must be that a(j, T+1) = a(j,T), although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the K actions once, giving initial values for empirical mean payoffs \overline{x}_i of each action i.
For each round t = K, K+1, \dots:

Let n_j represent the number of times action j was played so far.
Play the action j maximizing \overline{x}_j + \sqrt{2 \log t / n_j}.
Observe the reward X_{j,t} and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means x_j change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with K actions, each of whose reward distribution X_{i,t} has values in [0,1]. Then its expected cumulative regret after T rounds is at most O(\sqrt{KT \log T}).

Actually, we’ll prove a more specific theorem. Let \Delta_i be the difference \mu^* - \mu_i, where \mu^* is the expected payoff of the best action, and let \Delta be the minimal nonzero \Delta_i. That is, \Delta_i represents how suboptimal an action is and \Delta is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret \mathbb{E}(R_{\textup{UCB1}}(T)) is at most

\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if \Delta_i is small we will require more tries to know that action i is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the 1 + \pi^2 / 3 part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the O(\sqrt{KT \log T}) bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function R(\Delta_1, \dots, \Delta_K) (excluding of course \Delta^*), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}

and it’s zero if and only if for each i, \Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T}). However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the \Delta_i = O(\sqrt{\log T}) (which are all the same) gives a total bound of O(K \sqrt{\log T}). If we look at the only possible endpoint (the \Delta_i = 1), then we get a local maximum of O(K \sqrt{\log T}). But this isn’t the O(\sqrt{KT \log T}) we promised, what gives? Well, this upper bound grows arbitrarily large as the \Delta_i go to zero. But at the same time, if all the \Delta_i are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the \Delta_i = \Delta are the same, then another trivial regret bound is \Delta T (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at \Delta = O(\sqrt{(K \log T) / T}). Hence the regret bound at the switch is \Delta T = O(\sqrt{KT \log T}), as desired.

Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on P_i(T), the expected number of times UCB chooses an action up to round T. Using the \Delta notation, the regret is then just \sum_i \Delta_i \mathbb{E}(P_i(T)), and bounding the P_i‘s will bound the regret.

Recall the notation for our upper bound a(j, T) = \sqrt{2 \log T / P_j(T)} and let’s loosen it a bit to a(y, T) = \sqrt{2 \log T / y} so that we’re allowed to “pretend” a action has been played y times. Recall further that the random variable I_t has as its value the index of the machine chosen. We denote by \chi(E) the indicator random variable for the event E. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., \overline{x}^* is the empirical mean of the optimal action).

Indeed for any action i, the only way we know how to write down P_i(T) is as

\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action i. Now we’re just going to pull some number m-1 of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than m times overall, this requires an inequality.

P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)

These indicator functions should be read as sentences: we’re just saying that we’re picking action i in round t and we’ve already played i at least m times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action i in round t means that the upper bound for action i exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and i might coincide with the best action, but that’s fine). In notation this event is

\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)

Denote the upper bound \overline{x}_i + a(i,t) for action i in round t by U_i(t). Since this event must occur every time we pick action i (though not necessarily vice versa), we have

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)

We’ll do this process again but with a slightly more complicated event. If the upper bound of action i exceeds that of the optimal machine, it is also the case that the maximum upper bound for action i we’ve seen after the first m trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round t we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine i (except that it’s more than m). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by \overline{x}_{i,s} the random variable for the empirical mean after playing action i a total of s times, and \overline{x}^*_s the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )

Indeed, at each t for which the max is greater than the min, there will be at least one pair s,s' for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs s, s' for which it happens. That is, we can expand the event above into the double sum which is at least as large:

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )

We can make one other odd inequality by increasing the sum to go from t=1 to \infty. This will become clear later, but it means we can replace t-1 with t and thus have

\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t). Then what can we say? Well, consider the following three events:

(1) \displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)
(2) \displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)
(3) \displaystyle \mu^* < \mu_i + 2a(s, t)

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability t^{-4}. Likewise, (2) is the event that the empirical mean payoff of action i is larger than the upper confidence bound, which also occurs with probability t^{-4}. We will see momentarily that (3) is impossible for a well-chosen m (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}

and

\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}

But putting these two inequalities together gives us precisely that (3) is true:

\mu^* < \mu_i + 2a(s,t)

This proves the claim.

By the union bound, the probability that at least one of these events happens is 2t^{-4} plus whatever the probability of (3) being true is. But as we said, we’ll pick m to make (3) always false. Indeed m depends on which action i is being played, and if s \geq m > 8 \log T / \Delta_i^2 then 2a(s,t) \leq \Delta_i, and by the definition of \Delta_i we have

\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

\square

Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
   return math.sqrt(2 * math.log(step + 1) / numPlays)

And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
   payoffSums = [0] * numActions
   numPlays = [1] * numActions
   ucbs = [0] * numActions

   # initialize empirical sums
   for t in range(numActions):
      payoffSums[t] = reward(t,t)
      yield t, payoffSums[t], ucbs

   t = numActions

   while True:
      ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
      action = max(range(numActions), key=lambda i: ucbs[i])
      theReward = reward(action, t)
      numPlays[action] += 1
      payoffSums[action] += theReward

      yield action, theReward, ucbs
      t = t + 1

The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from [0,1], biased by 1/k for some 5 \leq k \leq 15. The regret and theoretical regret bound are given in the graph below.

ucb1-simple-example

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!