When Greedy Algorithms are Good Enough: Submodularity and the (1 – 1/e)-Approximation

Greedy algorithms are among the simplest and most intuitive algorithms known to humans. Their name essentially gives their description: do the thing that looks best right now, and repeat until nothing looks good anymore or you’re forced to stop. Some of the best situations in computer science are also when greedy algorithms are optimal or near-optimal. There is a beautiful theory of this situation, known as the theory of matroids. We haven’t covered matroids on this blog (at some point we will), but in this post we will focus on the next best thing: when the greedy algorithm guarantees a reasonably good approximation to the optimal solution.

This situation isn’t hard to formalize, and we’ll make it as abstract as possible. Say you have a set of objects X, and you’re looking to find the “best” subset S \subset X. Here “best” is just measured by a fixed (known, efficiently computable) objective function f : 2^X \to \mathbb{R}. That is, f accepts as input subsets of X and outputs numbers so that better subsets have larger numbers. Then the goal is to find a subset maximizing X.

In this generality the problem is clearly impossible. You’d have to check all subsets to be sure you didn’t miss the best one. So what conditions do we need on either X or f or both that makes this problem tractable? There are plenty you could try, but one very rich property is submodularity.

The Submodularity Condition

I think the simplest way to explain submodularity is in terms of coverage. Say you’re starting a new radio show and you have to choose which radio stations to broadcast from to reach the largest number of listeners. For simplicity say each radio station has one tower it broadcasts from, and you have a good estimate of the number of listeners you would reach if you broadcast from a given tower. For more simplicity, say it costs the same to broadcast from each tower, and your budget restricts you to a maximum of ten stations to broadcast from. So the question is: how do you pick towers to maximize your overall reach?

The hidden condition here is that some towers overlap in which listeners they reach. So if you broadcast from two towers in the same city, a listener who has access to both will just pick one or the other. In other words, there’s a diminished benefit to picking two overlapping towers if you already have chosen one.

In our version of the problem, picking both of these towers has some small amount of "overkill."

In our version of the problem, picking both of these towers has some small amount of “overkill.”

This “diminishing returns” condition is a general idea you can impose on any function that takes in subsets of a given set and produces numbers. If X is a set then for what seems like a strange reason we denote the set of all subsets of X by 2^X. So we can state this condition more formally,

Definition: Let X be a finite set. A function f: 2^X \to \mathbb{R} is called submodular if for all subsets S \subset T \subset X and all x \in X \setminus T,

\displaystyle f(S \cup \{ x \}) - f(S) \geq f(T \cup \{ x \}) - f(T)

In other words, if f measures “benefit,” then the marginal benefit of adding x to S is at least as high as the marginal benefit of adding it to T. Since S \subset T and x are all arbitrary, this is as general as one could possibly make it.

Before we start doing things with submodular functions, let’s explore some basic properties. The first is an equivalent definition of submodularity

Proposition: f is submodular if and only if for all A, B \subset X, it holds that

\displaystyle f(A \cap B) + f(A \cup B) \leq f(A) + f(B).

Proof. If we assume f has the condition from this proposition, then we can set A=T, B=S \cup \{ x \}, and the formula just works out. Conversely, if we have the condition from the definition, then using the fact that A \cap B \subset B we can inductively apply the inequality to each element of A \setminus B to get

\displaystyle f(A \cup B) - f(B) \leq f(A) - f(A \cap B)

\square

Next, we can tweak and combine submodular functions to get more submodular functions. In particular, non-negative linear combinations of sub-modular functions are submodular. In other words, if f_1, \dots, f_k are submodular on the same set X, and \alpha_1, \dots, \alpha_k are all non-negative reals, then \alpha_1 f_1 + \dots + \alpha_k f_k is also a submodular function on X. It’s an easy exercise in applying the definition to see why this is true. This is important because when we’re designing objectives to maximize, we can design them by making some simple submodular pieces, and then picking an appropriate combination of those pieces.

The second property we need to impose on a submodular function is monotonicity. That is, as your sets get more elements added to them, their value under f only goes up. In other words, f is monotone when S \subset T then f(S) \leq f(T). An interesting property of functions that are both submodular and monotone is that the truncation of such a function is also submodular and monotone. In other words, \textup{min}(f(S), c) is still submodular when f is monotone submodular and c is a constant.

Submodularity and Monotonicity Give 1 – 1/e

The wonderful thing about submodular functions is that we have a lot of great algorithmic guarantees for working with them. We’ll prove right now that the coverage problem (while it might be hard to solve in general) can be approximated pretty well by the greedy algorithm.

Here’s the algorithmic setup. I give you a finite set X and an efficient black-box to evaluate f(S) for any subset S \subset X you want. I promise you that f is monotone and submodular. Now I give you an integer k between 1 and the size of X, and your task is to quickly find a set S of size k for which f(S) is maximal among all subsets of size k. That is, you design an algorithm that will work for any k, X, f and runs in polynomial time in the sizes of X, k.

In general this problem is NP-hard, meaning you’re not going to find a solution that works in the worst case (if you do, don’t call me; just claim your million dollar prize). So how well can we approximate the optimal value for f(S) by a different set of size k? The beauty is that, if your function is monotone and submodular, you can guarantee to get within 63% of the optimum. The hope (and reality) is that in practice it will often perform much better, but still this is pretty good! More formally,

Theorem: Let f be a monotone, submodular, non-negative function on X. The greedy algorithm, which starts with S as the empty set and at every step picks an element x which maximizes the marginal benefit f(S \cup \{ x \}) - f(S), provides a set S that achieves a (1- 1/e)-approximation of the optimum.

We’ll prove this in just a little bit more generality, and the generality is quite useful. If we call S_1, S_2, \dots, S_l the sets chosen by the greedy algorithm (where now we might run the greedy algorithm for l > k steps), then for all l, k, we have

\displaystyle f(S_l) \geq \left ( 1 - e^{-l/k} \right ) \max_{T: |T| \leq k} f(T)

This allows us to run the algorithm for more than k steps to get a better approximation by sets of larger size, and quantify how much better the guarantee on that approximation would be. It’s like an algorithmic way of hedging your risk. So let’s prove it.

Proof. Let’s set up some notation first. Fix your l and k, call S_i the set chosen by the greedy algorithm at step i, and call S^* the optimal subset of size k. Further call \textup{OPT} the value of the best set f(S^*). Call x_1^*, \dots, x_k^* the elements of S^* (the order is irrelevant). Now for every i < l monotonicity gives us f(S^*) \leq f(S^* \cup S_i). We can unravel this into a sum of marginal gains of adding single elements. The first step is

\displaystyle f(S^* \cup S_i) = f(S^* \cup S_i) - f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i) + f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i)

The second step removes x_{k-1}^*, from the last term, the third removes x_{k-2}^*, and so on until we have removed all of S^* and get this sum

\displaystyle f(S^* \cup S_i) = f(S_i) + \sum_{j=1}^k \left ( f(S_i \cup \{ x_1^*, \dots, x_j^* \}) - f(S_i \cup \{ x_1^*, \dots, x_{j-1}^* \} ) \right )

Now, applying submodularity, we can change all of these marginal benefits of “adding one more S^* element to S_i already with some S^* stuff” to “adding one more S^* element to just S_i.” In symbols, the equation above is at most

\displaystyle f(S_i) + \sum_{x \in S^*} f(S_i \cup \{ x \}) - f(S_i)

and because S_{i+1} is greedily chosen to maximize the benefit of adding a single element, so the above is at most

\displaystyle f(S_i) + \sum_{x \in S^*} f(S_{i+1}) - f(S_i) = f(S_i) + k(f(S_{i+1}) - f(S_i))

Chaining all of these together, we have f(S^*) - f(S_i) \leq k(f(S_{i+1}) - f(S_i)). If we call a_{i} = f(S^*) - f(S_i), then this inequality can be rewritten as a_{i+1} \leq (1 - 1/k) a_{i}. Now by induction we can relate a_l \leq (1 - 1/k)^l a_0. Now use the fact that a_0 \leq f(S^*) and the common inequality 1-x \leq e^{-x} to get

\displaystyle a_l = f(S^*) - f(S_l) \leq e^{-l/k} f(S^*)

And rearranging gives f(S_l) \geq (1 - e^{-l/k}) f(S^*).

\square

Setting l=k gives the approximation bound we promised. But note that allowing the greedy algorithm to run longer can give much stronger guarantees, though it requires you to sacrifice the cardinality constraint. 1 - 1/e is about 63%, but doubling the size of S gives about an 86% approximation guarantee. This is great for people in the real world, because you can quantify the gains you’d get by relaxing the constraints imposed on you (which are rarely set in stone).

So this is really great! We have quantifiable guarantees on a stupidly simple algorithm, and the setting is super general. And so if you have your problem and you manage to prove your function is submodular (this is often the hardest part), then you are likely to get this nice guarantee.

Extensions and Variations

This result on monotone submodular functions is just one part of a vast literature on finding approximation algorithms for submodular functions in various settings. In closing this post we’ll survey some of the highlights and provide references.

What we did in this post was maximize a monotone submodular function subject to a cardinality constraint |S| \leq k. There are three basic variations we could do: we could drop constraints and see whether we can still get guarantees, we could look at minimization instead of maximization, and we could modify the kinds of constraints we impose on the solution.

There are a ton of different kinds of constraints, and we’ll discuss two. The first is where you need to get a certain value f(S) \geq q, and you want to find the smallest set that achieves this value. Laurence Wolsey (who proved a lot of these theorems) showed in 1982 that a slight variant of the greedy algorithm can achieve a set whose size is a multiplicative factor of 1 + \log (\max_x f(\{ x \})) worse than the optimum.

The second kind of constraint is a generalization of a cardinality constraint called a knapsack constraint. This means that each item x \in X has a cost, and you have a finite budget with which to spend on elements you add to S. One might expect this natural extension of the greedy algorithm to work: pick the element which maximizes the ratio of increasing the value of f to the cost (within your available budget). Unfortunately this algorithm can perform arbitrarily poorly, but there are two fun caveats. The first is that if you do both this augmented greedy algorithm and the greedy algorithm that ignores costs, then at least one of these can’t do too poorly. Specifically, one of them has to get at least a 30% approximation. This was shown by Leskovec et al in 2007. The second is that if you’re willing to spend more time in your greedy step by choosing the best subset of size 3, then you can get back to the 1-1/e approximation. This was shown by Sviridenko in 2004.

Now we could try dropping the monotonicity constraint. In this setting cardinality constraints are also superfluous, because it could be that the very large sets have low values. Now it turns out that if f has no other restrictions (in particular, if it’s allowed to be negative), then even telling whether there’s a set S with f(S) > 0 is NP-hard, but the optimum could be arbitrarily large and positive when it exists. But if you require that f is non-negative, then you can get a 1/3-approximation, if you’re willing to add randomness you can get 2/5 in expectation, and with more subtle constraints you can get up to a 1/2 approximation. Anything better is NP-hard. Fiege, Mirrokni, and Vondrak have a nice FOCS paper on this.

Next, we could remove the monotonicity property and try to minimize the value of f(S). It turns out that this problem always has an efficient solution, but the only algorithm I have heard of to solve it involves a very sophisticated technique called the ellipsoid algorithm. This is heavily related to linear programming and convex optimization, something which I hope to cover in more detail on this blog.

Finally, there are many interesting variations in the algorithmic procedure. For example, one could require that the elements are provided in some order (the streaming setting), and you have to pick at each step whether to put the element in your set or not. Alternatively, the objective functions might not be known ahead of time and you have to try to pick elements to jointly maximize them as they are revealed. These two settings have connections to bandit learning problems, which we’ve covered before on this blog. See this survey of Krause and Golovin for more on the connections, which also contains the main proof used in this post.

Indeed, despite the fact that many of the big results were proved in the 80’s, the analysis of submodular functions is still a big research topic. There was even a paper posted just the other day on the arXiv about it’s relation to ad serving! And wouldn’t you know, they proved a (1-1/e)-approximation for their setting. There’s just something about 1-1/e.

Until next time!

About these ads

Community Detection in Graphs — a Casual Tour

Graphs are among the most interesting and useful objects in mathematics. Any situation or idea that can be described by objects with connections is a graph, and one of the most prominent examples of a real-world graph that one can come up with is a social network.

Recall, if you aren’t already familiar with this blog’s gentle introduction to graphs, that a graph G is defined by a set of vertices V, and a set of edges E, each of which connects two vertices. For this post the edges will be undirected, meaning connections between vertices are symmetric.

One of the most common topics to talk about for graphs is the notion of a community. But what does one actually mean by that word? It’s easy to give an informal definition: a subset of vertices C such that there are many more edges between vertices in C than from vertices in C to vertices in V - C (the complement of C). Try to make this notion precise, however, and you open a door to a world of difficult problems and open research questions. Indeed, nobody has yet come to a conclusive and useful definition of what it means to be a community. In this post we’ll see why this is such a hard problem, and we’ll see that it mostly has to do with the word “useful.” In future posts we plan to cover some techniques that have found widespread success in practice, but this post is intended to impress upon the reader how difficult the problem is.

The simplest idea

The simplest thing to do is to say a community is a subset of vertices which are completely connected to each other. In the technical parlance, a community is a subgraph which forms a clique. Sometimes an n-clique is also called a complete graph on n vertices, denoted K_n. Here’s an example of a 5-clique in a larger graph:

"Where's Waldo" for graph theorists: a clique hidden in a larger graph.

“Where’s Waldo” for graph theorists: a clique hidden in a larger graph.

Indeed, it seems reasonable that if we can reliably find communities at all, then we should be able to find cliques. But as fate should have it, this problem is known to be computationally intractable. In more detail, the problem of finding the largest clique in a graph is NP-hard. That essentially means we don’t have any better algorithms to find cliques in general graphs than to try all possible subsets of the vertices and check to see which, if any, form cliques. In fact it’s much worse, this problem is known to be hard to approximate to any reasonable factor in the worst case (the error of the approximation grows polynomially with the size of the graph!). So we can’t even hope to find a clique half the size of the biggest, or a thousandth the size!

But we have to take these impossibility results with a grain of salt: they only say things about the worst case graphs. And when we’re looking for communities in the real world, the worst case will never show up. Really, it won’t! In these proofs, “worst case” means that they encode some arbitrarily convoluted logic problem into a graph, so that finding the clique means solving the logic problem. To think that someone could engineer their social network to encode difficult logic problems is ridiculous.

So what about an “average case” graph? To formulate this typically means we need to consider graphs randomly drawn from a distribution.

Random graphs

The simplest kind of “randomized” graph you could have is the following. You fix some set of vertices, and then run an experiment: for each pair of vertices you flip a coin, and if the coin is heads you place an edge and otherwise you don’t. This defines a distribution on graphs called G(n, 1/2), which we can generalize to G(n, p) for a coin with bias p. With a slight abuse of notation, we call G(n, p) the Erdős–Rényi random graph (it’s not a graph but a distribution on graphs). We explored this topic form a more mathematical perspective earlier on this blog.

So we can sample from this distribution and ask questions like: what’s the probability of the largest clique being size at least 20? Indeed, cliques in Erdős–Rényi random graphs are so well understood that we know exactly how they work. For example, if p=1/2 then the size of the largest clique is guaranteed (with overwhelming probability as n grows) to have size k(n) or k(n)+1, where k(n) is about 2 \log n. Just as much is known about other values of p as well as other properties of G(n,p), see Wikipedia for a short list.

In other words, if we wanted to find the largest clique in an Erdős–Rényi random graph, we could check all subsets of size roughly 2\log(n), which would take about (n / \log(n))^{\log(n)} time. This is pretty terrible, and I’ve never heard of an algorithm that does better (contrary to the original statement in this paragraph that showed I can’t count). In any case, it turns out that the Erdős–Rényi random graph, and using cliques to represent communities, is far from realistic. There are many reasons why this is the case, but here’s one example that fits with the topic at hand. If I thought the world’s social network was distributed according to G(n, 1/2) and communities were cliques, then I would be claiming that the largest community is of size 65 or 66. Estimated world population: 7 billion, 2 \log(7 \cdot 10^9) \sim 65. Clearly this is ridiculous: there are groups of larger than 66 people that we would want to call “communities,” and there are plenty of communities that don’t form bona-fide cliques.

Another avenue shows that things are still not as easy as they seem in Erdős–Rényi land. This is the so-called planted clique problem. That is, you draw a graph G from G(n, 1/2). You give G to me and I pick a random but secret subset of r vertices and I add enough edges to make those vertices form an r-clique. Then I ask you to find the r-clique. Clearly it doesn’t make sense when r < 2 \log (n) because you won’t be able to tell it apart from the guaranteed cliques in G. But even worse, nobody knows how to find the planted clique when r is even a little bit smaller than \sqrt{n} (like, r = n^{9/20} even). Just to solidify this with some numbers, we don’t know how to reliably find a planted clique of size 60 in a random graph on ten thousand vertices, but we do when the size of the clique goes up to 100. The best algorithms we know rely on some sophisticated tools in spectral graph theory, and their details are beyond the scope of this post.

So Erdős–Rényi graphs seem to have no hope. What’s next? There are a couple of routes we can take from here. We can try to change our random graph model to be more realistic. We can relax our notion of communities from cliques to something else. We can do both, or we can do something completely different.

Other kinds of random graphs

There is an interesting model of Barabási and Albert, often called the “preferential attachment” model, that has been described as a good model of large, quickly growing networks like the internet. Here’s the idea: you start off with a two-clique G = K_2, and at each time step t you add a new vertex v to G, and new edges so that the probability that the edge (v,w) is added to G is proportional to the degree of w (as a fraction of the total number of edges in G). Here’s an animation of this process:

Image source: Wikipedia

The significance of this random model is that it creates graphs with a small number of hubs, and a large number of low-degree vertices. In other words, the preferential attachment model tends to “make the rich richer.” Another perspective is that the degree distribution of such a graph is guaranteed to fit a so-called power-law distribution. Informally, this means that the overall fraction of small-degree vertices gives a significant contribution to the total number of edges. This is sometimes called a “fat-tailed” distribution. Since power-law distributions are observed in a wide variety of natural settings, some have used this as justification for working in the preferential attachment setting. On the other hand, this model is known to have no significant community structure (by any reasonable definition, certainly not having cliques of nontrivial size), and this has been used as evidence against the model. I am not aware of any work done on planting dense subgraphs in graphs drawn from a preferential attachment model, but I think it’s likely to be trivial and uninteresting. On the other hand, Bubeck et al. have looked at changing the initial graph (the “seed”) from a 2-clique to something else, and seeing how that affects the overall limiting distribution.

Another model that often shows up is a model that allows one to make a random graph starting with any fixed degree distribution, not just a power law. There are a number of models that do this to some fashion, and you’ll hear a lot of hyphenated names thrown around like Chung-Lu and Molloy-Reed and Newman-Strogatz-Watts. The one we’ll describe is quite simple. Say you start with a set of vertices V, and a number d_v for each vertex v, such that the sum of all the d_v is even. This condition is required because in any graph the sum of the degrees of a vertex is twice the number of edges. Then you imagine each vertex v having d_v “edge-stubs.” The name suggests a picture like the one below:

Each node has a prescribed number of "edge stubs," which are randomly connected to form a graph.

Each node has a prescribed number of “edge stubs,” which are randomly connected to form a graph.

Now you pick two edge stubs at random and connect them. One usually allows self-loops and multiple edges between vertices, so that it’s okay to pick two edge stubs from the same vertex. You keep doing this until all the edge stubs are accounted for, and this is your random graph. The degrees were fixed at the beginning, so the only randomization is in which vertices are adjacent. The same obvious biases apply, that any given vertex is more likely to be adjacent to high-degree vertices, but now we get to control the biases with much more precision.

The reason such a model is useful is that when you’re working with graphs in the real world, you usually have statistical information available. It’s simple to compute the degree of each vertex, and so you can use this random graph as a sort of “prior” distribution and look for anomalies. In particular, this is precisely how one of the leading measures of community structure works: the measure of modularity. We’ll talk about this in the next section.

Other kinds of communities

Here’s one easy way to relax our notion of communities. Rather than finding complete subgraphs, we could ask about finding very dense subgraphs (ignoring what happens outside the subgraph). We compute density as the average degree of vertices in the subgraph.

If we impose no bound on the size of the subgraph an algorithm is allowed to output, then there is an efficient algorithm for finding the densest subgraph in a given graph. The general exact solution involves solving a linear programming problem and a little extra work, but luckily there is a greedy algorithm that can get within half of the optimal density. You start with all the vertices S_n = V, and remove any vertex of minimal degree to get S_{n-1}. Continue until S_0, and then compute the density of all the S_i. The best one is guaranteed to be at least half of the optimal density. See this paper of Moses Charikar for a more formal analysis.

One problem with this is that the size of the densest subgraph might be too big. Unfortunately, if you fix the size of the dense subgraph you’re looking for (say, you want to find the densest subgraph of size at most k where k is an input), then the problem once again becomes NP-hard and suffers from the same sort of inapproximability theorems as finding the largest clique.

A more important issue with this is that a dense subgraph isn’t necessarily a community. In particular, we want communities to be dense on the inside and sparse on the outside. The densest subgraph analysis, however, might rate the following graph as one big dense subgraph instead of two separately dense communities with some modest (but not too modest) amount of connections between them.

What should be identified as communities?

What are the correct communities here?

Indeed, we want a quantifiable a notion of “dense on the inside and sparse on the outside.” One such formalization is called modularity. Modularity works as follows. If you give me some partition of the vertices of G into two sets, modularity measures how well this partition reflects two separate communities. It’s the definition of “community” here that makes it interesting. Rather than ask about densities exactly, you can compare the densities to the expected densities in a given random graph model.

In particular, we can use the fixed-degree distribution model from the last section. If we know the degrees of all the vertices ahead of time, we can compute the probability that we see some number of edges going between the two pieces of the partition relative to what we would see at random. If the difference is large (and largely biased toward fewer edges across the partition and more edges within the two subsets), then we say it has high modularity. This involves a lot of computations  — the whole measure can be written as a quadratic form via one big matrix — but the idea is simple enough. We intend to write more about modularity and implement the algorithm on this blog, but the excited reader can see the original paper of M.E.J. Newman.

Now modularity is very popular but it too has shortcomings. First, even though you can compute the modularity of a given partition, there’s still the problem of finding the partition that globally maximizes modularity. Sadly, this is known to be NP-hard. Mover, it’s known to be NP-hard even if you’re just trying to find a partition into two pieces that maximizes modularity, and even still when the graph is regular (every vertex has the same degree).

Still worse, while there are some readily accepted heuristics that often “do well enough” in practice, we don’t even know how to approximate modularity very well. Bhaskar DasGupta has a line of work studying approximations of maximum modularity, and he has proved that for dense graphs you can’t even approximate modularity to within any constant factor. That is, the best you can do is have an approximation that gets worse as the size of the graph grows. It’s similar to the bad news we had for finding the largest clique, but not as bad. For example, when the graph is sparse it’s known that one can approximate modularity to within a \log(n) factor of the optimum, where n is the number of vertices of the graph (for cliques the factor was like n^c for some c, and this is drastically worse).

Another empirical issue is that modularity seems to fail to find small communities. That is, if your graph has some large communities and some small communities, strictly maximizing the modularity is not the right thing to do. So we’ve seen that even the leading method in the field has some issues.

Something completely different

The last method I want to sketch is in the realm of “something completely different.” The notion is that if we’re given a graph, we can run some experiment on the graph, and the results of that experiment can give us insight into where the communities are.

The experiment I’m going to talk about is the random walk. That is, say you have a vertex v in a graph G and you want to find some vertices that are “closest” to v. That is, those that are most likely to be in the same community as v. What you can do is run a random walk starting at v. By a “random walk” I mean you start at v, you pick a neighbor at random and move to it, then repeat. You can compute statistics about the vertices you visit in a sample of such walks, and the vertices that you visit most often are those you say are “in the same community as v. One important parameter is how long the walk is, but it’s generally believed to be best if you keep it between 3-6 steps.

Of course, this is not a partition of the vertices, so it’s not a community detection algorithm, but you can turn it into one. Run this process for each vertex, and use it to compute a “distance” between all the pairs of vertices. Then you compute a tree of partitions by lumping the closest pairs of vertices into the same community, one at a time, until you’ve got every vertex. At each step of the way, you compute the modularity of the partition, and when you’re done you choose the partition that maximizes modularity. This algorithm as a whole is called the walktrap clustering algorithm, and was introduced by Pons and Latapy in 2005.

This sounds like a really great idea, because it’s intuitive: there’s a relatively high chance that the friends of your friends are also your friends. It’s also really great because there is an easily measurable tradeoff between runtime and quality: you can tune down the length of the random walk, and the number of samples you take for each vertex, to speed up the runtime but lower the quality of your statistical estimates. So if you’re working on huge graphs, you get a lot of control and a clear idea of exactly what’s going on inside the algorithm (something which is not immediately clear in a lot of these papers).

Unfortunately, I’m not aware of any concrete theoretical guarantees for walktrap clustering. The one bit of theoretical justification I’ve read over the last year is that you can relate the expected distances you get to certain spectral properties of the graph that are known to be related to community structure, but the lower bounds on maximizing modularity already suggest (though they do not imply) that walktrap won’t do that well in the worst case.

So many algorithms, so little time!

I have only brushed the surface of the literature on community detection, and the things I have discussed are heavily biased toward what I’ve read about and used in my own research. There are methods based on information theory, label propagation, and obscure physics processes like “spin glass” (whatever that is, it sounds frustrating).

And we have only been talking about perfect community structure. What if you want to allow people to be in multiple communities, or have communities at varying levels of granularity (e.g. a sports club within a school versus the whole student body of that school)? What if we want to allow people to be “members” of a community at varying degrees of intensity? How do we deal with noisy signals in our graphs? For example, if we get our data from observing people talk, are two people who have heated arguments considered to be in the same community? Since a lot social network data comes from sources like Twitter and Facebook where arguments are rampant, how do we distinguish between useful and useless data? More subtly, how do we determine useful information if a group within the social network are trying to mask their discovery? That is, how do we deal with adversarial noise in a graph?

And all of this is just on static graphs! What about graphs that change over time? You can keep making the problem more and more complicated as it gets more realistic.

With the huge wealth of research that has already been done just on the simplest case, and the difficult problems and known barriers to success even for the simple problems, it seems almost intimidating to even begin to try to answer these questions. But maybe that’s what makes them fascinating, not to mention that governments and big businesses pour many millions of dollars into this kind of research.

In the future of this blog we plan to derive and implement some of the basic methods of community detection. This includes, as a first outline, the modularity measure and the walktrap clustering algorithm. Considering that I’m also going to spend a large part of the summer thinking about these problems (indeed, with some of the leading researchers and upcoming stars under the sponsorship of the American Mathematical Society), it’s unlikely to end there.

Until next time!

Stable Marriages and Designing Markets

Here is a fun puzzle. Suppose we have a group of 10 men and 10 women, and each of the men has sorted the women in order of their preference for marriage (that is, a man prefers to marry a woman earlier in his list over a woman later in the list). Likewise, each of the women has sorted the men in order of marriageability. We might ask if there is any way that we, the omniscient cupids of love, can decide who should marry to make everyone happy.

Of course, the word happy is entirely imprecise. The mathematician balks at the prospect of leaving such terms undefined! In this case, it’s quite obvious that not everyone will get their first pick. Indeed, if even two women prefer the same man someone will have to settle for less than their top choice. So if we define happiness in this naive way, the problem is obviously not solvable in general.

Now what if instead of aiming for each individual’s maximum happiness we instead shoot for mutual contentedness? That is, what if “happiness” here means that nobody will ever have an incentive to cheat on their spouse? It turns out that for a mathematical version of this condition, we can always find a suitable set of marriages! These mathematical formalisms include some assumptions, such as that preferences never change and that no new individuals are added to the population. But it is nevertheless an impressive theorem that we can achieve stability no matter what everyone’s preferences are. In this post we’ll give the classical algorithm which constructs so-called “stable marriages,” and we’ll prove its correctness. Then we’ll see a slight generalization of the algorithm, in which the marriages are “polygamous,” and we’ll apply it to the problem of assigning students to internships.

As usual, all of the code used in this post is available for download at this blog’s Github page.

Historical Notes

The original algorithm for computing stable marriages was discovered by Lloyd Shapley and David Gale in the early 1960’s. Shapely and Alvin Roth went on to dedicate much of their career to designing markets and applying the stable marriage problem and its generalizations to such problems. In 2012 they jointly received the Nobel prize in economics for their work on this problem. If you want to know more about what “market design” means and why it’s needed (and you have an hour to spare), consider watching the talk below by Alvin Roth at the Simons Institute’s 2013 Symposium on the Visions of the Theory of Computing. Roth spends most of his time discussing the state of one particular economy, medical students and residence positions at hospitals, which he was asked to redesign. It’s quite a fascinating tale, although some of the deeper remarks assume knowledge of the algorithm we cover in this post.

Alvin Roth went on to apply the ideas presented in the video to economic systems in Boston and New York City public schools, kidney exchanges, and others. They all had the same sort of structure: both parties have preferences and stability makes sense. So he actually imposed the protocol we’re about to describe in order to guarantee that the process terminates to a stable arrangement (and automating it saves everyone involved a lot of time, stress, and money! Watch the video above for more on that).

The Monogamous Stable Marriage Algorithm

Let’s formally set up the problem. Let X = \left \{ 1, 2, \dots, n \right \} be a set of n suitors and Y = \left \{ 1,2,\dots ,n \right \} be a set of n “suited.” Let \textup{pref}_{X \to Y}: X \to S_n be a list of preferences for the suitors. In words, \textup{pref}_{X \to Y} accepts as input a suitor, and produces as output an ordering on the suited members of Y. We denote the output set as S_n, which the group theory folks will recognize as the permutation group on 1, \dots, n. Likewise, there is a function \textup{pref}_{Y \to X}: Y \to S_n describing the preferences of each of the suited.

An example will help clarify these stuffy definitions. If X = \left \{ 1, 2, 3 \right \} and Y = \left \{ 1, 2, 3 \right \}, then to say that

\textup{pref}_{X \to Y}(2) = (3, 1, 2)

is to say that the second suitor prefers the third member of Y the most, and then the first member of Y, and then the second. The programmer might imagine that the datum of the problem consists of two dictionaries (one for X and one for Y) whose keys are integers and whose values are lists of integers which contain 1 through n in some order.

A solution to the problem, then, is a way to match (or marry) suitors with suited. Specifically, a matching is a bijection m: X \to Y, so that x is matched with m(x). The reason we use a bijection is because the marriages are monogamous: only one suitor can be matched with one suited and vice versa. Later we’ll see this condition dropped so we can apply it to a more realistic problem of institutions (suited) which can accommodate many applicants (suitors). Because suitor and suited are awkward to say, we’ll use the familiar, antiquated, and politically incorrect terms “men and women.”

Now if we’re given a monogamous matching m, a pair x \in X, y \in Y is called unstable for m if both x,y prefer each other over their partners assigned by m. That is, (x,y) is unstable for m if y appears before m(y) in the preference list for x, \textup{pref}_{X \to Y}(x), and likewise x appears before m^{-1}(y) in \textup{pref}_{Y \to X}(y).

Another example to clarify: again let X = Y = \left \{ 1,2,3 \right \} and suppose for simplicity that our matching m pairs m(i) = i. If man 2 has the preference list (3,2,1) and woman 3 has the preference list (2,1,3), then 2 and 3 together form an unstable pair for m, because they would rather be with each other over their current partners. That is, they have a mutual incentive to cheat on their spouses. We say that the matching is unstable or admits an unstable pair if there are any unstable pairs for it, and we call the entire matching stable if it doesn’t admit any unstable pairs.

Unlike real life, mathematically unstable marriages need not have constant arguments.

Unlike real life, mathematically unstable marriages need not feature constant arguments.

So the question at hand is: is there an algorithm which, given access to to the two sets of preferences, can efficiently produce a stable matching? We can also wonder whether a stable matching is guaranteed to exist, and the answer is yes. In fact, we’ll prove this and produce an efficient algorithm in one fell swoop.

The central concept of the algorithm is called deferred acceptance. The gist is like this. The algorithm operates in rounds. During each round, each man will “propose” to a woman, and each woman will pick the best proposal available. But the women will not commit to their pick. They instead reject all other suitors, who go on to propose to their second choices in the next round. At that stage each woman (who now may have a more preferred suitor than in the first round) may replace her old pick with a new one. The process continues in this manner until each man is paired with a woman. In this way, each of the women defers accepting any proposal until the end of the round, progressively increasing the quality of her choice. Likewise, the men progressively propose less preferred matches as the rounds progress.

It’s easy to argue such a process must eventually converge. Indeed, the contrary means there’s some sort of cycle in the order of proposals, but each man proposes to only strictly less preferred women than any previous round, and the women can only strictly increase the quality of their held pick. Mathematically, we’re using an important tool called monotonicity. That some quantity can only increase or decrease as time goes on, and since the quantity is bounded, we must eventually reach a local maximum. From there, we can prove that any local maximum satisfies the property we want (here, that the matching is stable), and we win. Indeed, supposing to the contrary that we have a pair (x,y) which is unstable for the matching m produced at the end of this process, then it must have been the case that x proposed to y in some earlier round. But y has as her final match some other suitor x' = m^{-1}(y) whom she prefers less than x. Though she may have never picked x at any point in the algorithm, she can only end up with the worse choice x' if at some point y chose a suitor that was less preferred than the suitor she already had. Since her choices are monotonic this cannot happen, so no unstable pairs can exist.

Rather than mathematically implement the algorithm in pseudocode, let’s produce the entire algorithm in Python to make the ideas completely concrete.

Python Implementation

We start off with some simple data definitions for the two parties which, in the renewed interest of generality, refer to as Suitor and Suited.

class Suitor(object):
   def __init__(self, id, prefList):
      self.prefList = prefList
      self.rejections = 0 # num rejections is also the index of the next option
      self.id = id

   def preference(self):
      return self.prefList[self.rejections]

   def __repr__(self):
      return repr(self.id)

A Suitor is simple enough: he has an id representing his “index” in the set of Suitors, and a preference list prefList which in its i-th position contains the Suitor’s i-th most preferred Suited. This is identical to our mathematical representation from earlier, where a list like (2,3,1) means that the Suitor prefers the second Suited most and the first Suited least. Knowing the algorithm ahead of time, we add an additional piece of data: the number of rejections the Suitor has seen so far. This will double as the index of the Suited that the Suitor is currently proposing to. Indeed, the preference function provides a thin layer of indirection allowing us to ignore the underlying representation, so long as one updates the number of rejections appropriately.

Now for the Suited.

class Suited(object):
   def __init__(self, id, prefList):
      self.prefList = prefList
      self.held = None
      self.currentSuitors = set()
      self.id = id

   def __repr__(self):
      return repr(self.id)

A Suited likewise has a list of preferences and an id, but in addition she has a held attribute for the currently held Suitor, and a list currentSuitors of Suitors that are currently proposing to her. Hence we can define a reject method which accepts no inputs, and returns a list of rejected suitors, while updating the woman’s state to hold onto her most preferred suitor.

   def reject(self):
      if len(self.currentSuitors) == 0:
         return set()

      if self.held is not None:
         self.currentSuitors.add(self.held)

      self.held = min(self.currentSuitors, key=lambda suitor: self.prefList.index(suitor.id))
      rejected = self.currentSuitors - set([self.held])
      self.currentSuitors = set()

      return rejected

The call to min does all the work: finding the Suitor that appears first in her preference list. The rest is bookkeeping. Now the algorithm for finding a stable marriage, following the deferred acceptance algorithm, is simple.

# monogamousStableMarriage: [Suitor], [Suited] -> {Suitor -> Suited}
# construct a stable (monogamous) marriage between suitors and suiteds
def monogamousStableMarriage(suitors, suiteds):
   unassigned = set(suitors)

   while len(unassigned) > 0:
      for suitor in unassigned:
         suiteds[suitor.preference()].currentSuitors.add(suitor)
      unassigned = set()

      for suited in suiteds:
         unassigned |= suited.reject()

      for suitor in unassigned:
         suitor.rejections += 1

   return dict([(suited.held, suited) for suited in suiteds])

All the Suitors are unassigned to begin with. Each iteration of the loop corresponds to a round of the algorithm: the Suitors are added to the currentSuitors list of their next most preferred Suited. Then the Suiteds “simultaneously” reject some Suitors, whose rejection counts are upped by one and returned to the pool of unassigned Suitors. Once every Suited has held onto a Suitor we’re done.

Given a matching, we can define a function that verifies by brute force that the marriage is stable.

# verifyStable: [Suitor], [Suited], {Suitor -> Suited} -> bool
# check that the assignment of suitors to suited is a stable marriage
def verifyStable(suitors, suiteds, marriage):
   import itertools
   suitedToSuitor = dict((v,k) for (k,v) in marriage.items())
   precedes = lambda L, item1, item2: L.index(item1) < L.index(item2)

   def suitorPrefers(suitor, suited):
      return precedes(suitor.prefList, suited.id, marriage[suitor].id)

   def suitedPrefers(suited, suitor):
      return precedes(suited.prefList, suitor.id, suitedToSuitor[suited].id)

   for (suitor, suited) in itertools.product(suitors, suiteds):
      if suited != marriage[suitor] and suitorPrefers(suitor, suited) and suitedPrefers(suited, suitor):
         return False, (suitor.id, suited.id)

   return

Indeed, we can test the algorithm on an instance of the problem.

>>> suitors = [Suitor(0, [3,5,4,2,1,0]), Suitor(1, [2,3,1,0,4,5]),
...            Suitor(2, [5,2,1,0,3,4]), Suitor(3, [0,1,2,3,4,5]),
...            Suitor(4, [4,5,1,2,0,3]), Suitor(5, [0,1,2,3,4,5])]
>>> suiteds = [Suited(0, [3,5,4,2,1,0]), Suited(1, [2,3,1,0,4,5]),
...            Suited(2, [5,2,1,0,3,4]), Suited(3, [0,1,2,3,4,5]),
...            Suited(4, [4,5,1,2,0,3]), Suited(5, [0,1,2,3,4,5])]
>>> marriage = monogamousStableMarriage(suitors, suiteds)
{3: 0, 4: 4, 5: 1, 1: 2, 2: 5, 0: 3}
>>> verifyStable(suitors, suiteds, marriage)
True

We encourage the reader to check this by hand (this one only took two rounds). Even better, answer the question of whether the algorithm could ever require n steps to converge for 2n individuals, where you get to pick the preference list to try to make this scenario happen.

Stable Marriages with Capacity

We can extend this algorithm to work for “polygamous” marriages in which one Suited can accept multiple Suitors. In fact, the two problems are entirely the same! Just imagine duplicating a Suited with large capacity into many Suiteds with capacity of 1. This particular reduction is not very efficient, but it allows us to see that the same proof of convergence and correctness applies. We can then modify our classes and algorithm to account for it, so that (for example) instead of a Suited “holding” a single Suitor, she holds a set of Suitors. We encourage the reader to try extending our code above to the polygamous case as an exercise, and we’ve provided the solution in the code repository for this post on this blog’s Github page.

Ways to Make it Harder

When you study algorithmic graph problems as much as I do, you start to get disheartened. It seems like every problem is NP-hard or worse. So when we get a situation like this, a nice, efficient algorithm with very real consequences and interpretations, you start to get very excited. In between our heaves of excitement, we imagine all the other versions of this problem that we could solve and Nobel prizes we could win. Unfortunately the landscape is bleaker than that, and most extensions of stable marriage problems are NP-complete.

For example, what if we allow ties? That is, one man can be equally happy with two women. This is NP-complete. However, it turns out his extension can be formulated as an integer programming problem, and standard optimization techniques can be used to approximate a solution.

What if, thinking about the problem in terms of medical students and residencies, we allow people to pick their preferences as couples? Some med students are married, after all, and prefer to be close to their spouse even if it means they have a less preferred residency. NP-hard again. See page 53 (pdf page 71) of these notes for a more detailed investigation. The problem is essentially that there is not always a stable matching, and so even determining whether there is one is NP-complete.

So there are a lot of ways to enrich the problem, and there’s an interesting line between tractable and hard in the worst case. As a (relatively difficult) exercise, try to solve the “roommates” version of the problem, where there is no male/female distinction (anyone can be matched with anyone). It turns out to have a tractable solution, and the algorithm is similar to the one outlined in this post.

Until next time!

PS. I originally wrote this post about a year ago when I was contacted by someone in industry who agreed to provide some (anonymized) data listing the preferences of companies and interns applying to work at those companies. Not having heard from them for almost a year, I figure it’s a waste to let this finished post collect dust at the risk of not having an interesting data set. But if you, dear reader, have any data you’d like to provide that fits into the framework of stable marriages, I’d love to feature your company/service on my blog (and solve the matching problem) in exchange for the data. The only caveat is that the data would have to be public, so you would have to anonymize it.

Elliptic Curve Diffie-Hellman

So far in this series we’ve seen elliptic curves from many perspectives, including the elementary, algebraic, and programmatic ones. We implemented finite field arithmetic and connected it to our elliptic curve code. So we’re in a perfect position to feast on the main course: how do we use elliptic curves to actually do cryptography?

History

As the reader has heard countless times in this series, an elliptic curve is a geometric object whose points have a surprising and well-defined notion of addition. That you can add some points on some elliptic curves was a well-known technique since antiquity, discovered by Diophantus. It was not until the mid 19th century that the general question of whether addition always makes sense was answered by Karl Weierstrass. In 1908 Henri Poincaré asked about how one might go about classifying the structure of elliptic curves, and it was not until 1922 that Louis Mordell proved the fundamental theorem of elliptic curves, classifying their algebraic structure for most important fields.

While mathematicians have always been interested in elliptic curves (there is currently a million dollar prize out for a solution to one problem about them), its use in cryptography was not suggested until 1985. Two prominent researchers independently proposed it: Neal Koblitz at the University of Washington, and Victor Miller who was at IBM Research at the time. Their proposal was solid from the start, but elliptic curves didn’t gain traction in practice until around 2005. More recently, the NSA was revealed to have planted vulnerable national standards for elliptic curve cryptography so they could have backdoor access. You can see a proof and implementation of the backdoor at Aris Adamantiadis’s blog. For now we’ll focus on the cryptographic protocols themselves.

The Discrete Logarithm Problem

Koblitz and Miller had insights aplenty, but the central observation in all of this is the following.

Adding is easy on elliptic curves, but undoing addition seems hard.

What I mean by this is usually called the discrete logarithm problem. Here’s a formal definition. Recall that an additive group is just a set of things that have a well-defined addition operation, and the that notation ny means y + y + \dots + y (n times).

Definition: Let G be an additive group, and let x, y be elements of G so that x = ny for some integer n. The discrete logarithm problem asks one to find n when given x and y.

I like to give super formal definitions first, so let’s do a comparison. For integers this problem is very easy. If you give me 12 and 4185072, I can take a few seconds and compute that 4185072 = (348756) 12 using the elementary-school division algorithm (in the above notation, y=12, x=4185072, and n = 348756). The division algorithm for integers is efficient, and so it gives us a nice solution to the discrete logarithm problem for the additive group of integers \mathbb{Z}.

The reason we use the word “logarithm” is because if your group operation is multiplication instead of addition, you’re tasked with solving the equation x = y^n for n. With real numbers you’d take a logarithm of both sides, hence the name. Just in case you were wondering, we can also solve the multiplicative logarithm problem efficiently for rational numbers (and hence for integers) using the square-and-multiply algorithm. Just square y until doing so would make you bigger than x, then multiply by y until you hit x.

But integers are way nicer than they need to be. They are selflessly well-ordered. They give us division for free. It’s a computational charity! What happens when we move to settings where we don’t have a division algorithm? In mathematical lingo: we’re really interested in the case when G is just a group, and doesn’t have additional structure. The less structure we have, the harder it should be to solve problems like the discrete logarithm. Elliptic curves are an excellent example of such a group. There is no sensible ordering for points on an elliptic curve, and we don’t know how to do division efficiently. The best we can do is add y to itself over and over until we hit x, and it could easily happen that n (as a number) is exponentially larger than the number of bits in x and y.

What we really want is a polynomial time algorithm for solving discrete logarithms. Since we can take multiples of a point very fast using the double-and-add algorithm from our previous post, if there is no polynomial time algorithm for the discrete logarithm problem then “taking multiples” fills the role of a theoretical one-way function, and as we’ll see this opens the door for secure communication.

Here’s the formal statement of the discrete logarithm problem for elliptic curves.

Problem: Let E be an elliptic curve over a finite field k. Let P, Q be points on E such that P = nQ for some integer n. Let |P| denote the number of bits needed to describe the point P. We wish to find an algorithm which determines n and has runtime polynomial in |P| + |Q|. If we want to allow randomness, we can require the algorithm to find the correct n with probability at least 2/3.

So this problem seems hard. And when mathematicians and computer scientists try to solve a problem for many years and they can’t, the cryptographers get excited. They start to wonder: under the assumption that the problem has no efficient solution, can we use that as the foundation for a secure communication protocol?

The Diffie-Hellman Protocol and Problem

Let’s spend the rest of this post on the simplest example of a cryptographic protocol based on elliptic curves: the Diffie-Hellman key exchange.

A lot of cryptographic techniques are based on two individuals sharing a secret string, and using that string as the key to encrypt and decrypt their messages. In fact, if you have enough secret shared information, and you only use it once, you can have provably unbreakable encryption! We’ll cover this idea in a future series on the theory of cryptography (it’s called a one-time pad, and it’s not all that complicated). All we need now is motivation to get a shared secret.

Because what if your two individuals have never met before and they want to generate such a shared secret? Worse, what if their only method of communication is being monitored by nefarious foes? Can they possibly exchange public information and use it to construct a shared piece of secret information? Miraculously, the answer is yes, and one way to do it is with the Diffie-Hellman protocol. Rather than explain it abstractly let’s just jump right in and implement it with elliptic curves.

As hinted by the discrete logarithm problem, we only really have one tool here: taking multiples of a point. So say we’ve chosen a curve C and a point on that curve Q. Then we can take some secret integer n, and publish Q and nQ for the world to see. If the discrete logarithm problem is truly hard, then we can rest assured that nobody will be able to discover n.

How can we use this to established a shared secret? This is where Diffie-Hellman comes in. Take our two would-be communicators, Alice and Bob. Alice and Bob each pick a binary string called a secret key, which in interpreted as a number in this protocol. Let’s call Alice’s secret key s_A and Bob’s s_B, and note that they don’t have to be the same. As the name “secret key” suggests, the secret keys are held secret. Moreover, we’ll assume that everything else in this protocol, including all data sent between the two parties, is public.

So Alice and Bob agree ahead of time on a public elliptic curve C and a public point Q on C. We’ll sometimes call this point the base point for the protocol.

Bob can cunningly do the following trick: take his secret key s_B and send s_B Q to Alice. Equally slick Alice computes s_A Q and sends that to Bob. Now Alice, having s_B Q , computes s_A s_B Q. And Bob, since he has s_A Q, can compute s_B s_A Q. But since addition is commutative in elliptic curve groups, we know s_A s_B Q = s_B s_A Q. The secret piece of shared information can be anything derived from this new point, for example its x-coordinate.

If we want to talk about security, we have to describe what is public and what the attacker is trying to determine. In this case the public information consists of the points Q, s_AQ, s_BQ. What is the attacker trying to figure out? Well she really wants to eavesdrop on their subsequent conversation, that is, the stuff that encrypt with their new shared secret s_As_BQ. So the attacker wants find out s_As_BQ. And we’ll call this the Diffie-Hellman problem.

Diffie-Hellman Problem: Suppose you fix an elliptic curve E over a finite field k, and you’re given four points Q, aQ, bQ and P for some unknown integers a, b. Determine if P = abQ in polynomial time (in the lengths of Q, aQ, bQ, P).

On one hand, if we had an efficient solution to the discrete logarithm problem, we could easily use that to solve the Diffie-Hellman problem because we could compute a,b and them quickly compute abQ and check if it’s P. In other words discrete log is at least as hard as this problem. On the other hand nobody knows if you can do this without solving the discrete logarithm problem. Moreover, we’re making this problem as easy as we reasonably can because we don’t require you to be able to compute abQ. Even if some prankster gave you a candidate for abQ, all you have to do is check if it’s correct. One could imagine some test that rules out all fakes but still doesn’t allow us to compute the true point, which would be one way to solve this problem without being able to solve discrete log.

So this is our hardness assumption: assuming this problem has no efficient solution then no attacker, even with really lucky guesses, can feasibly determine Alice and Bob’s shared secret.

Python Implementation

The Diffie-Hellman protocol is just as easy to implement as you would expect. Here’s some Python code that does the trick. Note that all the code produced in the making of this post is available on this blog’s Github page.

def sendDH(privateKey, generator, sendFunction):
   return sendFunction(privateKey * generator)

def receiveDH(privateKey, receiveFunction):
   return privateKey * receiveFunction()

And using our code from the previous posts in this series we can run it on a small test.

import os

def generateSecretKey(numBits):
   return int.from_bytes(os.urandom(numBits // 8), byteorder='big')

if __name__ == "__main__":
   F = FiniteField(3851, 1)
   curve = EllipticCurve(a=F(324), b=F(1287))
   basePoint = Point(curve, F(920), F(303))

   aliceSecretKey = generateSecretKey(8)
   bobSecretKey = generateSecretKey(8)

   alicePublicKey = sendDH(aliceSecretKey, basePoint, lambda x:x)
   bobPublicKey = sendDH(bobSecretKey, basePoint, lambda x:x)

   sharedSecret1 = receiveDH(bobSecretKey, lambda: alicePublicKey)
   sharedSecret2 = receiveDH(aliceSecretKey, lambda: bobPublicKey)
   print('Shared secret is %s == %s' % (sharedSecret1, sharedSecret2))

Pythons os module allows us to access the operating system’s random number generator (which is supposed to be cryptographically secure) via the function urandom, which accepts as input the number of bytes you wish to generate, and produces as output a Python bytestring object that we then convert to an integer. Our simplistic (and totally insecure!) protocol uses the elliptic curve C defined by y^2 = x^3 + 324 x + 1287 over the finite field \mathbb{Z}/3851. We pick the base point Q = (920, 303), and call the relevant functions with placeholders for actual network transmission functions.

There is one issue we have to note. Say we fix our base point Q. Since an elliptic curve over a finite field can only have finitely many points (since the field only has finitely many possible pairs of numbers), it will eventually happen that nQ = 0 is the ideal point. Recall that the smallest value of n for which nQ = 0 is called the order of Q. And so when we’re generating secret keys, we have to pick them to be smaller than the order of the base point. Viewed from the other angle, we want to pick Q to have large order, so that we can pick large and difficult-to-guess secret keys. In fact, no matter what integer you use for the secret key it will be equivalent to some secret key that’s less than the order of Q. So if an attacker could guess the smaller secret key he wouldn’t need to know your larger key.

The base point we picked in the example above happens to have order 1964, so an 8-bit key is well within the bounds. A real industry-strength elliptic curve (say, Curve25519 or the curves used in the NIST standards*) is designed to avoid these problems. The order of the base point used in the Diffie-Hellman protocol for Curve25519 has gargantuan order (like 2^{256}). So 256-bit keys can easily be used. I’m brushing some important details under the rug, because the key as an actual string is derived from 256 pseudorandom bits in a highly nontrivial way.

So there we have it: a simple cryptographic protocol based on elliptic curves. While we didn’t experiment with a truly secure elliptic curve in this example, we’ll eventually extend our work to include Curve25519. But before we do that we want to explore some of the other algorithms based on elliptic curves, including random number generation and factoring.

Comments on Insecurity

Why do we use elliptic curves for this? Why not do something like RSA and do multiplication (and exponentiation) modulo some large prime?

Well, it turns out that algorithmic techniques are getting better and better at solving the discrete logarithm problem for integers mod p, leading some to claim that RSA is dead. But even if we will never find a genuinely efficient algorithm (polynomial time is good, but might not be good enough), these techniques have made it clear that the key size required to maintain high security in RSA-type protocols needs to be really big. Like 4096 bits. But for elliptic curves we can get away with 256-bit keys. The reason for this is essentially mathematical: addition on elliptic curves is not as well understood as multiplication is for integers, and the more complex structure of the group makes it seem inherently more difficult. So until some powerful general attacks are found, it seems that we can get away with higher security on elliptic curves with smaller key sizes.

I mentioned that the particular elliptic curve we chose was insecure, and this raises the natural question: what makes an elliptic curve/field/basepoint combination secure or insecure? There are a few mathematical pitfalls (including certain attacks we won’t address), but one major non-mathematical problem is called a side-channel attack. A side channel attack against a cryptographic protocol is one that gains additional information about users’ secret information by monitoring side-effects of the physical implementation of the algorithm.

The problem is that different operations, doubling a point and adding two different points, have very different algorithms. As a result, they take different amounts of time to complete and they require differing amounts of power. Both of these can be used to reveal information about the secret keys. Despite the different algorithms for arithmetic on Weierstrass normal form curves, one can still implement them to be secure. Naively, one might pad the two subroutines with additional (useless) operations so that they have more similar time/power signatures, but I imagine there are better methods available.

But much of what makes a curve’s domain parameters mathematically secure or insecure is still unknown. There are a handful of known attacks against very specific families of parameters, and so cryptography experts simply avoid these as they are discovered. Here is a short list of pitfalls, and links to overviews:

  1. Make sure the order of your basepoint has a short facorization (e.g., is 2p, 3p, or 4p for some prime p). Otherwise you risk attacks based on the Chinese Remainder Theorem, the most prominent of which is called Pohlig-Hellman.
  2. Make sure your curve is not supersingular. If it is you can reduce the discrete logarithm problem to one in a different and much simpler group.
  3. If your curve C is defined over \mathbb{Z}/p, make sure the number of points on C is not equal to p. Such a curve is called prime-field anomalous, and its discrete logarithm problem can be reduced to the (additive) version on integers.
  4. Don’t pick a small underlying field like \mathbb{F}_{2^m} for small mGeneral-purpose attacks can be sped up significantly against such fields.
  5. If you use the field \mathbb{F}_{2^m}, ensure that m is prime. Many believe that if m has small divisors, attacks based on some very complicated algebraic geometry can be used to solve the discrete logarithm problem more efficiently than any general-purpose method. This gives evidence that m being composite at all is dangerous, so we might as well make it prime.

This is a sublist of the list provided on page 28 of this white paper.

The interesting thing is that there is little about the algorithm and protocol that is vulnerable. Almost all of the vulnerabilities come from using bad curves, bad fields, or a bad basepoint. Since the known attacks work on a pretty small subset of parameters, one potentially secure technique is to just generate a random curve and a random point on that curve! But apparently all respected national agencies will refuse to call your algorithm “standards compliant” if you do this.

Next time we’ll continue implementing cryptographic protocols, including the more general public-key message sending and signing protocols.

Until then!