Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory.
Solution: (in python)
import random

def randomHash(modulus):
    # Return a hash function x -> (a*x + b) % modulus for randomly chosen a, b.
    a, b = random.randint(0, modulus - 1), random.randint(0, modulus - 1)
    def f(x):
        return (a * x + b) % modulus
    return f

def average(L):
    return sum(L) / len(L)

def numDistinctElements(stream, numParallelHashes=10):
    modulus = 2**20
    hashes = [randomHash(modulus) for _ in range(numParallelHashes)]
    minima = [modulus] * numParallelHashes
    currentEstimate = 0

    for i in stream:
        hashValues = [h(i) for h in hashes]
        for j, newValue in enumerate(hashValues):
            if newValue < minima[j]:
                minima[j] = newValue

        # The expected minimum of n roughly uniform hash values in [0, modulus)
        # is about modulus / (n+1), so invert that relationship, averaging the
        # minima over the parallel hashes to reduce variance.
        currentEstimate = modulus / average(minima)
        yield currentEstimate
Discussion: The technique here is random hashing. The central idea is the same as the general principle presented in our recent post on hashing for load balancing. In particular, if you have an algorithm that works under the assumption that the data is uniformly random, then the same algorithm will work (up to a good approximation) if you process the data through a randomly chosen hash function.
So if we assume the data in the stream consists of $ N$ uniformly random real numbers between zero and one, here is what we would do. Maintain a single number $ x_{\textup{min}}$ representing the minimum element in the list, and update it every time we encounter a smaller number in the stream. A simple probability calculation or an argument by symmetry shows that the expected value of the minimum is $ 1/(N+1)$. So your estimate of $ N$ would be $ 1/x_{\textup{min}}$. (Ignoring the extra +1 does not change much, as we’ll see.) One can spend some time thinking about the variance of this estimate (indeed, our earlier post is great guidance for how such a calculation would work), but since real data is not random we need to do more work. If the elements are actually integers between zero and $ k$, then this estimate can be scaled by $ k$ and everything basically works out the same.
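To see the idea in isolation, here is a quick sketch (the function name is just for illustration) of the minimum-based estimate applied to truly uniform random numbers in $ [0,1]$:

import random

def estimateViaMinimum(stream):
    # Track the smallest value seen; its expected value is 1/(N+1),
    # so 1/minimum is a rough estimate of N.
    minimum = float('inf')
    for x in stream:
        if x < minimum:
            minimum = x
    return 1 / minimum

data = [random.random() for _ in range(10000)]
print(estimateViaMinimum(data))   # typically within a small factor of 10000

A single minimum has high variance (one unusually small value throws the estimate way off), which is exactly why the streaming code above keeps the minima of several independent hashes and averages them.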
Processing the data through a hash function $ h$ chosen randomly from a 2-universal family (and we proved in the aforementioned post that this modulus thing is 2-universal) makes the outputs “essentially random” enough to have the above technique work with some small loss in accuracy. And to reduce variance, you can process the stream in parallel with many random hash functions. This rough sketch results in the code above. Indeed, before I state a formal theorem, let’s see the above code in action. First on truly random data:
S = [random.randint(1, 2**20) for _ in range(10000)]

for k in range(10, 301, 10):
    for est in numDistinctElements(S, k):
        pass
    print(abs(est))
# output
18299.75567190227
7940.7497160166595
12034.154552410098
12387.19432959244
15205.56844547564
8409.913113220158
8057.99978043693
9987.627098464103
10313.862295081966
9084.872639057356
10952.745228373375
10360.569781803211
11022.469475216301
9741.250165892501
11474.896038520465
10538.452261306533
10068.793492995934
10100.266495424627
9780.532155130093
8806.382800033594
10354.11482578643
10001.59202254498
10623.87031408308
9400.404915767062
10710.246772348424
10210.087633885101
9943.64709187974
10459.610972568578
10159.60175069326
9213.120899718839
As you can see the output is never off by more than a factor of 2. Now with “adversarial data.”
S = range(10000) #[random.randint(1,2**20) for _ in range(10000)]

for k in range(10, 301, 10):
    for est in numDistinctElements(S, k):
        pass
    print(abs(est))
# output
12192.744186046511
15935.80547112462
10167.188106011634
12977.425742574258
6454.364151175674
7405.197740112994
11247.367453263867
4261.854392115023
8453.228233608026
7706.717624577393
7582.891328643745
5152.918628936483
1996.9365093316926
8319.20208545846
3259.0787592465967
6812.252720480753
4975.796789951151
8456.258064516129
8851.10133724288
7317.348220516398
10527.871485943775
3999.76974425661
3696.2999065091117
8308.843106180666
6740.999794281012
8468.603733730935
5728.532232608959
5822.072220349402
6382.349459544548
8734.008940222673
The estimates here are off by a factor of up to 5, and this estimate seems to get better as the number of hash functions used increases. The formal theorem is this:
Theorem: If $ S$ is the set of distinct items in the stream and $ n = |S|$ and $ m > 100 n$, then with probability at least 2/3 the estimate $ m / x_{\textup{min}}$ is between $ n/6$ and $ 6n$.
We omit the proof (see below for references and better methods). As a quick analysis: since we’re only storing a constant number of integers at any given step, the algorithm has space requirement $ O(\log m) = O(\log n)$, and each update takes time polynomial in $ \log m$ (since we have to compute a multiplication and a reduction modulo $ m$).
This method is just the first ripple in a lake of research on this topic. The general area is called “streaming algorithms,” or “sublinear algorithms.” This particular problem, called cardinality estimation, is related to a family of problems called estimating frequency moments. The literature gets pretty involved in the various tradeoffs between space requirements and processing time per stream element.
As far as estimating cardinality goes, the first major results were due to Flajolet and Martin in 1983, where they provided a slightly more involved version of the above algorithm, which uses logarithmic space.
Later revisions to the algorithm (2003) got the space requirement down to $ O(\log \log n)$, which is exponentially better than our solution. And further tweaks and analysis improved the accuracy, bringing the relative error down to something proportional to the inverse square root of the number of registers used. This is called the HyperLogLog algorithm, and it has been tested in practice at Google.
Finally, a theoretically optimal algorithm (achieving an arbitrarily good estimate with logarithmic space) was presented and analyzed by Kane et al in 2010.
Greedy algorithms are among the simplest and most intuitive algorithms known to humans. Their name essentially gives their description: do the thing that looks best right now, and repeat until nothing looks good anymore or you’re forced to stop. Some of the best situations in computer science are also when greedy algorithms are optimal or near-optimal. There is a beautiful theory of this situation, known as the theory of matroids. We haven’t covered matroids on this blog (edit: we did), but in this post we will focus on the next best thing: when the greedy algorithm guarantees a reasonably good approximation to the optimal solution.
This situation isn’t hard to formalize, and we’ll make it as abstract as possible. Say you have a set of objects $ X$, and you’re looking to find the “best” subset $ S \subset X$. Here “best” is just measured by a fixed (known, efficiently computable) objective function $ f : 2^X \to \mathbb{R}$. That is, $ f$ accepts as input subsets of $ X$ and outputs numbers so that better subsets have larger numbers. Then the goal is to find a subset maximizing $ f$.
In this generality the problem is clearly impossible. You’d have to check all subsets to be sure you didn’t miss the best one. So what conditions do we need on either $ X$ or $ f$ or both that makes this problem tractable? There are plenty you could try, but one very rich property is submodularity.
The Submodularity Condition
I think the simplest way to explain submodularity is in terms of coverage. Say you’re starting a new radio show and you have to choose which radio stations to broadcast from to reach the largest number of listeners. For simplicity say each radio station has one tower it broadcasts from, and you have a good estimate of the number of listeners you would reach if you broadcast from a given tower. For more simplicity, say it costs the same to broadcast from each tower, and your budget restricts you to a maximum of ten stations to broadcast from. So the question is: how do you pick towers to maximize your overall reach?
The hidden condition here is that some towers overlap in which listeners they reach. So if you broadcast from two towers in the same city, a listener who has access to both will just pick one or the other. In other words, there’s a diminished benefit to picking two overlapping towers if you already have chosen one.
In our version of the problem, picking both of these towers has some small amount of “overkill.”
This “diminishing returns” condition is a general idea you can impose on any function that takes in subsets of a given set and produces numbers. If $ X$ is a set, then we denote by $ 2^X$ the set of all subsets of $ X$ (so named because it has $ 2^{|X|}$ elements). So we can state this condition more formally:
Definition: Let $ X$ be a finite set. A function $ f: 2^X \to \mathbb{R}$ is called submodular if for all subsets $ S \subset T \subset X$ and all $ x \in X \setminus T$,
$ \displaystyle f(S \cup \{ x \}) - f(S) \geq f(T \cup \{ x \}) - f(T)$
In other words, if $ f$ measures “benefit,” then the marginal benefit of adding $ x$ to $ S$ is at least as high as the marginal benefit of adding it to $ T$. Since $ S \subset T$ and $ x$ are all arbitrary, this is as general as one could possibly make it: adding $ x$ to a bigger set can’t be better than adding it to a smaller set.
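To make the definition concrete, here is a small sketch (the data and names are hypothetical, in the spirit of the radio tower example) of a coverage function together with a numerical check of the diminishing-returns inequality:

def coverage(towers, chosen):
    # f(S) = number of distinct listeners reached by the chosen towers.
    reached = set()
    for t in chosen:
        reached |= towers[t]
    return len(reached)

towers = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}}

S = {'a'}
T = {'a', 'b'}   # S is a subset of T
x = 'c'          # an element outside T

marginalS = coverage(towers, S | {x}) - coverage(towers, S)
marginalT = coverage(towers, T | {x}) - coverage(towers, T)
print(marginalS, marginalT)   # 3 and 2: the smaller set gets the bigger gain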
Before we start doing things with submodular functions, let’s explore some basic properties. The first is an equivalent definition of submodularity:
Proposition: $ f$ is submodular if and only if for all $ A, B \subset X$, it holds that
$ \displaystyle f(A \cap B) + f(A \cup B) \leq f(A) + f(B)$.
Proof. If we assume $ f$ has the condition from this proposition, then we can set $ A=T, B=S \cup \{ x \}$, and the formula just works out. Conversely, if we have the condition from the definition, then using the fact that $ A \cap B \subset B$ we can inductively apply the inequality to each element of $ A \setminus B$ to get

$ \displaystyle f(A) - f(A \cap B) \geq f(A \cup B) - f(B),$

which rearranges to the claimed inequality. $ \square$
Next, we can tweak and combine submodular functions to get more submodular functions. In particular, non-negative linear combinations of sub-modular functions are submodular. In other words, if $ f_1, \dots, f_k$ are submodular on the same set $ X$, and $ \alpha_1, \dots, \alpha_k$ are all non-negative reals, then $ \alpha_1 f_1 + \dots + \alpha_k f_k$ is also a submodular function on $ X$. It’s an easy exercise in applying the definition to see why this is true. This is important because when we’re designing objectives to maximize, we can design them by making some simple submodular pieces, and then picking an appropriate combination of those pieces.
The second property we need to impose on a submodular function is monotonicity. That is, as your sets get more elements added to them, their value under $ f$ only goes up. In other words, $ f$ is monotone when $ S \subset T$ implies $ f(S) \leq f(T)$. An interesting property of functions that are both submodular and monotone is that the truncation of such a function is also submodular and monotone. In other words, $ \textup{min}(f(S), c)$ is still monotone submodular when $ f$ is monotone submodular and $ c$ is a constant.
Submodularity and Monotonicity Give 1 – 1/e
The wonderful thing about submodular functions is that we have a lot of great algorithmic guarantees for working with them. We’ll prove right now that the coverage problem (while it might be hard to solve in general) can be approximated pretty well by the greedy algorithm.
Here’s the algorithmic setup. I give you a finite set $ X$ and an efficient black-box to evaluate $ f(S)$ for any subset $ S \subset X$ you want. I promise you that $ f$ is monotone and submodular. Now I give you an integer $ k$ between 1 and the size of $ X$, and your task is to quickly find a set $ S$ of size $ k$ for which $ f(S)$ is maximal among all subsets of size $ k$. That is, you design an algorithm that will work for any $ k, X, f$ and runs in polynomial time in the sizes of $ X, k$.
In general this problem is NP-hard, meaning you’re not going to find a solution that works in the worst case (if you do, don’t call me; just claim your million dollar prize). So how well can we approximate the optimal value for $ f(S)$ by a different set of size $ k$? The beauty is that, if your function is monotone and submodular, you can guarantee to get within 63% of the optimum. The hope (and reality) is that in practice it will often perform much better, but still this is pretty good! More formally,
Theorem: Let $ f$ be a monotone, submodular, non-negative function on $ X$. The greedy algorithm, which starts with $ S$ as the empty set and at every step picks an element $ x$ which maximizes the marginal benefit $ f(S \cup \{ x \}) - f(S)$, provides a set $ S$ that achieves a $ (1- 1/e)$-approximation of the optimum.
We’ll prove this in just a little bit more generality, and the generality is quite useful. If we call $ S_1, S_2, \dots, S_l$ the sets chosen by the greedy algorithm (where now we might run the greedy algorithm for $ l > k$ steps), then for all $ l, k$, we have

$ \displaystyle f(S_l) \geq \left( 1 - e^{-l/k} \right) \max_{T: |T| \leq k} f(T)$
This allows us to run the algorithm for more than $ k$ steps to get a better approximation by sets of larger size, and quantify how much better the guarantee on that approximation would be. It’s like an algorithmic way of hedging your risk. So let’s prove it.
Proof. Let’s set up some notation first. Fix your $ l$ and $ k$, call $ S_i$ the set chosen by the greedy algorithm at step $ i$, and call $ S^*$ the optimal subset of size $ k$. Further call $ \textup{OPT}$ the value of the best set $ f(S^*)$. Call $ x_1^*, \dots, x_k^*$ the elements of $ S^*$ (the order is irrelevant). Now for every $ i < l$ monotonicity gives us $ f(S^*) \leq f(S^* \cup S_i)$. We can unravel this into a sum of marginal gains of adding single elements. The first step is

$ \displaystyle f(S^* \cup S_i) = f(S^* \cup S_i) - f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i) + f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i)$
The second step removes $ x_{k-1}^*$ from the last term, the third removes $ x_{k-2}^*$, and so on until we have removed all of $ S^*$ and get this sum

$ \displaystyle f(S^* \cup S_i) = f(S_i) + \sum_{j=1}^k \left( f(\{ x_1^*, \dots, x_j^* \} \cup S_i) - f(\{ x_1^*, \dots, x_{j-1}^* \} \cup S_i) \right)$
Now, applying submodularity, we can change all of these marginal benefits of “adding one more $ S^*$ element to $ S_i$ already with some $ S^*$ stuff” to “adding one more $ S^*$ element to just $ S_i$.” In symbols, the equation above is at most

$ \displaystyle f(S_i) + \sum_{j=1}^k \left( f(S_i \cup \{ x_j^* \}) - f(S_i) \right)$

and since the greedy choice of $ S_{i+1}$ maximizes the marginal benefit over all single elements, each term in this sum is at most $ f(S_{i+1}) - f(S_i)$.
Chaining all of these together, we have $ f(S^*) - f(S_i) \leq k(f(S_{i+1}) - f(S_i))$. If we call $ a_{i} = f(S^*) - f(S_i)$, then this inequality can be rewritten as $ a_{i+1} \leq (1 - 1/k) a_{i}$. Now by induction we can relate $ a_l \leq (1 - 1/k)^l a_0$. Now use the fact that $ a_0 \leq f(S^*)$ and the common inequality $ 1-x \leq e^{-x}$ to get

$ \displaystyle a_l \leq \left( 1 - \frac{1}{k} \right)^l f(S^*) \leq e^{-l/k} f(S^*)$
And rearranging gives $ f(S_l) \geq (1 – e^{-l/k}) f(S^*)$.
$ \square$
Setting $ l=k$ gives the approximation bound we promised. But note that allowing the greedy algorithm to run longer can give much stronger guarantees, though it requires you to sacrifice the cardinality constraint. $ 1 – 1/e$ is about 63%, but doubling the size of $ S$ gives about an 86% approximation guarantee. This is great for people in the real world, because you can quantify the gains you’d get by relaxing the constraints imposed on you (which are rarely set in stone).
So this is really great! We have quantifiable guarantees on a stupidly simple algorithm, and the setting is super general. And so if you have your problem and you manage to prove your function is submodular (this is often the hardest part), then you are likely to get this nice guarantee.
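For concreteness, here is a minimal sketch of the greedy algorithm from the theorem, treating $ f$ as a black box on Python sets (the function name and the toy objective are my own):

def greedyMaximize(X, f, k):
    # At each of the k steps, add the element with the largest marginal benefit.
    S = set()
    for _ in range(k):
        best = max(X - S, key=lambda x: f(S | {x}) - f(S))
        S.add(best)
    return S

# Toy example: a coverage-style objective, which is monotone and submodular.
groups = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1, 6}}
f = lambda S: len(set().union(*[groups[x] for x in S]))
print(greedyMaximize(set(groups), f, 2))   # {'a', 'c'}, covering all six elements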
Extensions and Variations
This result on monotone submodular functions is just one part of a vast literature on finding approximation algorithms for submodular functions in various settings. In closing this post we’ll survey some of the highlights and provide references.
What we did in this post was maximize a monotone submodular function subject to a cardinality constraint $ |S| \leq k$. There are three basic variations we could do: we could drop constraints and see whether we can still get guarantees, we could look at minimization instead of maximization, and we could modify the kinds of constraints we impose on the solution.
There are a ton of different kinds of constraints, and we’ll discuss two. The first is where you need to get a certain value $ f(S) \geq q$, and you want to find the smallest set that achieves this value. Laurence Wolsey (who proved a lot of these theorems) showed in 1982 that a slight variant of the greedy algorithm can achieve a set whose size is a multiplicative factor of $ 1 + \log (\max_x f(\{ x \}))$ worse than the optimum.
The second kind of constraint is a generalization of a cardinality constraint called a knapsack constraint. This means that each item $ x \in X$ has a cost, and you have a finite budget with which to spend on elements you add to $ S$. One might expect this natural extension of the greedy algorithm to work: pick the element which maximizes the ratio of increasing the value of $ f$ to the cost (within your available budget). Unfortunately this algorithm can perform arbitrarily poorly, but there are two fun caveats. The first is that if you do both this augmented greedy algorithm and the greedy algorithm that ignores costs, then at least one of these can’t do too poorly. Specifically, one of them has to get at least a 30% approximation. This was shown by Leskovec et al in 2007. The second is that if you’re willing to spend more time in your greedy step by choosing the best subset of size 3, then you can get back to the $ 1-1/e$ approximation. This was shown by Sviridenko in 2004.
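A rough sketch of that cost-benefit rule might look like the following (my own illustration; the cost dictionary, budget, and function name are hypothetical). As noted above, this rule alone has no worst-case guarantee; the trick of Leskovec et al. is to also run the cost-ignoring greedy algorithm and keep whichever of the two answers is better.

def costBenefitGreedy(X, f, cost, budget):
    # Repeatedly add the affordable element with the best
    # (marginal gain) / (cost) ratio until nothing fits in the budget.
    S, spent = set(), 0
    while True:
        candidates = [x for x in X - S if spent + cost[x] <= budget]
        if not candidates:
            return S
        best = max(candidates, key=lambda x: (f(S | {x}) - f(S)) / cost[x])
        S.add(best)
        spent += cost[best]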
Now we could try dropping the monotonicity constraint. In this setting cardinality constraints are also superfluous, because it could be that the very large sets have low values. Now it turns out that if $ f$ has no other restrictions (in particular, if it’s allowed to be negative), then even telling whether there’s a set $ S$ with $ f(S) > 0$ is NP-hard, but the optimum could be arbitrarily large and positive when it exists. But if you require that $ f$ is non-negative, then you can get a 1/3-approximation, if you’re willing to add randomness you can get 2/5 in expectation, and with more subtle constraints you can get up to a 1/2 approximation. Anything better is NP-hard. Feige, Mirrokni, and Vondrák have a nice FOCS paper on this.
Next, we could remove the monotonicity property and try to minimize the value of $ f(S)$. It turns out that this problem always has an efficient solution, but the only algorithm I have heard of to solve it involves a very sophisticated technique called the ellipsoid algorithm. This is heavily related to linear programming and convex optimization, something which I hope to cover in more detail on this blog.
Finally, there are many interesting variations in the algorithmic procedure. For example, one could require that the elements are provided in some order (the streaming setting), and you have to pick at each step whether to put the element in your set or not. Alternatively, the objective functions might not be known ahead of time and you have to try to pick elements to jointly maximize them as they are revealed. These two settings have connections to bandit learning problems, which we’ve covered before on this blog. See this survey of Krause and Golovin for more on the connections, which also contains the main proof used in this post.
Indeed, despite the fact that many of the big results were proved in the 80’s, the analysis of submodular functions is still a big research topic. There was even a paper posted just the other day on the arXiv about its relation to ad serving! And wouldn’t you know, they proved a $ (1-1/e)$-approximation for their setting. There’s just something about $ 1-1/e$.
Graphs are among the most interesting and useful objects in mathematics. Any situation or idea that can be described by objects with connections is a graph, and one of the most prominent examples of a real-world graph that one can come up with is a social network.
Recall, if you aren’t already familiar with this blog’s gentle introduction to graphs, that a graph $ G$ is defined by a set of vertices $ V$, and a set of edges $ E$, each of which connects two vertices. For this post the edges will be undirected, meaning connections between vertices are symmetric.
One of the most common topics to talk about for graphs is the notion of a community. But what does one actually mean by that word? It’s easy to give an informal definition: a subset of vertices $ C$ such that there are many more edges between vertices in $ C$ than from vertices in $ C$ to vertices in $ V – C$ (the complement of $ C$). Try to make this notion precise, however, and you open a door to a world of difficult problems and open research questions. Indeed, nobody has yet come to a conclusive and useful definition of what it means to be a community. In this post we’ll see why this is such a hard problem, and we’ll see that it mostly has to do with the word “useful.” In future posts we plan to cover some techniques that have found widespread success in practice, but this post is intended to impress upon the reader how difficult the problem is.
The simplest idea
The simplest thing to do is to say a community is a subset of vertices which are completely connected to each other. In the technical parlance, a community is a subgraph which forms a clique. Sometimes an $ n$-clique is also called a complete graph on $ n$ vertices, denoted $ K_n$. Here’s an example of a 5-clique in a larger graph:
“Where’s Waldo” for graph theorists: a clique hidden in a larger graph.
Indeed, it seems reasonable that if we can reliably find communities at all, then we should be able to find cliques. But as fate should have it, this problem is known to be computationally intractable. In more detail, the problem of finding the largest clique in a graph is NP-hard. That essentially means we don’t have any better algorithms to find cliques in general graphs than to try all possible subsets of the vertices and check to see which, if any, form cliques. In fact it’s much worse: this problem is known to be hard to approximate to any reasonable factor in the worst case (the error of the approximation grows polynomially with the size of the graph!). So we can’t even hope to find a clique half the size of the biggest, or a thousandth the size!
But we have to take these impossibility results with a grain of salt: they only say things about the worst case graphs. And when we’re looking for communities in the real world, the worst case will never show up. Really, it won’t! In these proofs, “worst case” means that they encode some arbitrarily convoluted logic problem into a graph, so that finding the clique means solving the logic problem. To think that someone could engineer their social network to encode difficult logic problems is ridiculous.
So what about an “average case” graph? To formulate this typically means we need to consider graphs randomly drawn from a distribution.
Random graphs
The simplest kind of “randomized” graph you could have is the following. You fix some set of vertices, and then run an experiment: for each pair of vertices you flip a coin, and if the coin is heads you place an edge and otherwise you don’t. This defines a distribution on graphs called $ G(n, 1/2)$, which we can generalize to $ G(n, p)$ for a coin with bias $ p$. With a slight abuse of notation, we call $ G(n, p)$ the Erdős–Rényi random graph (it’s not a graph but a distribution on graphs). We explored this topic from a more mathematical perspective earlier on this blog.
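Here is a minimal sketch of that coin-flipping experiment, representing a graph by its set of edges (the function name is just for illustration):

import random

def erdosRenyi(n, p):
    # Flip a coin with bias p independently for each pair of vertices.
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if random.random() < p}

G = erdosRenyi(10, 0.5)
print(len(G))   # roughly half of the 45 possible edges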
So we can sample from this distribution and ask questions like: what’s the probability of the largest clique being size at least $ 20$? Indeed, cliques in Erdős–Rényi random graphs are so well understood that we know exactly how they work. For example, if $ p=1/2$ then the size of the largest clique is guaranteed (with overwhelming probability as $ n$ grows) to have size $ k(n)$ or $ k(n)+1$, where $ k(n)$ is about $ 2 \log n$. Just as much is known about other values of $ p$ as well as other properties of $ G(n,p)$, see Wikipedia for a short list.
In other words, if we wanted to find the largest clique in an Erdős–Rényi random graph, we could check all subsets of size roughly $ 2\log(n)$, which would take about $ (n / \log(n))^{\log(n)}$ time. This is pretty terrible, and I’ve never heard of an algorithm that does better (contrary to the original statement in this paragraph that showed I can’t count). In any case, it turns out that the Erdős–Rényi random graph, and using cliques to represent communities, is far from realistic. There are many reasons why this is the case, but here’s one example that fits with the topic at hand. If I thought the world’s social network was distributed according to $ G(n, 1/2)$ and communities were cliques, then I would be claiming that the largest community is of size 65 or 66. Estimated world population: 7 billion, $ 2 \log(7 \cdot 10^9) \sim 65$. Clearly this is ridiculous: there are groups of larger than 66 people that we would want to call “communities,” and there are plenty of communities that don’t form bona-fide cliques.
Another avenue shows that things are still not as easy as they seem in Erdős–Rényi land. This is the so-called planted clique problem. That is, you draw a graph $ G$ from $ G(n, 1/2)$. You give $ G$ to me and I pick a random but secret subset of $ r$ vertices and I add enough edges to make those vertices form an $ r$-clique. Then I ask you to find the $ r$-clique. Clearly it doesn’t make sense when $ r < 2 \log (n) $ because you won’t be able to tell it apart from the guaranteed cliques in $ G$. But even worse, nobody knows how to find the planted clique when $ r$ is even a little bit smaller than $ \sqrt{n}$ (like, $ r = n^{9/20}$ even). Just to solidify this with some numbers, we don’t know how to reliably find a planted clique of size 60 in a random graph on ten thousand vertices, but we do when the size of the clique goes up to 100. The best algorithms we know rely on some sophisticated tools in spectral graph theory, and their details are beyond the scope of this post.
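To make the planted clique setup concrete, here is a small sketch (my own; the function name is hypothetical) that draws $ G(n, 1/2)$ and plants an $ r$-clique on a secret set of vertices:

import random
from itertools import combinations

def plantClique(n, r):
    # Draw G(n, 1/2), then force a random secret set of r vertices into a clique.
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if random.random() < 0.5}
    secret = sorted(random.sample(range(n), r))
    for u, v in combinations(secret, 2):
        edges.add((u, v))
    return edges, secret

# The algorithmic challenge: recover `secret` when given only `edges`.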
So Erdős–Rényi graphs seem to have no hope. What’s next? There are a couple of routes we can take from here. We can try to change our random graph model to be more realistic. We can relax our notion of communities from cliques to something else. We can do both, or we can do something completely different.
Other kinds of random graphs
There is an interesting model of Barabási and Albert, often called the “preferential attachment” model, that has been described as a good model of large, quickly growing networks like the internet. Here’s the idea: you start off with a two-clique $ G = K_2$, and at each time step $ t$ you add a new vertex $ v$ to $ G$, and new edges so that the probability that the edge $ (v,w)$ is added to $ G$ is proportional to the degree of $ w$ (as a fraction of the total number of edges in $ G$). Here’s an animation of this process:
Image source: Wikipedia
The significance of this random model is that it creates graphs with a small number of hubs, and a large number of low-degree vertices. In other words, the preferential attachment model tends to “make the rich richer.” Another perspective is that the degree distribution of such a graph is guaranteed to fit a so-called power-law distribution. Informally, this means that the overall fraction of small-degree vertices gives a significant contribution to the total number of edges. This is sometimes called a “fat-tailed” distribution. Since power-law distributions are observed in a wide variety of natural settings, some have used this as justification for working in the preferential attachment setting. On the other hand, this model is known to have no significant community structure (by any reasonable definition, certainly not having cliques of nontrivial size), and this has been used as evidence against the model. I am not aware of any work done on planting dense subgraphs in graphs drawn from a preferential attachment model, but I think it’s likely to be trivial and uninteresting. On the other hand, Bubeck et al. have looked at changing the initial graph (the “seed”) from a 2-clique to something else, and seeing how that affects the overall limiting distribution.
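Here is a rough sketch of the preferential attachment process as described (my own simplification: each new vertex attaches by a single edge, chosen with probability proportional to the degrees of the existing vertices):

import random

def preferentialAttachment(steps):
    # Start from a single edge (a 2-clique) and repeatedly attach a new vertex
    # to an existing vertex chosen proportionally to its degree.
    edges = [(0, 1)]
    degreeList = [0, 1]   # each vertex appears once per incident edge
    for newVertex in range(2, steps + 2):
        target = random.choice(degreeList)   # degree-proportional choice
        edges.append((target, newVertex))
        degreeList.extend([target, newVertex])
    return edges

G = preferentialAttachment(1000)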
Another model that often shows up is a model that allows one to make a random graph starting with any fixed degree distribution, not just a power law. There are a number of models that do this in some fashion, and you’ll hear a lot of hyphenated names thrown around like Chung-Lu and Molloy-Reed and Newman-Strogatz-Watts. The one we’ll describe is quite simple. Say you start with a set of vertices $ V$, and a number $ d_v$ for each vertex $ v$, such that the sum of all the $ d_v$ is even. This condition is required because in any graph the sum of the degrees of the vertices is twice the number of edges. Then you imagine each vertex $ v$ having $ d_v$ “edge-stubs.” The name suggests a picture like the one below:
Each node has a prescribed number of “edge stubs,” which are randomly connected to form a graph.
Now you pick two edge stubs at random and connect them. One usually allows self-loops and multiple edges between vertices, so that it’s okay to pick two edge stubs from the same vertex. You keep doing this until all the edge stubs are accounted for, and this is your random graph. The degrees were fixed at the beginning, so the only randomization is in which vertices are adjacent. The same obvious biases apply, that any given vertex is more likely to be adjacent to high-degree vertices, but now we get to control the biases with much more precision.
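A minimal sketch of the edge-stub matching process might look like this (my own; it allows self-loops and multi-edges, as described):

import random

def configurationModel(degrees):
    # degrees: dict mapping each vertex to its prescribed degree; the sum must be even.
    stubs = [v for v, d in degrees.items() for _ in range(d)]
    assert len(stubs) % 2 == 0, 'the degree sum must be even'
    random.shuffle(stubs)
    # Pair up the shuffled stubs two at a time.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

print(configurationModel({'a': 2, 'b': 2, 'c': 1, 'd': 1}))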
The reason such a model is useful is that when you’re working with graphs in the real world, you usually have statistical information available. It’s simple to compute the degree of each vertex, and so you can use this random graph as a sort of “prior” distribution and look for anomalies. In particular, this is precisely how one of the leading measures of community structure works: the measure of modularity. We’ll talk about this in the next section.
Other kinds of communities
Here’s one easy way to relax our notion of communities. Rather than finding complete subgraphs, we could ask about finding very dense subgraphs (ignoring what happens outside the subgraph). We compute density as the average degree of vertices in the subgraph.
If we impose no bound on the size of the subgraph an algorithm is allowed to output, then there is an efficient algorithm for finding the densest subgraph in a given graph. The general exact solution involves solving a linear programming problem and a little extra work, but luckily there is a greedy algorithm that can get within half of the optimal density. You start with all the vertices $ S_n = V$, and remove any vertex of minimal degree to get $ S_{n-1}$. Continue until $ S_0$, and then compute the density of all the $ S_i$. The best one is guaranteed to be at least half of the optimal density. See this paper of Moses Charikar for a more formal analysis.
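Here is a sketch of that peeling procedure (mine, not code from Charikar's paper) on an adjacency-list graph, where density means average degree, i.e. twice the number of edges divided by the number of vertices:

def densestSubgraphGreedy(adjacency):
    # adjacency: dict mapping each vertex to its set of neighbors (undirected).
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    numEdges = sum(len(nbrs) for nbrs in adj.values()) // 2

    def density(vertices, edges):
        return 2 * edges / len(vertices)

    bestDensity, bestSet = density(adj, numEdges), set(adj)
    while len(adj) > 1:
        # Peel off a vertex of minimum degree and re-measure the density.
        v = min(adj, key=lambda u: len(adj[u]))
        numEdges -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        if density(adj, numEdges) > bestDensity:
            bestDensity, bestSet = density(adj, numEdges), set(adj)
    return bestDensity, bestSet   # guaranteed at least half the optimal density

# A triangle plus a dangling edge: the triangle (density 2) is the densest part.
print(densestSubgraphGreedy({1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}))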
One problem with this is that the size of the densest subgraph might be too big. Unfortunately, if you fix the size of the dense subgraph you’re looking for (say, you want to find the densest subgraph of size at most $ k$ where $ k$ is an input), then the problem once again becomes NP-hard and suffers from the same sort of inapproximability theorems as finding the largest clique.
A more important issue with this is that a dense subgraph isn’t necessarily a community. In particular, we want communities to be dense on the inside and sparse on the outside. The densest subgraph analysis, however, might rate the following graph as one big dense subgraph instead of two separately dense communities with some modest (but not too modest) amount of connections between them.
What are the correct communities here?
Indeed, we want a quantifiable a notion of “dense on the inside and sparse on the outside.” One such formalization is called modularity. Modularity works as follows. If you give me some partition of the vertices of $ G$ into two sets, modularity measures how well this partition reflects two separate communities. It’s the definition of “community” here that makes it interesting. Rather than ask about densities exactly, you can compare the densities to the expected densities in a given random graph model.
In particular, we can use the fixed-degree distribution model from the last section. If we know the degrees of all the vertices ahead of time, we can compute the number of edges we would expect to see between the two pieces of the partition at random, and compare it to the number we actually see. If the difference is large (and largely biased toward fewer edges across the partition and more edges within the two subsets), then we say it has high modularity. This involves a lot of computations (the whole measure can be written as a quadratic form via one big matrix), but the idea is simple enough. We intend to write more about modularity and implement the algorithm on this blog, but the excited reader can see the original paper of M.E.J. Newman.
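As a rough illustration (my own code, following the standard formula of Newman rather than anything verbatim from the paper), the modularity of a partition assigning each vertex $ i$ a community $ c_i$ is $ \displaystyle Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$, where $ A$ is the adjacency matrix, $ k_i$ is the degree of $ i$, and $ m$ is the number of edges. In code:

def modularity(adjacency, community):
    # adjacency: dict vertex -> set of neighbors; community: dict vertex -> label.
    degrees = {v: len(nbrs) for v, nbrs in adjacency.items()}
    twoM = sum(degrees.values())   # twice the number of edges
    Q = 0.0
    for i in adjacency:
        for j in adjacency:
            if community[i] == community[j]:
                Aij = 1 if j in adjacency[i] else 0
                Q += Aij - degrees[i] * degrees[j] / twoM
    return Q / twoM

# Two triangles joined by a single edge; the obvious split scores positively.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(modularity(G, {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}))   # 5/14, about 0.357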
Now modularity is very popular but it too has shortcomings. First, even though you can compute the modularity of a given partition, there’s still the problem of finding the partition that globally maximizes modularity. Sadly, this is known to be NP-hard. Moreover, it’s known to be NP-hard even if you’re just trying to find a partition into two pieces that maximizes modularity, and even still when the graph is regular (every vertex has the same degree).
Still worse, while there are some readily accepted heuristics that often “do well enough” in practice, we don’t even know how to approximate modularity very well. Bhaskar DasGupta has a line of work studying approximations of maximum modularity, and he has proved that for dense graphs you can’t even approximate modularity to within any constant factor. That is, the best you can do is have an approximation that gets worse as the size of the graph grows. It’s similar to the bad news we had for finding the largest clique, but not as bad. For example, when the graph is sparse it’s known that one can approximate modularity to within a $ \log(n)$ factor of the optimum, where $ n$ is the number of vertices of the graph (for cliques the factor was like $ n^c$ for some $ c$, and this is drastically worse).
Another empirical issue is that modularity seems to fail to find small communities. That is, if your graph has some large communities and some small communities, strictly maximizing the modularity is not the right thing to do. So we’ve seen that even the leading method in the field has some issues.
Something completely different
The last method I want to sketch is in the realm of “something completely different.” The notion is that if we’re given a graph, we can run some experiment on the graph, and the results of that experiment can give us insight into where the communities are.
The experiment I’m going to talk about is the random walk. Say you have a vertex $ v$ in a graph $ G$ and you want to find some vertices that are “closest” to $ v$, that is, those that are most likely to be in the same community as $ v$. What you can do is run a random walk starting at $ v$. By a “random walk” I mean you start at $ v$, you pick a neighbor at random and move to it, then repeat. You can compute statistics about the vertices you visit in a sample of such walks, and the vertices that you visit most often are those you say are “in the same community as $ v$.” One important parameter is how long the walk is, but it’s generally believed to be best if you keep it between 3 and 6 steps.
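A sketch of this experiment (my own; the function name and parameters are just for illustration): run several short walks from $ v$ and tally which vertices appear most often.

import random
from collections import Counter

def walkStatistics(adjacency, v, walkLength=4, numWalks=1000):
    # Run many short random walks from v and count how often each vertex is visited.
    # Assumes every vertex has at least one neighbor.
    counts = Counter()
    for _ in range(numWalks):
        current = v
        for _ in range(walkLength):
            current = random.choice(list(adjacency[current]))
            counts[current] += 1
    return counts.most_common()

# The vertices at the top of this list are the candidates for v's community.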
Of course, this is not a partition of the vertices, so it’s not a community detection algorithm, but you can turn it into one. Run this process for each vertex, and use it to compute a “distance” between all the pairs of vertices. Then you compute a tree of partitions by lumping the closest pairs of vertices into the same community, one at a time, until you’ve got every vertex. At each step of the way, you compute the modularity of the partition, and when you’re done you choose the partition that maximizes modularity. This algorithm as a whole is called the walktrap clustering algorithm, and was introduced by Pons and Latapy in 2005.
This sounds like a really great idea, because it’s intuitive: there’s a relatively high chance that the friends of your friends are also your friends. It’s also really great because there is an easily measurable tradeoff between runtime and quality: you can tune down the length of the random walk, and the number of samples you take for each vertex, to speed up the runtime but lower the quality of your statistical estimates. So if you’re working on huge graphs, you get a lot of control and a clear idea of exactly what’s going on inside the algorithm (something which is not immediately clear in a lot of these papers).
Unfortunately, I’m not aware of any concrete theoretical guarantees for walktrap clustering. The one bit of theoretical justification I’ve read over the last year is that you can relate the expected distances you get to certain spectral properties of the graph that are known to be related to community structure, but the lower bounds on maximizing modularity already suggest (though they do not imply) that walktrap won’t do that well in the worst case.
So many algorithms, so little time!
I have only brushed the surface of the literature on community detection, and the things I have discussed are heavily biased toward what I’ve read about and used in my own research. There are methods based on information theory, label propagation, and obscure physics processes like “spin glass” (whatever that is, it sounds frustrating).
And we have only been talking about perfect community structure. What if you want to allow people to be in multiple communities, or have communities at varying levels of granularity (e.g. a sports club within a school versus the whole student body of that school)? What if we want to allow people to be “members” of a community at varying degrees of intensity? How do we deal with noisy signals in our graphs? For example, if we get our data from observing people talk, are two people who have heated arguments considered to be in the same community? Since a lot social network data comes from sources like Twitter and Facebook where arguments are rampant, how do we distinguish between useful and useless data? More subtly, how do we determine useful information if a group within the social network are trying to mask their discovery? That is, how do we deal with adversarial noise in a graph?
And all of this is just on static graphs! What about graphs that change over time? You can keep making the problem more and more complicated as it gets more realistic.
With the huge wealth of research that has already been done just on the simplest case, and the difficult problems and known barriers to success even for the simple problems, it seems almost intimidating to even begin to try to answer these questions. But maybe that’s what makes them fascinating, not to mention that governments and big businesses pour many millions of dollars into this kind of research.
In the future of this blog we plan to derive and implement some of the basic methods of community detection. This includes, as a first outline, the modularity measure and the walktrap clustering algorithm. Considering that I’m also going to spend a large part of the summer thinking about these problems (indeed, with some of the leading researchers and upcoming stars under the sponsorship of the American Mathematical Society), it’s unlikely to end there.
I’m pleased to announce that another paper of mine is finished. This one just got accepted to MFCS 2014, which is being held in Budapest this year (this whole research thing is exciting!). This is joint work with my advisor, Lev Reyzin. As with my first paper, I’d like to explain things here on my blog a bit more informally than a scholarly article allows.
A Recent History of Graph Coloring
One of the first important things you learn when you study graphs is that coloring graphs is hard. Remember that coloring a graph with $ k$ colors means that you assign each vertex a color (a number in $ \left \{ 1, 2, \dots, k \right \}$) so that no vertex is adjacent to a vertex of the same color (no edge is monochromatic). In fact, even deciding whether a graph can be colored with just $ 3$ colors (not to mention finding such a coloring) has no known polynomial time algorithm. It’s what’s called NP-hard, which means that almost everyone believes it’s hopeless to solve efficiently in the worst case.
One might think that there’s some sort of gradient to this problem, that as the graphs get more “complicated” it becomes algorithmically harder to figure out how colorable they are. There are some notions of “simplicity” and “complexity” for graphs, but they hardly fall on a gradient. Just to give the reader an idea, here are some ways to make graph coloring easy:
Make sure your graph is planar. Then deciding 4-colorability is easy because the answer is always yes.
Make sure your graph is triangle-free and planar. Then finding a 3-coloring is easy.
Make sure your graph is perfect (which again requires knowledge about how colorable it is).
Make sure your graph doesn’t have a certain kind of induced subgraph (such as having no induced paths of length 4 or 5).
Let me emphasize that these results are very difficult and tricky to compare. The properties are inherently discrete (either perfect or imperfect, planar or not planar). The fact that the world has not yet agreed upon a universal measure of complexity for graphs (or at least one that makes graph coloring easy to understand) is not a criticism of the chef but a testament to the challenge and intrigue of the dish.
Coloring general graphs is much bleaker, where the focus has turned to approximations. You can’t “approximate” the answer to whether a graph is colorable, so now the key here is that we are actually trying to find an approximate coloring. In particular, if you’re given some graph $ G$ and you don’t know the minimum number of colors needed to color it (say it’s $ \chi(G)$, this is called the chromatic number), can you easily color it with what turns out to be, say, $ 2 \chi(G)$ colors?
Garey and Johnson (the gods of NP-hardness) proved this problem is again hard. In fact, they proved that you can’t do better than twice the number of colors. This might not seem so bad in practice, but the story gets worse. This lower bound was improved by Zuckerman, building on the work of Håstad, to depend on the size of the graph! That is, unless $ P=NP$, all efficient algorithms will use asymptotically more than $ \chi(G) n^{1 – \varepsilon}$ colors for any $ \varepsilon > 0$ in the worst case, where $ n$ is the number of vertices of $ G$. So the best you can hope for is being off by something like a multiplicative factor of $ n / \log n$. You can actually achieve this (it’s nontrivial and takes a lot of work), but it carries that aura of pity for the hopeful graph colorer.
The next avenue is to assume you know the chromatic number of your graph, and see how well you can do then. For example: if you are given the promise that a graph $ G$ is 3-colorable, can you efficiently find a coloring with 8 colors? The best would be if you could find a coloring with 4 colors, but this is already known to be NP-hard.
The best upper bounds, algorithms to find approximate colorings of 3-colorable graphs, also pitifully depend on the size of the graph. Remember I say pitiful not to insult the researchers! This decades-long line of work was extremely difficult and deserves the highest praise. It’s just frustrating that the best known algorithm to color a 3-colorable graph requires as many as $ n^{0.2}$ colors. At least it bypasses the barrier of $ n^{1 – \varepsilon}$ mentioned above, so we know that knowing the chromatic number actually does help.
The lower bounds are a bit more hopeful; it’s known to be NP-hard to color a $ k$-colorable graph using $ 2^{\sqrt[3]{k}}$ colors if $ k$ is sufficiently large. There are a handful of other linear lower bounds that work for all $ k \geq 3$, but to my knowledge this is the best asymptotic result. The big open problem (which I doubt many people have their eye on considering how hard it seems) is to find an upper bound depending only on $ k$. I wonder offhand whether a ridiculous bound like $ k^{k^k}$ colors would be considered progress, and I bet it would.
Our Idea: Resilience
So without big breakthroughs on the front of approximate graph coloring, we propose a new front for investigation. The idea is that we consider graphs which are not only colorable, but remain colorable under the adversarial operation of adding a few new edges. More formally,
Definition: A graph $ G = (V,E)$ is called $ r$-resiliently $ k$-colorable if two properties hold
$ G$ is $ k$-colorable.
For any set $ E’$ of $ r$ edges disjoint from $ E$, the graph $ G’ = (V, E \cup E’)$ is $ k$-colorable.
The simplest nontrivial example of this is 1-resiliently 3-colorable graphs. That is a graph that is 3-colorable and remains 3-colorable no matter which new edge you add. And the question we ask of this example: is there a polynomial time algorithm to 3-color a 1-resiliently 3-colorable graph? We prove in our paper that this is actually NP-hard, but it’s not a trivial thing to see.
The chief benefit of thinking about resiliently colorable graphs is that it provides a clear gradient of complexity from general graphs (zero-resilient) to the empty graph (which is $ (\binom{k+1}{2} – 1)$-resiliently $ k$-colorable). We know that the most complex case is NP-hard, and maximally resilient graphs are trivially colorable. So finding the boundary where resilience makes things easy can shed new light on graph coloring.
Indeed, we argue in the paper that lots of important graphs have stronger resilience properties than one might expect. For example, here are the resilience properties of some famous graphs.
From left to right: the Petersen graph, 2-resiliently 3-colorable; the Dürer graph, 4-resiliently 4-colorable; the Grötzsch graph, 4-resiliently 4-colorable; and the Chvátal graph, 3-resiliently 4-colorable. These are all maximally resilient (no graph is more resilient than stated) and chromatic (no graph is colorable with fewer colors)
If I were of a mind to do applied graph theory, I would love to know about the resilience properties of graphs that occur in the wild. For example, the reader probably knows the problem of register allocation is a natural graph coloring problem. I would love to know the resilience properties of such graphs, with the dream that they might be resilient enough on average to admit efficient coloring algorithms.
Unfortunately the only way that I know how to compute resilience properties is via brute-force search, and of course this only works for small graphs and small $ k$. If readers are interested I could post such a program (I wrote it in vanilla python), but for now I’ll just post a table I computed on the proportion of small graphs that have various levels of resilience (note this includes graphs that vacuously satisfy the definition).
Percentage of k-colorable graphs on 6 vertices which are n-resilient
k\n 1 2 3 4
----------------------------------------
3 58.0 22.7 5.9 1.7
4 93.3 79.3 58.0 35.3
5 99.4 98.1 94.8 89.0
6 100.0 100.0 100.0 100.0
Percentage of k-colorable graphs on 7 vertices which are n-resilient
k\n 1 2 3 4
----------------------------------------
3 38.1 8.2 1.2 0.3
4 86.7 62.6 35.0 14.9
5 98.7 95.6 88.5 76.2
6 99.9 99.7 99.2 98.3
Percentage of k-colorable graphs on 8 vertices which are n-resilient
k\n 1 2 3 4
----------------------------------------
3 21.3 2.1 0.2 0.0
4 77.6 44.2 17.0 4.5
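For the curious, here is a rough sketch (my own, not the actual program used to compute these tables) of the brute-force check described above: try every way of adding $ r$ new edges and verify $ k$-colorability each time. It is only feasible for very small graphs.

from itertools import combinations, product

def isKColorable(vertices, edges, k):
    # Brute force: try every assignment of k colors to the vertices.
    for coloring in product(range(k), repeat=len(vertices)):
        colorOf = dict(zip(vertices, coloring))
        if all(colorOf[u] != colorOf[v] for u, v in edges):
            return True
    return False

def isResilientlyColorable(vertices, edges, k, r):
    # Definition: G is k-colorable, and stays k-colorable no matter which set
    # of r currently-absent edges is added (vacuously true when fewer than r
    # edges can be added, matching the note above the tables).
    edgeSet = {frozenset(e) for e in edges}
    nonEdges = [e for e in combinations(vertices, 2) if frozenset(e) not in edgeSet]
    return (isKColorable(vertices, edges, k) and
            all(isKColorable(vertices, list(edges) + list(extra), k)
                for extra in combinations(nonEdges, r)))

# A 4-cycle stays 3-colorable after adding any one edge, but adding both diagonals gives K_4.
C4 = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
print(isResilientlyColorable(*C4, k=3, r=1))   # True
print(isResilientlyColorable(*C4, k=3, r=2))   # False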
The idea is this: if this trend continues, that only some small fraction of all 3-colorable graphs are, say, 2-resiliently 3-colorable graphs, then it should be easy to color them. Why? Because resilience imposes structure on the graphs, and that structure can hopefully be realized in a way that allows us to color easily. We don’t know how to characterize that structure yet, but we can give some structural implications for sufficiently resilient graphs.
For example, a 7-resiliently 5-colorable graph can’t have any subgraph on 6 vertices with $ \binom{6}{2} - 7$ or more edges, or else we can add enough edges to get a 6-clique which isn’t 5-colorable. This gives an obvious general property about the sizes of subgraphs in resilient graphs, but as a more concrete instance let’s think about 2-resiliently 3-colorable graphs $ G$. This property says that no set of 4 vertices may have $ \binom{4}{2} - 2 = 4$ or more edges in $ G$. This rules out 4-cycles and non-isolated triangles, but is it enough to make 3-coloring easy? We can say that $ G$ is a triangle-free graph plus a bunch of disjoint triangles, but it’s known that 3-coloring non-planar triangle-free graphs is still NP-hard, so this structure alone doesn’t make the coloring problem easy. Moreover, 2-resilience isn’t enough to make $ G$ planar. It’s not hard to construct a non-planar counterexample, but proving it’s 2-resilient is a tedious task I relegated to my computer.
Speaking of which, the problem of how to determine whether a $ k$-colorable graph is $ r$-resiliently $ k$-colorable is open. Is this problem even in NP? It certainly seems not to be, but if it had a nice characterization or even stronger necessary conditions than above, we might be able to use them to find efficient coloring algorithms.
In our paper we begin to fill in a table whose completion would characterize the NP-hardness of coloring resilient graphs:
The known complexity of k-coloring r-resiliently k-colorable graphs
Ignoring the technical notion of 2-to-1 hardness, the paper accomplishes this as follows. First, we prove some relationships between cells. In particular, if a cell is NP-hard then so are all the cells to the left and below it. So our Theorem 1, that 3-coloring 1-resiliently 3-colorable graphs is NP-hard, gives us the entire black region, though more trivial arguments give all except the (3,1) cell. Also, if a cell is in P (it’s easy to $ k$-color graphs with that resilience), then so are all cells above and to its right. We prove that $ k$-coloring $ \binom{k}{2}$-resiliently $ k$-colorable graphs is easy. This is trivial: no vertex may have degree greater than $ k-1$, and the greedy algorithm can color such graphs with $ k$ colors. So that gives us the entire light gray region.
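To spell out that last claim, here is a minimal sketch (my own) of the greedy argument: when every vertex has degree at most $ k-1$, the already-colored neighbors of a vertex occupy at most $ k-1$ colors, so one of the $ k$ colors is always free.

def greedyColor(adjacency, k):
    # Color vertices one at a time with the smallest color not used by an
    # already-colored neighbor; succeeds whenever the maximum degree is at most k - 1.
    colorOf = {}
    for v in adjacency:
        used = {colorOf[u] for u in adjacency[v] if u in colorOf}
        colorOf[v] = min(c for c in range(k) if c not in used)
    return colorOf

# A 5-cycle has maximum degree 2, so greedy 3-coloring always succeeds.
C5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
print(greedyColor(C5, 3))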
There is one additional lower bound, which comes from the fact that it’s NP-hard to $ 2^{\sqrt[3]{k}}$-color a $ k$-colorable graph. In particular, we prove that if you have any function $ f(k)$ that makes it NP-hard to $ f(k)$-color a $ k$-colorable graph, then it is NP-hard to $ f(k)$-color an $ (f(k) – k)$-resiliently $ f(k)$-colorable graph. The exponential lower bound hence gives us a nice linear lower bound, and so we have the following “sufficiently zoomed out” picture of the table
The zoomed out version of the classification table above.
The paper contains the details of how these observations are proved, in addition to the NP-hardness proof for 1-resiliently 3-colorable graphs. This leaves the following open problems:
Get an unconditional, concrete linear resilience lower bound for hardness.
Find an algorithm that colors graphs that are less resilient than $ O(k^2)$. Even determining specific cells like (4,5) or (5,9) would likely give enough insight for this.
Classify the tantalizing (3,2) cell (determine if it’s hard or easy to 3-color a 2-resiliently 3-colorable graph) or even better the (4,2) cell.
Find a way to relate resilient coloring back to general coloring. For example, if such and such cell is hard, then you can’t approximate k-coloring to within so many colors.
But Wait, There’s More!
Though this paper focuses on graph coloring, our idea of resilience doesn’t stop there (and this is one reason I like it so much!). One can imagine a notion of resilience for almost any combinatorial problem. If you’re trying to satisfy boolean formulas, you can define resilience to mean that you fix the truth value of some variable (we do this in the paper to build up to our main NP-hardness result of 3-coloring 1-resiliently 3-colorable graphs). You can define resilient set cover to allow the removal of some sets. And any other sort of graph-based problem (Traveling salesman, max cut, etc) can be resiliencified by adding or removing edges, whichever makes the problem more constrained.
So this resilience notion is quite general, though it’s hard to define precisely in a general fashion. There is a general framework called Constraint Satisfaction Problems (CSPs), but resilience here seems too general. A CSP is literally just a bunch of objects which can be assigned some set of values, and a set of constraints (k-ary 0-1-valued functions) that need to all be true for the problem to succeed. If we were to define resilience by “adding any constraint” to a given CSP, then there’s nothing to stop us from adding the negation of an existing constraint (or even the tautologically unsatisfiable constraint!). This kind of resilience would be a vacuous definition, and even if we try to rule out these edge cases, I can imagine plenty of weird things that might happen in their stead. That doesn’t mean there isn’t a nice way to generalize resilience to CSPs, but it would probably involve some sort of “constraint class” of acceptable constraints, and I don’t know a reasonable property to impose on the constraint class to make things work.
So there’s lots of room for future work here. It’s exciting to think where it will take me.