# When Greedy Algorithms are Perfect: the Matroid

Greedy algorithms are by far one of the easiest and most well-understood algorithmic techniques. There is a wealth of variations, but at its core the greedy algorithm optimizes something using the natural rule, “pick what looks best” at any step. So a greedy routing algorithm would say to a routing problem: “You want to visit all these locations with minimum travel time? Let’s start by going to the closest one. And from there to the next closest one. And so on.”

Because greedy algorithms are so simple, researchers have naturally made a big effort to understand their performance. Under what conditions will they actually solve the problem we’re trying to solve, or at least get close? In a previous post we gave some easy-to-state conditions under which greedy gives a good approximation, but the obvious question remains: can we characterize when greedy algorithms give an optimal solution to a problem?

The answer is yes, and the framework that enables us to do this is called a matroid. That is, if we can phrase the problem we’re trying to solve as a matroid, then the greedy algorithm is guaranteed to be optimal. Let’s start with an example when greedy is provably optimal: the minimum spanning tree problem. Throughout the article we’ll assume the reader is familiar with the very basics of linear algebra and graph theory (though we’ll remind ourselves what a minimum spanning tree is shortly). For a refresher, this blog has primers on both subjects. But first, some history.

## History

Matroids were first introduced by Hassler Whitney in 1935, and independently discovered a little later by B.L. van der Waerden (a big name in combinatorics). They were both interested in devising a general description of “independence,” the properties of which are strikingly similar when specified in linear algebra and graph theory. Since then the study of matroids has blossomed into a large and beautiful theory, one part of which is the characterization of the greedy algorithm: greedy is optimal on a problem if and only if the problem can be represented as a matroid. Mathematicians have also characterized which matroids can be modeled as spanning trees of graphs (we will see this momentarily). As such, matroids have become a standard topic in the theory and practice of algorithms.

## Minimum Spanning Trees

It is often natural in an undirected graph $G = (V,E)$ to find a connected subset of edges that touch every vertex. As an example, if you’re working on a power network you might want to identify a “backbone” of the network so that you can use the backbone to cheaply travel from any node to any other node. Similarly, in a routing network (like the internet) it costs a lot of money to lay down cable, it’s in the interest of the internet service providers to design analogous backbones into their infrastructure.

A minimal subset of edges in a backbone like this is guaranteed to form a tree. This is simply because if you have a cycle in your subgraph then removing any edge on that cycle doesn’t break connectivity or the fact that you can get from any vertex to any other (and trees are the maximal subgraphs without cycles). As such, these “backbones” are called spanning trees. “Span” here means that you can get from any vertex to any other vertex, and it suggests the connection to linear algebra that we’ll describe later, and it’s a simple property of a tree that there is a unique path between any two vertices in the tree.

An example of a spanning tree

When your edges $e \in E$ have nonnegative weights $w_e \in \mathbb{R}^{\geq 0}$, we can further ask to find a minimum cost spanning tree. The cost of a spanning tree $T$ is just the sum of its edges, and it’s important enough of a definition to offset.

Definition: A minimum spanning tree $T$ of a weighted graph $G$ (with weights $w_e \geq 0$ for $e \in E$) is a spanning tree which minimizes the quantity

$w(T) = \sum_{e \in T} w_e$

There are a lot of algorithms to find minimal spanning trees, but one that will lead us to matroids is Kruskal’s algorithm. It’s quite simple. We’ll maintain a forest $F$ in $G$, which is just a subgraph consisting of a bunch of trees that may or may not be connected. At the beginning $F$ is just all the vertices with no edges. And then at each step we add to $F$ the edge $e$ whose weight is smallest and also does not introduce any cycles into $F$. If the input graph $G$ is connected then this will always produce a minimal spanning tree.

Theorem: Kruskal’s algorithm produces a minimal spanning tree of a connected graph.

Proof. Call $F_t$ the forest produced at step $t$ of the algorithm. Then $F_0$ is the set of all vertices of $G$ and $F_{n-1}$ is the final forest output by Kruskal’s (as a quick exercise, prove all spanning trees on $n$ vertices have $n-1$ edges, so we will stop after $n-1$ rounds). It’s clear that $F_{n-1}$ is a tree because the algorithm guarantees no $F_i$ will have a cycle. And any tree with $n-1$ edges is necessarily a spanning tree, because if some vertex were left out then there would be $n-1$ edges on a subgraph of $n-1$ vertices, necessarily causing a cycle somewhere in that subgraph.

Now we’ll prove that $F_{n-1}$ has minimal cost. We’ll prove this in a similar manner to the general proof for matroids. Indeed, say you had a tree $T$ whose cost is strictly less than that of $F_{n-1}$ (we can also suppose that $T$ is minimal, but this is not necessary). Pick the minimal weight edge $e \in T$ that is not in $F_{n-1}$. Adding $e$ to $F_{n-1}$ introduces a unique cycle $C$ in $F_{n-1}$. This cycle has some strange properties. First, $e$ has the highest cost of any edge on $C$. For otherwise, Kruskal’s algorithm would have chosen it before the heavier weight edges. Second, there is another edge in $C$ that’s not in $T$ (because $T$ was a tree it can’t have the entire cycle). Call such an edge $e'$. Now we can remove $e'$ from $F_{n-1}$ and add $e$. This can only increase the total cost of $F_{n-1}$, but this transformation produces a tree with one more edge in common with $T$ than before. This contradicts that $T$ had strictly lower weight than $F_{n-1}$, because repeating the process we described would eventually transform $F_{n-1}$ into $T$ exactly, while only increasing the total cost.

$\square$

Just to recap, we defined sets of edges to be “good” if they did not contain a cycle, and a spanning tree is a maximal set of edges with this property. In this scenario, the greedy algorithm performed optimally at finding a spanning tree with minimal total cost.

## Columns of Matrices

Now let’s consider a different kind of problem. Say I give you a matrix like this one:

$\displaystyle A = \begin{pmatrix} 2 & 0 & 1 & -1 & 0 \\ 0 & -4 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 7 \end{pmatrix}$

In the standard interpretation of linear algebra, this matrix represents a linear function $f$ from one vector space $V$ to another $W$, with the basis $(v_1, \dots, v_5)$ of $V$ being represented by columns and the basis $(w_1, w_2, w_3)$ of $W$ being represented by the rows. Column $j$ tells you how to write $f(v_j)$ as a linear combination of the $w_i$, and in so doing uniquely defines $f$.

Now one thing we want to calculate is the rank of this matrix. That is, what is the dimension of the image of $V$ under $f$? By linear algebraic arguments we know that this is equivalent to asking “how many linearly independent columns of $A$ can we find”? An interesting consequence is that if you have two sets of columns that are both linearly independent and maximally so (adding any other column to either set would necessarily introduce a dependence in that set), then these two sets have the same size. This is part of why the rank of a matrix is well-defined.

If we were to give the columns of $A$ costs, then we could ask about finding the minimal-cost maximally-independent column set. It sounds like a mouthful, but it’s exactly the same idea as with spanning trees: we want a set of vectors that spans the whole column space of $A$, but contains no “cycles” (linearly dependent combinations), and we want the cheapest such set.

So we have two kinds of “independence systems” that seem to be related. One interesting question we can ask is whether these kinds of independence systems are “the same” in a reasonable way. Hardcore readers of this blog may see the connection quite quickly. For any graph $G = (V,E)$, there is a natural linear map from $E$ to $V$, so that a linear dependence among the columns (edges) corresponds to a cycle in $G$. This map is called the incidence matrix by combinatorialists and the first boundary map by topologists.

The map is easy to construct: for each edge $e = (v_i,v_j)$ you add a column with a 1 in the $j$-th row and a $-1$ in the $i$-th row. Then taking a sum of edges gives you zero if and only if the edges form a cycle. So we can think of a set of edges as “independent” if they don’t contain a cycle. It’s a little bit less general than independence over $\mathbb{R}$, but you can make it exactly the same kind of independence if you change your field from real numbers to $\mathbb{Z}/2\mathbb{Z}$. We won’t do this because it will detract from our end goal (to analyze greedy algorithms in realistic settings), but for further reading this survey of Oxley assumes that perspective.

So with the recognition of how similar these notions of independence are, we are ready to define matroids.

## The Matroid

So far we’ve seen two kinds of independence: “sets of edges with no cycles” (also called forests) and “sets of linearly independent vectors.” Both of these share two trivial properties: there are always nonempty independent sets, and every subset of an independent set is independent. We will call any family of subsets with this property an independence system.

Definition: Let $X$ be a finite set. An independence system over $X$ is a family $\mathscr{I}$ of subsets of $X$ with the following two properties.

1. $\mathscr{I}$ is nonempty.
2. If $I \in \mathscr{I}$, then so is every subset of $I$.

This is too general to characterize greedy algorithms, so we need one more property shared by our examples. There are a few things we do, but here’s one nice property that turns out to be enough.

Definition: A matroid $M = (X, \mathscr{I})$ is a set $X$ and an independence system $\mathscr{I}$ over $X$ with the following property:

If $A, B$ are in $\mathscr{I}$ with $|A| = |B| + 1$, then there is an element $x \in A \setminus B$ such that $B \cup \{ a \} \in \mathscr{I}$.

In other words, this property says if I have an independent set that is not maximally independent, I can grow the set by adding some suitably-chosen element from a larger independent set. We’ll call this the extension property. For a warmup exercise, let’s prove that the extension property is equivalent to the following (assuming the other properties of a matroid):

For every subset $Y \subset X$, all maximal independent sets contained in $Y$ have equal size.

Proof. For one direction, if you have two maximal sets $A, B \subset Y \subset X$ that are not the same size (say $A$ is bigger), then you can take any subset of $A$ whose size is exactly $|B| + 1$, and use the extension property to make $B$ larger, a contradiction. For the other direction, say that I know all maximal independent sets of any $Y \subset X$ have the same size, and you give me $A, B \subset X$. I need to find an $a \in A \setminus B$ that I can add to $B$ and keep it independent. What I do is take the subset $Y = A \cup B$. Now the sizes of $A, B$ don’t change, but $B$ can’t be maximal inside $Y$ because it’s smaller than $A$ ($A$ might not be maximal either, but it’s still independent). And the only way to extend $B$ is by adding something from $A$, as desired.

$\square$

So we can use the extension property and the cardinality property interchangeably when talking about matroids. Continuing to connect matroid language to linear algebra and graph theory, the maximal independent sets of a matroid are called bases, the size of any basis is the rank of the matroid, and the minimal dependent sets are called circuits. In fact, you can characterize matroids in terms of the properties of their circuits, which are dual to the properties of bases (and hence all independent sets) in a very concrete sense.

But while you could spend all day characterizing the many kinds of matroids and comatroids out there, we are still faced with the task of seeing how the greedy algorithm performs on a matroid. That is, suppose that your matroid $M = (X, \mathscr{I})$ has a nonnegative real number $w(x)$ associated with each $x \in X$. And suppose we had a black-box function to determine if a given set $S \subset X$ is independent. Then the greedy algorithm maintains a set $B$, and at every step adds a minimum weight element that maintains the independence of $B$. If we measure the cost of a subset by the sum of the weights of its elements, then the question is whether the greedy algorithm finds a minimum weight basis of the matroid.

The answer is even better than yes. In fact, the answer is that the greedy algorithm performs perfectly if and only if the problem is a matroid! More rigorously,

Theorem: Suppose that $M = (X, \mathscr{I})$ is an independence system, and that we have a black-box algorithm to determine whether a given set is independent. Define the greedy algorithm to iteratively adds the cheapest element of $X$ that maintains independence. Then the greedy algorithm produces a maximally independent set $S$ of minimal cost for every nonnegative cost function on $X$, if and only if $M$ is a matroid.

It’s clear that the algorithm will produce a set that is maximally independent. The only question is whether what it produces has minimum weight among all maximally independent sets. We’ll break the theorem into the two directions of the “if and only if”:

Part 1: If $M$ is a matroid, then greedy works perfectly no matter the cost function.
Part 2: If greedy works perfectly for every cost function, then $M$ is a matroid.

Proof of Part 1.

Call the cost function $w : X \to \mathbb{R}^{\geq 0}$, and suppose that the greedy algorithm picks elements $B = \{ x_1, x_2, \dots, x_r \}$ (in that order). It’s easy to see that $w(x_1) \leq w(x_2) \leq \dots \leq w(x_r)$. Now if you give me any list of $r$ independent elements $y_1, y_2, \dots, y_r \in X$ that has $w(y_1) \leq \dots \leq w(y_r)$, I claim that $w(x_i) \leq w(y_i)$ for all $i$. This proves what we want, because if there were a basis of size $r$ with smaller weight, sorting its elements by weight would give a list contradicting this claim.

To prove the claim, suppose to the contrary that it were false, and for some $k$ we have $w(x_k) > w(y_k)$. Moreover, pick the smallest $k$ for which this is true. Note $k > 1$, and so we can look at the special sets $S = \{ x_1, \dots, x_{k-1} \}$ and $T = \{ y_1, \dots, y_k \}$. Now $|T| = |S|+1$, so by the matroid property there is some $j$ between $1$ and $r$ so that $S \cup \{ y_j \}$ is an independent set (and $y_j$ is not in $S$). But then $w(y_j) \leq w(y_k) < w(x_k)$, and so the greedy algorithm would have picked $y_j$ before it picks $x_k$ (and the strict inequality means they’re different elements). This contradicts how the greedy algorithm runs, and hence proves the claim.

Proof of Part 2.

We’ll prove this contrapositively as follows. Suppose we have our independence system and it doesn’t satisfy the last matroid condition. Then we’ll construct a special weight function that causes the greedy algorithm to fail. So let $A,B$ be independent sets with $|A| = |B| + 1$, but for every $a \in A \setminus B$ adding $a$ to $B$ never gives you an independent set.

Now what we’ll do is define our weight function so that the greedy algorithm picks the elements we want in the order we want (roughly). In particular, we’ll assign all elements of $A \cap B$ a tiny weight we’ll call $w_1$. For elements of $B - A$ we’ll use $w_2$, and for $A - B$ we’ll use $w_3$, with $w_4$ for everything else. In a more compact notation:

We need two things for this weight function to screw up the greedy algorithm. The first is that $w_1 < w_2 < w_3 < w_4$, so that greedy picks the elements in the order we want. Note that this means it’ll first pick all of $A \cap B$, and then all of $B - A$, and by assumption it won’t be able to pick anything from $A - B$, but since $B$ is assumed to be non-maximal, we have to pick at least one element from $X - (A \cup B)$ and pay $w_4$ for it.

So the second thing we want is that the cost of doing greedy is worse than picking any maximally independent set that contains $A$ (and we know that there has to be some maximal independent set containing $A$). In other words, if we call $m$ the size of a maximally independent set, we want

$\displaystyle |A \cap B| w_1 + |B-A|w_2 + (m - |B|)w_4 > |A \cap B|w_1 + |A-B|w_3 + (m-|A|)w_4$

This can be rearranged (using the fact that $|A| = |B|+1$) to

$\displaystyle w_4 > |A-B|w_3 - |B-A|w_2$

The point here is that the greedy picks too many elements of weight $w_4$, since if we were to start by taking all of $A$ (instead of all of $B$), then we could get by with one fewer. That might not be optimal, but it’s better than greedy and that’s enough for the proof.

So we just need to make $w_4$ large enough to make this inequality hold, while still maintaining $w_2 < w_3$. There are probably many ways to do this, and here’s one. Pick some $0 < \varepsilon < 1$, and set

It’s trivial that $w_1 < w_2$ and $w_3 < w_4$. For the rest we need some observations. First, the fact that $|A-B| = |B-A| + 1$ implies that $w_2 < w_3$. Second, both $|A-B|$ and $|B-A|$ are nonempty, since otherwise the second property of independence systems would contradict our assumption that augmenting $B$ with elements of $A$ breaks independence. Using this, we can divide by these quantities to get

$\displaystyle w_4 = 2 > 1 = \frac{|A-B|(1 + \varepsilon)}{|A-B|} - \frac{|B-A|\varepsilon}{|B-A|}$

This proves the claim and finishes the proof.

$\square$

As a side note, we proved everything here with respect to minimizing the sum of the weights, but one can prove an identical theorem for maximization. The only part that’s really different is picking the clever weight function in part 2. In fact, you can convert between the two by defining a new weight function that subtracts the old weights from some fixed number $N$ that is larger than any of the original weights. So these two problems really are the same thing.

This is pretty amazing! So if you can prove your problem is a matroid then you have an awesome algorithm automatically. And if you run the greedy algorithm for fun and it seems like it works all the time, then that may be hinting that your problem is a matroid. This is one of the best situations one could possibly hope for.

But as usual, there are a few caveats to consider. They are both related to efficiency. The first is the black box algorithm for determining if a set is independent. In a problem like minimum spanning tree or finding independent columns of a matrix, there are polynomial time algorithms for determining independence. These two can both be done, for example, with Gaussian elimination. But there’s nothing to stop our favorite matroid from requiring an exponential amount of time to check if a set is independent. This makes greedy all but useless, since we need to check for independence many times in every round.

Another, perhaps subtler, issue is that the size of the ground set $X$ might be exponentially larger than the rank of the matroid. In other words, at every step our greedy algorithm needs to find a new element to add to the set it’s building up. But there could be such a huge ocean of candidates, all but a few of which break independence. In practice an algorithm might be working with $X$ implicitly, so we could still hope to solve the problem if we had enough knowledge to speed up the search for a new element.

There are still other concerns. For example, a naive approach to implementing greedy takes quadratic time, since you may have to look through every element of $X$ to find the minimum-cost guy to add. What if you just have to have faster runtime than $O(n^2)$? You can still be interested in finding more efficient algorithms that still perform perfectly, and to the best of my knowledge there’s nothing that says that greedy is the only exact algorithm for your favorite matroid. And then there are models where you don’t have direct/random access to the input, and lots of other ways that you can improve on greedy. But those stories are for another time.

Until then!

# When Greedy Algorithms are Good Enough: Submodularity and the (1 – 1/e)-Approximation

Greedy algorithms are among the simplest and most intuitive algorithms known to humans. Their name essentially gives their description: do the thing that looks best right now, and repeat until nothing looks good anymore or you’re forced to stop. Some of the best situations in computer science are also when greedy algorithms are optimal or near-optimal. There is a beautiful theory of this situation, known as the theory of matroids. We haven’t covered matroids on this blog (at some point we will), but in this post we will focus on the next best thing: when the greedy algorithm guarantees a reasonably good approximation to the optimal solution.

This situation isn’t hard to formalize, and we’ll make it as abstract as possible. Say you have a set of objects $X$, and you’re looking to find the “best” subset $S \subset X$. Here “best” is just measured by a fixed (known, efficiently computable) objective function $f : 2^X \to \mathbb{R}$. That is, $f$ accepts as input subsets of $X$ and outputs numbers so that better subsets have larger numbers. Then the goal is to find a subset maximizing $X$.

In this generality the problem is clearly impossible. You’d have to check all subsets to be sure you didn’t miss the best one. So what conditions do we need on either $X$ or $f$ or both that makes this problem tractable? There are plenty you could try, but one very rich property is submodularity.

## The Submodularity Condition

I think the simplest way to explain submodularity is in terms of coverage. Say you’re starting a new radio show and you have to choose which radio stations to broadcast from to reach the largest number of listeners. For simplicity say each radio station has one tower it broadcasts from, and you have a good estimate of the number of listeners you would reach if you broadcast from a given tower. For more simplicity, say it costs the same to broadcast from each tower, and your budget restricts you to a maximum of ten stations to broadcast from. So the question is: how do you pick towers to maximize your overall reach?

The hidden condition here is that some towers overlap in which listeners they reach. So if you broadcast from two towers in the same city, a listener who has access to both will just pick one or the other. In other words, there’s a diminished benefit to picking two overlapping towers if you already have chosen one.

In our version of the problem, picking both of these towers has some small amount of “overkill.”

This “diminishing returns” condition is a general idea you can impose on any function that takes in subsets of a given set and produces numbers. If $X$ is a set then for what seems like a strange reason we denote the set of all subsets of $X$ by $2^X$. So we can state this condition more formally,

Definition: Let $X$ be a finite set. A function $f: 2^X \to \mathbb{R}$ is called submodular if for all subsets $S \subset T \subset X$ and all $x \in X \setminus T$,

$\displaystyle f(S \cup \{ x \}) - f(S) \geq f(T \cup \{ x \}) - f(T)$

In other words, if $f$ measures “benefit,” then the marginal benefit of adding $x$ to $S$ is at least as high as the marginal benefit of adding it to $T$. Since $S \subset T$ and $x$ are all arbitrary, this is as general as one could possibly make it.

Before we start doing things with submodular functions, let’s explore some basic properties. The first is an equivalent definition of submodularity

Proposition: $f$ is submodular if and only if for all $A, B \subset X$, it holds that

$\displaystyle f(A \cap B) + f(A \cup B) \leq f(A) + f(B)$.

Proof. If we assume $f$ has the condition from this proposition, then we can set $A=T, B=S \cup \{ x \}$, and the formula just works out. Conversely, if we have the condition from the definition, then using the fact that $A \cap B \subset B$ we can inductively apply the inequality to each element of $A \setminus B$ to get

$\displaystyle f(A \cup B) - f(B) \leq f(A) - f(A \cap B)$

$\square$

Next, we can tweak and combine submodular functions to get more submodular functions. In particular, non-negative linear combinations of sub-modular functions are submodular. In other words, if $f_1, \dots, f_k$ are submodular on the same set $X$, and $\alpha_1, \dots, \alpha_k$ are all non-negative reals, then $\alpha_1 f_1 + \dots + \alpha_k f_k$ is also a submodular function on $X$. It’s an easy exercise in applying the definition to see why this is true. This is important because when we’re designing objectives to maximize, we can design them by making some simple submodular pieces, and then picking an appropriate combination of those pieces.

The second property we need to impose on a submodular function is monotonicity. That is, as your sets get more elements added to them, their value under $f$ only goes up. In other words, $f$ is monotone when $S \subset T$ then $f(S) \leq f(T)$. An interesting property of functions that are both submodular and monotone is that the truncation of such a function is also submodular and monotone. In other words, $\textup{min}(f(S), c)$ is still submodular when $f$ is monotone submodular and $c$ is a constant.

## Submodularity and Monotonicity Give 1 – 1/e

The wonderful thing about submodular functions is that we have a lot of great algorithmic guarantees for working with them. We’ll prove right now that the coverage problem (while it might be hard to solve in general) can be approximated pretty well by the greedy algorithm.

Here’s the algorithmic setup. I give you a finite set $X$ and an efficient black-box to evaluate $f(S)$ for any subset $S \subset X$ you want. I promise you that $f$ is monotone and submodular. Now I give you an integer $k$ between 1 and the size of $X$, and your task is to quickly find a set $S$ of size $k$ for which $f(S)$ is maximal among all subsets of size $k$. That is, you design an algorithm that will work for any $k, X, f$ and runs in polynomial time in the sizes of $X, k$.

In general this problem is NP-hard, meaning you’re not going to find a solution that works in the worst case (if you do, don’t call me; just claim your million dollar prize). So how well can we approximate the optimal value for $f(S)$ by a different set of size $k$? The beauty is that, if your function is monotone and submodular, you can guarantee to get within 63% of the optimum. The hope (and reality) is that in practice it will often perform much better, but still this is pretty good! More formally,

Theorem: Let $f$ be a monotone, submodular, non-negative function on $X$. The greedy algorithm, which starts with $S$ as the empty set and at every step picks an element $x$ which maximizes the marginal benefit $f(S \cup \{ x \}) - f(S)$, provides a set $S$ that achieves a $(1- 1/e)$-approximation of the optimum.

We’ll prove this in just a little bit more generality, and the generality is quite useful. If we call $S_1, S_2, \dots, S_l$ the sets chosen by the greedy algorithm (where now we might run the greedy algorithm for $l > k$ steps), then for all $l, k$, we have

$\displaystyle f(S_l) \geq \left ( 1 - e^{-l/k} \right ) \max_{T: |T| \leq k} f(T)$

This allows us to run the algorithm for more than $k$ steps to get a better approximation by sets of larger size, and quantify how much better the guarantee on that approximation would be. It’s like an algorithmic way of hedging your risk. So let’s prove it.

Proof. Let’s set up some notation first. Fix your $l$ and $k$, call $S_i$ the set chosen by the greedy algorithm at step $i$, and call $S^*$ the optimal subset of size $k$. Further call $\textup{OPT}$ the value of the best set $f(S^*)$. Call $x_1^*, \dots, x_k^*$ the elements of $S^*$ (the order is irrelevant). Now for every $i < l$ monotonicity gives us $f(S^*) \leq f(S^* \cup S_i)$. We can unravel this into a sum of marginal gains of adding single elements. The first step is

$\displaystyle f(S^* \cup S_i) = f(S^* \cup S_i) - f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i) + f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i)$

The second step removes $x_{k-1}^*$, from the last term, the third removes $x_{k-2}^*$, and so on until we have removed all of $S^*$ and get this sum

$\displaystyle f(S^* \cup S_i) = f(S_i) + \sum_{j=1}^k \left ( f(S_i \cup \{ x_1^*, \dots, x_j^* \}) - f(S_i \cup \{ x_1^*, \dots, x_{j-1}^* \} ) \right )$

Now, applying submodularity, we can change all of these marginal benefits of “adding one more $S^*$ element to $S_i$ already with some $S^*$ stuff” to “adding one more $S^*$ element to just $S_i$.” In symbols, the equation above is at most

$\displaystyle f(S_i) + \sum_{x \in S^*} f(S_i \cup \{ x \}) - f(S_i)$

and because $S_{i+1}$ is greedily chosen to maximize the benefit of adding a single element, so the above is at most

$\displaystyle f(S_i) + \sum_{x \in S^*} f(S_{i+1}) - f(S_i) = f(S_i) + k(f(S_{i+1}) - f(S_i))$

Chaining all of these together, we have $f(S^*) - f(S_i) \leq k(f(S_{i+1}) - f(S_i))$. If we call $a_{i} = f(S^*) - f(S_i)$, then this inequality can be rewritten as $a_{i+1} \leq (1 - 1/k) a_{i}$. Now by induction we can relate $a_l \leq (1 - 1/k)^l a_0$. Now use the fact that $a_0 \leq f(S^*)$ and the common inequality $1-x \leq e^{-x}$ to get

$\displaystyle a_l = f(S^*) - f(S_l) \leq e^{-l/k} f(S^*)$

And rearranging gives $f(S_l) \geq (1 - e^{-l/k}) f(S^*)$.

$\square$

Setting $l=k$ gives the approximation bound we promised. But note that allowing the greedy algorithm to run longer can give much stronger guarantees, though it requires you to sacrifice the cardinality constraint. $1 - 1/e$ is about 63%, but doubling the size of $S$ gives about an 86% approximation guarantee. This is great for people in the real world, because you can quantify the gains you’d get by relaxing the constraints imposed on you (which are rarely set in stone).

So this is really great! We have quantifiable guarantees on a stupidly simple algorithm, and the setting is super general. And so if you have your problem and you manage to prove your function is submodular (this is often the hardest part), then you are likely to get this nice guarantee.

## Extensions and Variations

This result on monotone submodular functions is just one part of a vast literature on finding approximation algorithms for submodular functions in various settings. In closing this post we’ll survey some of the highlights and provide references.

What we did in this post was maximize a monotone submodular function subject to a cardinality constraint $|S| \leq k$. There are three basic variations we could do: we could drop constraints and see whether we can still get guarantees, we could look at minimization instead of maximization, and we could modify the kinds of constraints we impose on the solution.

There are a ton of different kinds of constraints, and we’ll discuss two. The first is where you need to get a certain value $f(S) \geq q$, and you want to find the smallest set that achieves this value. Laurence Wolsey (who proved a lot of these theorems) showed in 1982 that a slight variant of the greedy algorithm can achieve a set whose size is a multiplicative factor of $1 + \log (\max_x f(\{ x \}))$ worse than the optimum.

The second kind of constraint is a generalization of a cardinality constraint called a knapsack constraint. This means that each item $x \in X$ has a cost, and you have a finite budget with which to spend on elements you add to $S$. One might expect this natural extension of the greedy algorithm to work: pick the element which maximizes the ratio of increasing the value of $f$ to the cost (within your available budget). Unfortunately this algorithm can perform arbitrarily poorly, but there are two fun caveats. The first is that if you do both this augmented greedy algorithm and the greedy algorithm that ignores costs, then at least one of these can’t do too poorly. Specifically, one of them has to get at least a 30% approximation. This was shown by Leskovec et al in 2007. The second is that if you’re willing to spend more time in your greedy step by choosing the best subset of size 3, then you can get back to the $1-1/e$ approximation. This was shown by Sviridenko in 2004.

Now we could try dropping the monotonicity constraint. In this setting cardinality constraints are also superfluous, because it could be that the very large sets have low values. Now it turns out that if $f$ has no other restrictions (in particular, if it’s allowed to be negative), then even telling whether there’s a set $S$ with $f(S) > 0$ is NP-hard, but the optimum could be arbitrarily large and positive when it exists. But if you require that $f$ is non-negative, then you can get a 1/3-approximation, if you’re willing to add randomness you can get 2/5 in expectation, and with more subtle constraints you can get up to a 1/2 approximation. Anything better is NP-hard. Fiege, Mirrokni, and Vondrak have a nice FOCS paper on this.

Next, we could remove the monotonicity property and try to minimize the value of $f(S)$. It turns out that this problem always has an efficient solution, but the only algorithm I have heard of to solve it involves a very sophisticated technique called the ellipsoid algorithm. This is heavily related to linear programming and convex optimization, something which I hope to cover in more detail on this blog.

Finally, there are many interesting variations in the algorithmic procedure. For example, one could require that the elements are provided in some order (the streaming setting), and you have to pick at each step whether to put the element in your set or not. Alternatively, the objective functions might not be known ahead of time and you have to try to pick elements to jointly maximize them as they are revealed. These two settings have connections to bandit learning problems, which we’ve covered before on this blog. See this survey of Krause and Golovin for more on the connections, which also contains the main proof used in this post.

Indeed, despite the fact that many of the big results were proved in the 80’s, the analysis of submodular functions is still a big research topic. There was even a paper posted just the other day on the arXiv about it’s relation to ad serving! And wouldn’t you know, they proved a $(1-1/e)$-approximation for their setting. There’s just something about $1-1/e$.

Until next time!

# Linear Programming and the Most Affordable Healthy Diet — Part 1

Optimization is by far one of the richest ways to apply computer science and mathematics to the real world. Everybody is looking to optimize something: companies want to maximize profits, factories want to maximize efficiency, investors want to minimize risk, the list just goes on and on. The mathematical tools for optimization are also some of the richest mathematical techniques. They form the cornerstone of an entire industry known as operations research, and advances in this field literally change the world.

The mathematical field is called combinatorial optimization, and the name comes from the goal of finding optimal solutions more efficiently than an exhaustive search through every possibility. This post will introduce the most central problem in all of combinatorial optimization, known as the linear program. Even better, we know how to efficiently solve linear programs, so in future posts we’ll write a program that computes the most affordable diet while meeting the recommended health standard.

## Generalizing a Specific Linear Program

Most optimization problems have two parts: an objective function, the thing we want to maximize or minimize, and constraints, rules we must abide by to ensure we get a valid solution. As a simple example you may want to minimize the amount of time you spend doing your taxes (objective function), but you certainly can’t spend a negative amount of time on them (a constraint).

The following more complicated example is the centerpiece of this post. Most people want to minimize the amount of money spent on food. At the same time, one needs to maintain a certain level of nutrition. For males ages 19-30, the United States National Institute for Health recommends 3.7 liters of water per day, 1,000 milligrams of calcium per day, 90 milligrams of vitamin C per day, etc.

We can set up this nutrition problem mathematically, just using a few toy variables. Say we had the option to buy some combination of oranges, milk, and broccoli. Some rough estimates [1] give the following content/costs of these foods. For 0.272 USD you can get 100 grams of orange, containing a total of 53.2mg of calcium, 40mg of vitamin C, and 87g of water. For 0.100 USD you can get 100 grams of whole milk, containing 276mg of calcium, 0mg of vitamin C, and 87g of water. Finally, for 0.381 USD you can get 100 grams of broccoli containing 47mg of calcium, 89.2mg of vitamin C, and 91g of water. Here’s a table summarizing this information:

Nutritional content and prices for 100g of three foods

Food         calcium(mg)     vitamin C(mg)      water(g)   price(USD/100g)
Broccoli     47              89.2               91         0.381
Whole milk   276             0                  87         0.100
Oranges      40              53.2               87         0.272

Some observations: broccoli is more expensive but gets the most of all three nutrients, whole milk doesn’t have any vitamin C but gets a ton of calcium for really cheap, and oranges are a somewhere in between. So you could probably tinker with the quantities and figure out what the cheapest healthy diet is. The problem is what happens when we incorporate hundreds or thousands of food items and tens of nutrient recommendations. This simple example is just to help us build up a nice formality.

So let’s continue doing that. If we denote by $b$ the number of 100g units of broccoli we decide to buy, and $m$ the amount of milk and $r$ the amount of oranges, then we can write the daily cost of food as

$\displaystyle \text{cost}(b,m,r) = 0.381 b + 0.1 m + 0.272 r$

In the interest of being compact (and again, building toward the general linear programming formulation) we can extract the price information into a single cost vector $c = (0.381, 0.1, 0.272)$, and likewise write our variables as a vector $x = (b,m,r)$. We’re implicitly fixing an ordering on the variables that is maintained throughout the problem, but the choice of ordering doesn’t matter. Now the cost function is just the inner product (dot product) of the cost vector and the variable vector $\left \langle c,x \right \rangle$. For some reason lots of people like to write this as $c^Tx$, where $c^T$ denotes the transpose of a matrix, and we imagine that $c$ and $x$ are matrices of size $3 \times 1$. I’ll stick to using the inner product bracket notation.

Now for each type of food we get a specific amount of each nutrient, and the sum of those nutrients needs to be bigger than the minimum recommendation. For example, we want at least 1,000 mg of calcium per day, so we require that $1000 \leq 47b + 276m + 40r$. Likewise, we can write out a table of the constraints by looking at the columns of our table above.

$\displaystyle \begin{matrix} 91b & + & 87m & + & 87r & \geq & 3700 & \text{(water)}\\ 47b & + & 276m & + & 40r & \geq & 1000 & \text{(calcium)} \\ 89.2b & + & 0m & + & 53.2r & \geq & 90 & \text{(vitamin C)} \end{matrix}$

In the same way that we extracted the cost data into a vector to separate it from the variables, we can extract all of the nutrient data into a matrix $A$, and the recommended minimums into a vector $v$. Traditionally the letter $b$ is used for the minimums vector, but for now we’re using $b$ for broccoli.

$A = \begin{pmatrix} 91 & 87 & 87 \\ 47 & 276 & 40 \\ 89.2 & 0 & 53.2 \end{pmatrix}$

$v = \begin{pmatrix} 3700 \\ 1000 \\ 90 \end{pmatrix}$

And now the constraint is that $Ax \geq v$, where the $\geq$ means “greater than or equal to in every coordinate.” So now we can write down the more general form of the problem for our specific matrices and vectors. That is, our problem is to minimize $\left \langle c,x \right \rangle$ subject to the constraint that $Ax \geq v$. This is often written in offset form to contrast it with variations we’ll see in a bit:

$\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to the constraint } Ax \geq v$

In general there’s no reason you can’t have a “negative” amount of one variable. In this problem you can’t buy negative broccoli, so we’ll add the constraints to ensure the variables are nonnegative. So our final form is

$\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to } Ax \geq v \\ \text{and } x \geq 0$

In general, if you have an $m \times n$ matrix $A$, a “minimums” vector $v \in \mathbb{R}^m$, and a cost vector $c \in \mathbb{R}^n$, the problem of finding the vector $x$ that minimizes the cost function while meeting the constraints is called a linear programming problem or simply a linear program.

To satiate the reader’s burning curiosity, the solution for our calcium/vitamin C problem is roughly $x = (1.01, 41.47, 0)$. That is, you should have about 100g of broccoli and 4.2kg of milk (like 4 liters), and skip the oranges entirely. The daily cost is about 4.53 USD. If this seems awkwardly large, it’s because there are cheaper ways to get water than milk.

100g of broccoli (image source: 100-grams.blogspot.com)

## Duality

Now that we’ve seen the general form a linear program and a cute example, we can ask the real meaty question: is there an efficient algorithm that solves arbitrary linear programs? Despite how widely applicable these problems seem, the answer is yes!

But before we can describe the algorithm we need to know more about linear programs. For example, say you have some vector $x$ which satisfies your constraints. How can you tell if it’s optimal? Without such a test we’d have no way to know when to terminate our algorithm. Another problem is that we’ve phrased the problem in terms of minimization, but what about problems where we want to maximize things? Can we use the same algorithm that finds minima to find maxima as well?

Both of these problems are neatly answered by the theory of duality. In mathematics in general, the best way to understand what people mean by “duality” is that one mathematical object uniquely determines two different perspectives, each useful in its own way. And typically a duality theorem provides one with an efficient way to transform one perspective into the other, and relate the information you get from both perspectives. A theory of duality is considered beautiful because it gives you truly deep insight into the mathematical object you care about.

In linear programming duality is between maximization and minimization. In particular, every maximization problem has a unique “dual” minimization problem, and vice versa. The really interesting thing is that the variables you’re trying to optimize in one form correspond to the contraints in the other form! Here’s how one might discover such a beautiful correspondence. We’ll use a made up example with small numbers to make things easy.

So you have this optimization problem

$\displaystyle \begin{matrix} \text{minimize} & 4x_1+3x_2+9x_3 & \\ \text{subject to} & x_1+x_2+x_3 & \geq 6 \\ & 2x_1+x_3 & \geq 2 \\ & x_2+x_3 & \geq 1 & \\ & x_1,x_2,x_3 & \geq 0 \end{matrix}$

Just for giggles let’s write out what $A$ and $c$ are.

$\displaystyle A = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, c = (4,3,9), v = (6,2,1)$

Say you want to come up with a lower bound on the optimal solution to your problem. That is, you want to know that you can’t make $4x_1 + 3x_2 + 9x_3$ smaller than some number $m$. The constraints can help us derive such lower bounds. In particular, every variable has to be nonnegative, so we know that $4x_1 + 3x_2 + 9x_3 \geq x_1 + x_2 + x_3 \geq 6$, and so 6 is a lower bound on our optimum. Likewise,

\displaystyle \begin{aligned}4x_1+3x_2+9x_3 & \geq 4x_1+4x_3+3x_2+3x_3 \\ &=2(2x_1 + x_3)+3(x_2+x_3) \\ & \geq 2 \cdot 2 + 3 \cdot 1 \\ &=7\end{aligned}

and that’s an even better lower bound than 6. We could try to write this approach down in general: find some numbers $y_1, y_2, y_3$ that we’ll use for each constraint to form

$\displaystyle y_1(\text{constraint 1}) + y_2(\text{constraint 2}) + y_3(\text{constraint 3})$

To make it a valid lower bound we need to ensure that the coefficients of each of the $x_i$ are smaller than the coefficients in the objective function (i.e. that the coefficient of $x_1$ ends up less than 4). And to make it the best lower bound possible we want to maximize what the right-hand-size of the inequality would be: $y_1 6 + y_2 2 + y_3 1$. If you write out these equations and the constraints you get our “lower bound” problem written as

$\displaystyle \begin{matrix} \text{maximize} & 6y_1 + 2y_2 + y_3 & \\ \text{subject to} & y_1 + 2y_2 & \leq 4 \\ & y_1 + y_3 & \leq 3 \\ & y_1+y_2 + y_3 & \leq 9 \\ & y_1,y_2,y_3 & \geq 0 \end{matrix}$

And wouldn’t you know, the matrix providing the constraints is $A^T$, and the vectors $c$ and $v$ switched places.

$\displaystyle A^T = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

This is no coincidence. All linear programs can be transformed in this way, and it would be a useful exercise for the reader to turn the above maximization problem back into a minimization problem by the same technique (computing linear combinations of the constraints to make upper bounds). You’ll be surprised to find that you get back to the original minimization problem! This is part of what makes it “duality,” because the dual of the dual is the original thing again. Often, when we fix the “original” problem, we call it the primal form to distinguish it from the dual form. Usually the primal problem is the one that is easy to interpret.

(Note: because we’re done with broccoli for now, we’re going to use $b$ to denote the constraint vector that used to be $v$.)

Now say you’re given the data of a linear program for minimization, that is the vectors $c, b$ and matrix $A$ for the problem, “minimize $\left \langle c, x \right \rangle$ subject to $Ax \geq b; x \geq 0$.” We can make a general definition: the dual linear program is the maximization problem “maximize $\left \langle b, y \right \rangle$ subject to $A^T y \leq c, y \geq 0$.” Here $y$ is the new set of variables and the superscript T denotes the transpose of the matrix. The constraint for the dual is often written $y^T A \leq c^T$, again identifying vectors with a single-column matrices, but I find the swamp of transposes pointless and annoying (why do things need to be columns?).

Now we can actually prove that the objective function for the dual provides a bound on the objective function for the original problem. It’s obvious from the work we’ve done, which is why it’s called the weak duality theorem.

Weak Duality Theorem: Let $c, A, b$ be the data of a linear program in the primal form (the minimization problem) whose objective function is $\left \langle c, x \right \rangle$. Recall that the objective function of the dual (maximization) problem is $\left \langle b, y \right \rangle$. If $x,y$ are feasible solutions (satisfy the constraints of their respective problems), then

$\left \langle b, y \right \rangle \leq \left \langle c, x \right \rangle$

In other words, the maximum of the dual is a lower bound on the minimum of the primal problem and vice versa. Moreover, any feasible solution for one provides a bound on the other.

Proof. The proof is pleasingly simple. Just inspect the quantity $\left \langle A^T y, x \right \rangle = \left \langle y, Ax \right \rangle$. The constraints from the definitions of the primal and dual give us that

$\left \langle y, b \right \rangle \leq \left \langle y, Ax \right \rangle = \left \langle A^Ty, x \right \rangle \leq \left \langle c,x \right \rangle$

The inequalities follow from the linear algebra fact that if the $u$ in $\left \langle u,v \right \rangle$ is nonnegative, then you can only increase the size of the product by increasing the components of $v$. This is why we need the nonnegativity constraints.

In fact, the world is much more pleasing. There is a theorem that says the two optimums are equal!

Strong Duality Theorem: If there are any solutions $x,y$ to the primal (minimization) problem and the dual (maximization) problem, respectively, then the two problems also have optimal solutions $x^*, y^*$, and two candidate solutions $x^*, y^*$ are optimal if and only if they produce equal objective values $\left \langle c, x^* \right \rangle = \left \langle y^*, b \right \rangle$.

The proof of this theorem is a bit more convoluted than the weak duality theorem, and the key technique is a lemma of Farkas and its variations. See the second half of these notes for a full proof. The nice thing is that this theorem gives us a way to tell if an algorithm to solve linear programs is done: maintain a pair of feasible solutions to the primal and dual problems, improve them by some rule, and stop when the two solutions give equal objective values. The hard part, then, is finding a principled and guaranteed way to improve a given pair of solutions.

On the other hand, you can also prove the strong duality theorem by inventing an algorithm that provably terminates. We’ll see such an algorithm, known as the simplex algorithm in the next post. Sneak peek: it’s a lot like Gaussian elimination. Then we’ll use the algorithm (or an equivalent industry-strength version) to solve a much bigger nutrition problem.

In fact, you can do a bit better than the strong duality theorem, in terms of coming up with a stopping condition for a linear programming algorithm. You can observe that an optimal solution implies further constraints on the relationship between the primal and the dual problems. In particular, this is called the complementary slackness conditions, and they essentially say that if an optimal solution to the primal has a positive variable then the corresponding constraint in the dual problem must be tight (is an equality) to get an optimal solution to the dual. The contrapositive says that if some constraint is slack, or a strict inequality, then either the corresponding variable is zero or else the solution is not optimal. More formally,

Theorem (Complementary Slackness Conditions): Let $A, c, b$ be the data of the primal form of a linear program, “minimize $\left \langle c, x \right \rangle$ subject to $Ax \geq b, x \geq 0$.” Then $x^*, y^*$ are optimal solutions to the primal and dual problems if any only if all of the following conditions hold.

• $x^*, y^*$ are both feasible for their respective problems.
• Whenever $x^*_i > 0$ the corresponding constraint $A^T_i y^* = c_i$ is an equality.
• Whenever $y^*_j > 0$ the corresponding constraint $A_j x^* = b_j$ is an equality.

Here we denote by $M_i$ the $i$-th row of the matrix $M$ and $v_i$ to denote the $i$-th entry of a vector. Another way to write the condition using vectors instead of English is

$\left \langle x^*, A^T y^* - c \right \rangle = 0$
$\left \langle y^*, Ax^* - b \right \rangle$

The proof follows from the duality theorems, and just involves pushing around some vector algebra. See section 6.2 of these notes.

One can interpret complementary slackness in linear programs in a lot of different ways. For us, it will simply be a termination condition for an algorithm: one can efficiently check all of these conditions for the nonzero variables and stop if they’re all satisfied or if we find a variable that violates a slackness condition. Indeed, in more mature optimization analyses, the slackness condition that is more egregiously violated can provide evidence for where a candidate solution can best be improved. For a more intricate and detailed story about how to interpret the complementary slackness conditions, see Section 4 of these notes by Joel Sobel.

Finally, before we close we should note there are geometric ways to think about linear programming. I have my preferred visualization in my head, but I have yet to find a suitable animation on the web that replicates it. Here’s one example in two dimensions. The set of constraints define a convex geometric region in the plane

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

Now the optimization function $f(x) = \left \langle c,x \right \rangle$ is also a linear function, and if you fix some output value $y = f(x)$ this defines a line in the plane. As $y$ changes, the line moves along its normal vector (that is, all these fixed lines are parallel). Now to geometrically optimize the target function, we can imagine starting with the line $f(x) = 0$, and sliding it along its normal vector in the direction that keeps it in the feasible region. We can keep sliding it in this direction, and the maximum of the function is just the last instant that this line intersects the feasible region. If none of the constraints are parallel to the family of lines defined by $f$, then this is guaranteed to occur at a vertex of the feasible region. Otherwise, there will be a family of optima lying anywhere on the line segment of last intersection.

In higher dimensions, the only change is that the lines become affine subspaces of dimension $n-1$. That means in three dimensions you’re sliding planes, in four dimensions you’re sliding 3-dimensional hyperplanes, etc. The facts about the last intersection being a vertex or a “line segment” still hold. So as we’ll see next time, successful algorithms for linear programming in practice take advantage of this observation by efficiently traversing the vertices of this convex region. We’ll see this in much more detail in the next post.

Until then!

# Community Detection in Graphs — a Casual Tour

Graphs are among the most interesting and useful objects in mathematics. Any situation or idea that can be described by objects with connections is a graph, and one of the most prominent examples of a real-world graph that one can come up with is a social network.

Recall, if you aren’t already familiar with this blog’s gentle introduction to graphs, that a graph $G$ is defined by a set of vertices $V$, and a set of edges $E$, each of which connects two vertices. For this post the edges will be undirected, meaning connections between vertices are symmetric.

One of the most common topics to talk about for graphs is the notion of a community. But what does one actually mean by that word? It’s easy to give an informal definition: a subset of vertices $C$ such that there are many more edges between vertices in $C$ than from vertices in $C$ to vertices in $V - C$ (the complement of $C$). Try to make this notion precise, however, and you open a door to a world of difficult problems and open research questions. Indeed, nobody has yet come to a conclusive and useful definition of what it means to be a community. In this post we’ll see why this is such a hard problem, and we’ll see that it mostly has to do with the word “useful.” In future posts we plan to cover some techniques that have found widespread success in practice, but this post is intended to impress upon the reader how difficult the problem is.

## The simplest idea

The simplest thing to do is to say a community is a subset of vertices which are completely connected to each other. In the technical parlance, a community is a subgraph which forms a clique. Sometimes an $n$-clique is also called a complete graph on $n$ vertices, denoted $K_n$. Here’s an example of a 5-clique in a larger graph:

“Where’s Waldo” for graph theorists: a clique hidden in a larger graph.

Indeed, it seems reasonable that if we can reliably find communities at all, then we should be able to find cliques. But as fate should have it, this problem is known to be computationally intractable. In more detail, the problem of finding the largest clique in a graph is NP-hard. That essentially means we don’t have any better algorithms to find cliques in general graphs than to try all possible subsets of the vertices and check to see which, if any, form cliques. In fact it’s much worse, this problem is known to be hard to approximate to any reasonable factor in the worst case (the error of the approximation grows polynomially with the size of the graph!). So we can’t even hope to find a clique half the size of the biggest, or a thousandth the size!

But we have to take these impossibility results with a grain of salt: they only say things about the worst case graphs. And when we’re looking for communities in the real world, the worst case will never show up. Really, it won’t! In these proofs, “worst case” means that they encode some arbitrarily convoluted logic problem into a graph, so that finding the clique means solving the logic problem. To think that someone could engineer their social network to encode difficult logic problems is ridiculous.

So what about an “average case” graph? To formulate this typically means we need to consider graphs randomly drawn from a distribution.

## Random graphs

The simplest kind of “randomized” graph you could have is the following. You fix some set of vertices, and then run an experiment: for each pair of vertices you flip a coin, and if the coin is heads you place an edge and otherwise you don’t. This defines a distribution on graphs called $G(n, 1/2)$, which we can generalize to $G(n, p)$ for a coin with bias $p$. With a slight abuse of notation, we call $G(n, p)$ the Erdős–Rényi random graph (it’s not a graph but a distribution on graphs). We explored this topic form a more mathematical perspective earlier on this blog.

So we can sample from this distribution and ask questions like: what’s the probability of the largest clique being size at least $20$? Indeed, cliques in Erdős–Rényi random graphs are so well understood that we know exactly how they work. For example, if $p=1/2$ then the size of the largest clique is guaranteed (with overwhelming probability as $n$ grows) to have size $k(n)$ or $k(n)+1$, where $k(n)$ is about $2 \log n$. Just as much is known about other values of $p$ as well as other properties of $G(n,p)$, see Wikipedia for a short list.

In other words, if we wanted to find the largest clique in an Erdős–Rényi random graph, we could check all subsets of size roughly $2\log(n)$, which would take about $(n / \log(n))^{\log(n)}$ time. This is pretty terrible, and I’ve never heard of an algorithm that does better (contrary to the original statement in this paragraph that showed I can’t count). In any case, it turns out that the Erdős–Rényi random graph, and using cliques to represent communities, is far from realistic. There are many reasons why this is the case, but here’s one example that fits with the topic at hand. If I thought the world’s social network was distributed according to $G(n, 1/2)$ and communities were cliques, then I would be claiming that the largest community is of size 65 or 66. Estimated world population: 7 billion, $2 \log(7 \cdot 10^9) \sim 65$. Clearly this is ridiculous: there are groups of larger than 66 people that we would want to call “communities,” and there are plenty of communities that don’t form bona-fide cliques.

Another avenue shows that things are still not as easy as they seem in Erdős–Rényi land. This is the so-called planted clique problem. That is, you draw a graph $G$ from $G(n, 1/2)$. You give $G$ to me and I pick a random but secret subset of $r$ vertices and I add enough edges to make those vertices form an $r$-clique. Then I ask you to find the $r$-clique. Clearly it doesn’t make sense when $r < 2 \log (n)$ because you won’t be able to tell it apart from the guaranteed cliques in $G$. But even worse, nobody knows how to find the planted clique when $r$ is even a little bit smaller than $\sqrt{n}$ (like, $r = n^{9/20}$ even). Just to solidify this with some numbers, we don’t know how to reliably find a planted clique of size 60 in a random graph on ten thousand vertices, but we do when the size of the clique goes up to 100. The best algorithms we know rely on some sophisticated tools in spectral graph theory, and their details are beyond the scope of this post.

So Erdős–Rényi graphs seem to have no hope. What’s next? There are a couple of routes we can take from here. We can try to change our random graph model to be more realistic. We can relax our notion of communities from cliques to something else. We can do both, or we can do something completely different.

## Other kinds of random graphs

There is an interesting model of Barabási and Albert, often called the “preferential attachment” model, that has been described as a good model of large, quickly growing networks like the internet. Here’s the idea: you start off with a two-clique $G = K_2$, and at each time step $t$ you add a new vertex $v$ to $G$, and new edges so that the probability that the edge $(v,w)$ is added to $G$ is proportional to the degree of $w$ (as a fraction of the total number of edges in $G$). Here’s an animation of this process:

Image source: Wikipedia

The significance of this random model is that it creates graphs with a small number of hubs, and a large number of low-degree vertices. In other words, the preferential attachment model tends to “make the rich richer.” Another perspective is that the degree distribution of such a graph is guaranteed to fit a so-called power-law distribution. Informally, this means that the overall fraction of small-degree vertices gives a significant contribution to the total number of edges. This is sometimes called a “fat-tailed” distribution. Since power-law distributions are observed in a wide variety of natural settings, some have used this as justification for working in the preferential attachment setting. On the other hand, this model is known to have no significant community structure (by any reasonable definition, certainly not having cliques of nontrivial size), and this has been used as evidence against the model. I am not aware of any work done on planting dense subgraphs in graphs drawn from a preferential attachment model, but I think it’s likely to be trivial and uninteresting. On the other hand, Bubeck et al. have looked at changing the initial graph (the “seed”) from a 2-clique to something else, and seeing how that affects the overall limiting distribution.

Another model that often shows up is a model that allows one to make a random graph starting with any fixed degree distribution, not just a power law. There are a number of models that do this to some fashion, and you’ll hear a lot of hyphenated names thrown around like Chung-Lu and Molloy-Reed and Newman-Strogatz-Watts. The one we’ll describe is quite simple. Say you start with a set of vertices $V$, and a number $d_v$ for each vertex $v$, such that the sum of all the $d_v$ is even. This condition is required because in any graph the sum of the degrees of a vertex is twice the number of edges. Then you imagine each vertex $v$ having $d_v$ “edge-stubs.” The name suggests a picture like the one below:

Each node has a prescribed number of “edge stubs,” which are randomly connected to form a graph.

Now you pick two edge stubs at random and connect them. One usually allows self-loops and multiple edges between vertices, so that it’s okay to pick two edge stubs from the same vertex. You keep doing this until all the edge stubs are accounted for, and this is your random graph. The degrees were fixed at the beginning, so the only randomization is in which vertices are adjacent. The same obvious biases apply, that any given vertex is more likely to be adjacent to high-degree vertices, but now we get to control the biases with much more precision.

The reason such a model is useful is that when you’re working with graphs in the real world, you usually have statistical information available. It’s simple to compute the degree of each vertex, and so you can use this random graph as a sort of “prior” distribution and look for anomalies. In particular, this is precisely how one of the leading measures of community structure works: the measure of modularity. We’ll talk about this in the next section.

## Other kinds of communities

Here’s one easy way to relax our notion of communities. Rather than finding complete subgraphs, we could ask about finding very dense subgraphs (ignoring what happens outside the subgraph). We compute density as the average degree of vertices in the subgraph.

If we impose no bound on the size of the subgraph an algorithm is allowed to output, then there is an efficient algorithm for finding the densest subgraph in a given graph. The general exact solution involves solving a linear programming problem and a little extra work, but luckily there is a greedy algorithm that can get within half of the optimal density. You start with all the vertices $S_n = V$, and remove any vertex of minimal degree to get $S_{n-1}$. Continue until $S_0$, and then compute the density of all the $S_i$. The best one is guaranteed to be at least half of the optimal density. See this paper of Moses Charikar for a more formal analysis.

One problem with this is that the size of the densest subgraph might be too big. Unfortunately, if you fix the size of the dense subgraph you’re looking for (say, you want to find the densest subgraph of size at most $k$ where $k$ is an input), then the problem once again becomes NP-hard and suffers from the same sort of inapproximability theorems as finding the largest clique.

A more important issue with this is that a dense subgraph isn’t necessarily a community. In particular, we want communities to be dense on the inside and sparse on the outside. The densest subgraph analysis, however, might rate the following graph as one big dense subgraph instead of two separately dense communities with some modest (but not too modest) amount of connections between them.

What are the correct communities here?

Indeed, we want a quantifiable a notion of “dense on the inside and sparse on the outside.” One such formalization is called modularity. Modularity works as follows. If you give me some partition of the vertices of $G$ into two sets, modularity measures how well this partition reflects two separate communities. It’s the definition of “community” here that makes it interesting. Rather than ask about densities exactly, you can compare the densities to the expected densities in a given random graph model.

In particular, we can use the fixed-degree distribution model from the last section. If we know the degrees of all the vertices ahead of time, we can compute the probability that we see some number of edges going between the two pieces of the partition relative to what we would see at random. If the difference is large (and largely biased toward fewer edges across the partition and more edges within the two subsets), then we say it has high modularity. This involves a lot of computations  — the whole measure can be written as a quadratic form via one big matrix — but the idea is simple enough. We intend to write more about modularity and implement the algorithm on this blog, but the excited reader can see the original paper of M.E.J. Newman.

Now modularity is very popular but it too has shortcomings. First, even though you can compute the modularity of a given partition, there’s still the problem of finding the partition that globally maximizes modularity. Sadly, this is known to be NP-hard. Mover, it’s known to be NP-hard even if you’re just trying to find a partition into two pieces that maximizes modularity, and even still when the graph is regular (every vertex has the same degree).

Still worse, while there are some readily accepted heuristics that often “do well enough” in practice, we don’t even know how to approximate modularity very well. Bhaskar DasGupta has a line of work studying approximations of maximum modularity, and he has proved that for dense graphs you can’t even approximate modularity to within any constant factor. That is, the best you can do is have an approximation that gets worse as the size of the graph grows. It’s similar to the bad news we had for finding the largest clique, but not as bad. For example, when the graph is sparse it’s known that one can approximate modularity to within a $\log(n)$ factor of the optimum, where $n$ is the number of vertices of the graph (for cliques the factor was like $n^c$ for some $c$, and this is drastically worse).

Another empirical issue is that modularity seems to fail to find small communities. That is, if your graph has some large communities and some small communities, strictly maximizing the modularity is not the right thing to do. So we’ve seen that even the leading method in the field has some issues.

## Something completely different

The last method I want to sketch is in the realm of “something completely different.” The notion is that if we’re given a graph, we can run some experiment on the graph, and the results of that experiment can give us insight into where the communities are.

The experiment I’m going to talk about is the random walk. That is, say you have a vertex $v$ in a graph $G$ and you want to find some vertices that are “closest” to $v$. That is, those that are most likely to be in the same community as $v$. What you can do is run a random walk starting at $v$. By a “random walk” I mean you start at $v$, you pick a neighbor at random and move to it, then repeat. You can compute statistics about the vertices you visit in a sample of such walks, and the vertices that you visit most often are those you say are “in the same community as $v$. One important parameter is how long the walk is, but it’s generally believed to be best if you keep it between 3-6 steps.

Of course, this is not a partition of the vertices, so it’s not a community detection algorithm, but you can turn it into one. Run this process for each vertex, and use it to compute a “distance” between all the pairs of vertices. Then you compute a tree of partitions by lumping the closest pairs of vertices into the same community, one at a time, until you’ve got every vertex. At each step of the way, you compute the modularity of the partition, and when you’re done you choose the partition that maximizes modularity. This algorithm as a whole is called the walktrap clustering algorithm, and was introduced by Pons and Latapy in 2005.

This sounds like a really great idea, because it’s intuitive: there’s a relatively high chance that the friends of your friends are also your friends. It’s also really great because there is an easily measurable tradeoff between runtime and quality: you can tune down the length of the random walk, and the number of samples you take for each vertex, to speed up the runtime but lower the quality of your statistical estimates. So if you’re working on huge graphs, you get a lot of control and a clear idea of exactly what’s going on inside the algorithm (something which is not immediately clear in a lot of these papers).

Unfortunately, I’m not aware of any concrete theoretical guarantees for walktrap clustering. The one bit of theoretical justification I’ve read over the last year is that you can relate the expected distances you get to certain spectral properties of the graph that are known to be related to community structure, but the lower bounds on maximizing modularity already suggest (though they do not imply) that walktrap won’t do that well in the worst case.

## So many algorithms, so little time!

I have only brushed the surface of the literature on community detection, and the things I have discussed are heavily biased toward what I’ve read about and used in my own research. There are methods based on information theory, label propagation, and obscure physics processes like “spin glass” (whatever that is, it sounds frustrating).

And we have only been talking about perfect community structure. What if you want to allow people to be in multiple communities, or have communities at varying levels of granularity (e.g. a sports club within a school versus the whole student body of that school)? What if we want to allow people to be “members” of a community at varying degrees of intensity? How do we deal with noisy signals in our graphs? For example, if we get our data from observing people talk, are two people who have heated arguments considered to be in the same community? Since a lot social network data comes from sources like Twitter and Facebook where arguments are rampant, how do we distinguish between useful and useless data? More subtly, how do we determine useful information if a group within the social network are trying to mask their discovery? That is, how do we deal with adversarial noise in a graph?

And all of this is just on static graphs! What about graphs that change over time? You can keep making the problem more and more complicated as it gets more realistic.

With the huge wealth of research that has already been done just on the simplest case, and the difficult problems and known barriers to success even for the simple problems, it seems almost intimidating to even begin to try to answer these questions. But maybe that’s what makes them fascinating, not to mention that governments and big businesses pour many millions of dollars into this kind of research.

In the future of this blog we plan to derive and implement some of the basic methods of community detection. This includes, as a first outline, the modularity measure and the walktrap clustering algorithm. Considering that I’m also going to spend a large part of the summer thinking about these problems (indeed, with some of the leading researchers and upcoming stars under the sponsorship of the American Mathematical Society), it’s unlikely to end there.

Until next time!