Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters).
For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?
It’s not obvious, and there are multiple factors at work: the red points have fewer samples, but we can be more certain about the position; the blue points are less certain, but the closest non-blue point to a blue point is green; and the green points are equally plausibly “close to red” and “close to blue.” The centers of masses of the three sample sets are close to an equilateral triangle. In our example the “points” don’t overlap, but of course they could. And in particular, there should probably be a nonzero distance between two points whose sample sets have the same center of mass, as below. The distance quantifies the uncertainty.
All this is to say that it’s not obvious how to define a distance measure that is consistent with perceptual ideas of what geometry and distance should be.
Solution (Earthmoverdistance): Treat each sample set $ A$ corresponding to a “point” as a discrete probability distribution, so that each sample $ x \in A$ has probability mass $ p_x = 1 / |A|$. The distance between $ A$ and $ B$ is the optional solution to the following linear program.
Each $ x \in A$ corresponds to a pile of dirt of height $ p_x$, and each $ y \in B$ corresponds to a hole of depth $ p_y$. The cost of moving a unit of dirt from $ x$ to $ y$ is the Euclidean distance $ d(x, y)$ between the points (or whatever hipster metric you want to use).
Let $ z_{x, y}$ be a real variable corresponding to an amount of dirt to move from $ x \in A$ to $ y \in B$, with cost $ d(x, y)$. Then the constraints are:
Each $ z_{x, y} \geq 0$, so dirt only moves from $ x$ to $ y$.
Every pile $ x \in A$ must vanish, i.e. for each fixed $ x \in A$, $ \sum_{y \in B} z_{x,y} = p_x$.
Likewise, every hole $ y \in B$ must be completely filled, i.e. $ \sum_{y \in B} z_{x,y} = p_y$.
The objective is to minimize the cost of doing this: $ \sum_{x, y \in A \times B} d(x, y) z_{x, y}$.
In python, using the ortools library (and leaving out a few docstrings and standard import statements, full code on Github):
from ortools.linear_solver import pywraplp
def earthmover_distance(p1, p2):
dist1 = {x: count / len(p1) for (x, count) in Counter(p1).items()}
dist2 = {x: count / len(p2) for (x, count) in Counter(p2).items()}
solver = pywraplp.Solver('earthmover_distance', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)
variables = dict()
# for each pile in dist1, the constraint that says all the dirt must leave this pile
dirt_leaving_constraints = defaultdict(lambda: 0)
# for each hole in dist2, the constraint that says this hole must be filled
dirt_filling_constraints = defaultdict(lambda: 0)
# the objective
objective = solver.Objective()
objective.SetMinimization()
for (x, dirt_at_x) in dist1.items():
for (y, capacity_of_y) in dist2.items():
amount_to_move_x_y = solver.NumVar(0, solver.infinity(), 'z_{%s, %s}' % (x, y))
variables[(x, y)] = amount_to_move_x_y
dirt_leaving_constraints[x] += amount_to_move_x_y
dirt_filling_constraints[y] += amount_to_move_x_y
objective.SetCoefficient(amount_to_move_x_y, euclidean_distance(x, y))
for x, linear_combination in dirt_leaving_constraints.items():
solver.Add(linear_combination == dist1[x])
for y, linear_combination in dirt_filling_constraints.items():
solver.Add(linear_combination == dist2[y])
status = solver.Solve()
if status not in [solver.OPTIMAL, solver.FEASIBLE]:
raise Exception('Unable to find feasible solution')
return objective.Value()
Discussion: I’ve heard about this metric many times as a way to compare probability distributions. For example, it shows up in an influential paper about fairness in machine learning, and a few other CS theory papers related to distribution testing.
One might ask: why not use other measures of dissimilarity for probability distributions (Chi-squared statistic, Kullback-Leibler divergence, etc.)? One answer is that these other measures only give useful information for pairs of distributions with the same support. An example from a talk of Justin Solomon succinctly clarifies what Earthmover distance achieves
Also, why not just model the samples using, say, a normal distribution, and then compute the distance based on the parameters of the distributions? That is possible, and in fact makes for a potentially more efficient technique, but you lose some information by doing this. Ignoring that your data might not be approximately normal (it might have some curvature), with Earthmover distance, you get point-by-point details about how each data point affects the outcome.
This has the potential to be useful in redistricting because of the nature of the redistricting problem. As I wrote previously, discussions of redistricting are chock-full of geometry—or at least geometric-sounding language—and people are very concerned with the apparent “compactness” of a districting plan. But the underlying data used to perform redistricting isn’t very accurate. The people who build the maps don’t have precise data on voting habits, or even locations where people live. Census tracts might not be perfectly aligned, and data can just plain have errors and uncertainty in other respects. So the data that district-map-drawers care about is uncertain much like our point clouds. With a theory of geometry that accounts for uncertainty (and the Earthmover distance is the “distance” part of that), one can come up with more robust, better tools for redistricting.
Solomon’s website has a ton of resources about this, under the names of “optimal transport” and “Wasserstein metric,” and his work extends from computing distances to computing important geometric values like the barycenter, computational advantages like parallelism.
Others in the field have come up with transparency techniques to make it clearer how the Earthmover distance relates to the geometry of the underlying space. This one is particularly fun because the explanations result in a path traveled from the start to the finish, and by setting up the underlying metric in just such a way, you can watch the distribution navigate a maze to get to its target. I like to imagine tiny ants carrying all that dirt.
Finally, work of Shirdhonkar and Jacobs provide approximation algorithms that allow linear-time computation, instead of the worst-case-cubic runtime of a linear solver.
Problem: You have a catalog of items with discrete ratings (thumbs up/thumbs down, or 5-star ratings, etc.), and you want to display them in the “right” order.
Solution: In Python
'''
score: [int], [int], [float] -> float
Return the expected value of the rating for an item with known
ratings specified by `ratings`, prior belief specified by
`rating_prior`, and a utility function specified by `rating_utility`,
assuming the ratings are a multinomial distribution and the prior
belief is a Dirichlet distribution.
'''
def score(self, ratings, rating_prior, rating_utility):
ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
score = sum(r * u for (r, u) in zip(ratings, rating_utility))
return score / sum(ratings)
Discussion: This deceptively short solution can lead you on a long and winding path into the depths of statistics. I will do my best to give a short, clear version of the story.
As a working example I chose merely because I recently listened to a related podcast, say you’re selling mass-market romance novels—which, by all accounts, is a predictable genre. You have a list of books, each of which has been rated on a scale of 0-5 stars by some number of users. You want to display the top books first, so that time-constrained readers can experience the most titillating novels first, and newbies to the genre can get the best first time experience and be incentivized to buy more.
The setup required to arrive at the above code is the following, which I’ll phrase as a story.
Users’ feelings about a book, and subsequent votes, are independent draws from a known distribution (with unknown parameters). I will just call these distributions “discrete” distributions. So given a book and user, there is some unknown list $ (p_0, p_1, p_2, p_3, p_4, p_5)$ of probabilities ($ \sum_i p_i = 1$) for each possible rating a user could give for that book.
But how do users get these probabilities? In this story, the probabilities are the output of a randomized procedure that generates distributions. That modeling assumption is called a “Dirichlet prior,” with Dirichlet meaning it generates discrete distributions, and prior meaning it encodes domain-specific information (such as the fraction of 4-star ratings for a typical romance novel).
So the story is you have a book, and that book gets a Dirichlet distribution (unknown to us), and then when a user comes along they sample from the Dirichlet distribution to get a discrete distribution, which they then draw from to choose a rating. We observe the ratings, and we need to find the book’s underlying Dirichlet. We start by assigning it some default Dirichlet (the prior) and update that Dirichlet as we observe new ratings. Some other assumptions:
Books are indistinguishable except in the parameters of their Dirichlet distribution.
The parameters of a book’s Dirichlet distribution don’t change over time, and inherently reflect the book’s value.
So a Dirichlet distribution is a process that produces discrete distributions. For simplicity, in this post we will say a Dirichlet distribution is parameterized by a list of six integers $ (n_0, \dots, n_5)$, one for each possible star rating. These values represent our belief in the “typical” distribution of votes for a new book. We’ll discuss more about how to set the values later. Sampling a value (a book’s list of probabilities) from the Dirichlet distribution is not trivial, but we don’t need to do that for this program. Rather, we need to be able to interpret a fixed Dirichlet distribution, and update it given some observed votes.
The interpretation we use for a Dirichlet distribution is its expected value, which, recall, is the parameters of a discrete distribution. In particular if $ n = \sum_i n_i$, then the expected value is a discrete distribution whose probabilities are
So you can think of each integer in the specification of a Dirichlet as “ghost ratings,” sometimes called pseudocounts, and we’re saying the probability is proportional to the count.
This is great, because if we knew the true Dirichlet distribution for a book, we could compute its ranking without a second thought. The ranking would simply be the expected star rating:
def simple_score(distribution):
return sum(i * p for (i, p) in enumerate(distribution))
Putting books with the highest score on top would maximize the expected happiness of a user visiting the site, provided that happiness matches the user’s voting behavior, since the simple_score is just the expected vote.
Also note that all the rating system needs to make this work is that the rating options are linearly ordered. So a thumbs up/down (heaving bosom/flaccid member?) would work, too. We don’t need to know how happy it makes them to see a 5-star vs 4-star book. However, because as we’ll see next we have to approximate the distribution, and hence have uncertainty for scores of books with only a few ratings, it helps to incorporate numerical utility values (we’ll see this at the end).
Next, to update a given Dirichlet distribution with the results of some observed ratings, we have to dig a bit deeper into Bayes rule and the formulas for sampling from a Dirichlet distribution. Rather than do that, I’ll point you to this nice writeup by Jonathan Huang, where the core of the derivation is in Section 2.3 (page 4), and remark that the rule for updating for a new observation is to just add it to the existing counts.
Theorem: Given a Dirichlet distribution with parameters $ (n_1, \dots, n_k)$ and a new observation of outcome $ i$, the updated Dirichlet distribution has parameters $ (n_1, \dots, n_{i-1}, n_i + 1, n_{i+1}, \dots, n_k)$. That is, you just update the $ i$-th entry by adding $ 1$ to it.
This particular arithmetic to do the update is a mathematical consequence (derived in the link above) of the philosophical assumption that Bayes rule is how you should model your beliefs about uncertainty, coupled with the assumption that the Dirichlet process is how the users actually arrive at their votes.
The initial values $ (n_0, \dots, n_5)$ for star ratings should be picked so that they represent the average rating distribution among all prior books, since this is used as the default voting distribution for a new, unknown book. If you have more information about whether a book is likely to be popular, you can use a different prior. For example, if JK Rowling wrote a Harry Potter Romance novel that was part of the canon, you could pretty much guarantee it would be popular, and set $ n_5$ high compared to $ n_0$. Of course, if it were actually popular you could just wait for the good ratings to stream in, so tinkering with these values on a per-book basis might not help much. On the other hand, most books by unknown authors are bad, and $ n_5$ should be close to zero. Selecting a prior dictates how influential ratings of new items are compared to ratings of items with many votes. The more pseudocounts you add to the prior, the less new votes count.
This gets us to the following code for star ratings.
def score(self, ratings, rating_prior):
ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
score = sum(i * u for (i, u) in enumerate(ratings))
return score / sum(ratings)
The only thing missing from the solution at the beginning is the utilities. The utilities are useful for two reasons. First, because books with few ratings encode a lot of uncertainty, having an idea about how extreme a feeling is implied by a specific rating allows one to give better rankings of new books.
Second, for many services, such as taxi rides on Lyft, the default star rating tends to be a 5-star, and 4-star or lower mean something went wrong. For books, 3-4 stars is a default while 5-star means you were very happy.
The utilities parameter allows you to weight rating outcomes appropriately. So if you are in a Lyft-like scenario, you might specify utilities like [-10, -5, -3, -2, 1] to denote that a 4-star rating has the same negative impact as two 5-star ratings would positively contribute. On the other hand, for books the gap between 4-star and 5-star is much less than the gap between 3-star and 4-star. The utilities simply allow you to calibrate how the votes should be valued in comparison to each other, instead of using their literal star counts.
Last time we left off with a tantalizing conjecture: a random graph with edge probability $ p = 5/n$ is almost surely a connected graph. We arrived at that conjecture from some ad-hoc data analysis, so let’s go back and treat it with some more rigorous mathematical techniques. As we do, we’ll discover some very interesting “threshold theorems” that essentially say a random graph will either certainly have a property, or it will certainly not have it.
The phase transition we empirically observed from last time.
Big components
Recalling the basic definition: an Erdős-Rényi (ER) random graph with $ n$ vertices and edge probability $ p$ is a probability distribution over all graphs on $ n$ vertices. Generatively, you draw from an ER distribution by flipping a $ p$-biased coin for each pair of vertices, and adding the edge if you flip heads. We call the random event of drawing a graph from this distribution a “random graph” even though it’s not a graph, and we denote an ER random graph by $ G(n,p)$. When $ p = 1/2$, the distribution $ G(n,1/2)$ is the uniform distribution over all graphs on $ n$ vertices.
Now let’s get to some theorems. The main tools we’ll use are called the first and second moment method. Let’s illustrate them by example.
The first moment method
Say we want to know what values of $ p$ are likely to produce graphs with isolated vertices (vertices with no neighbors), and which are not. Of course, the value of $ p$ will depend on $ n \to \infty$ in general, but we can already see by example that if $ p = 1/2$ then the probability of a fixed vertex being isolated is $ 2^{-n} \to 0$. We can use the union bound (sum this value over all vertices) to show that the probability of any vertex being isolated is at most $ n2^{-n}$ which also tends to zero very quickly. This is not the first moment method, I’m just making the point that all of our results will be interpreted asymptotically as $ n \to \infty$.
So now we can ask: what is the expected number of isolated vertices? If I call $ X$ the random variable that counts the expected number of isolatedvertices, then I’m asking about $ \mathbb{E}[X]$. Really what I’m doing is interpreting $ X$ as a random variable depending on $ n, p(n)$, and asking about the evolution of $ \mathbb{E}[X]$ as $ n \to \infty$.
Now the first moment method states, somewhat obviously, that if the expectation tends to zero then the value of $ X$ itself also tends to zero. Indeed, this follows from Markov’s inequality, which states that the probability that $ X \geq a$ is bounded by $ \mathbb{E}[X]/a$. In symbols,
In our case $ X$ is counting something (it’s integer valued), so asking whether $ X > 0$ is equivalent to asking whether $ X \geq 1$. The upper bound on the probability of $ X$ being strictly positive is then just $ \mathbb{E}[X]$.
So let’s find out when the expected number of isolated vertices goes to zero. We’ll use the wondrous linearity of expectation to split $ X$ into a sum of counts for each vertex. That is, if $ X_i$ is 1 when vertex $ i$ is isolated and 0 otherwise (this is called an indicator variable), then $ X = \sum_{i=1}^n X_i$ and linearity of expectation gives
Now the expectation of an indicator random variable is just the probability that the event occurs (it’s trivial to check). It’s easy to compute the probability that a vertex is isolated: it’s $ (1-p)^n$. So the sum above works out to be $ n(1-p)^n$. It should really be $ n(1-p)^{n-1}$ but the extra factor of $ (1-p)$ doesn’t change anything. The question is what’s the “smallest” way to set $ p$ as a function of $ n$ in order to make the above thing go to zero? Using the fact that $ (1-x) < e^{-x}$ for all $ x > 0$, we get
$ n(1-p)^n < ne^{-pn}$
And setting $ p = (\log n) / n$ simplifies the right hand side to $ ne^{- \log n} = n / n = 1$. This is almost what we want, so let’s set $ p$ to be anything that grows asymptotically faster than $ (\log n) / n$. The notation for this is $ \omega((\log n) / n)$. Then using some slick asymptotic notation we can prove that the RHS of the inequality above goes to zero, and so the LHS must as well. Back to the big picture: we just showed that the expectation of $ X$ (the expected number of isolated vertices) goes to zero, and so by the first moment method the value of $ X$ (the actual number of isolated vertices) has to go to zero with probability tending to 1.
Some quick interpretations: when $ p = (\log n) / n$ each vertex has $ \log n$ neighbors in expectation. Moreover, having no isolated vertices is just a little bit short of the entire graph being connected (our ultimate goal is to figure out exactly when this happens). But already we can see that our conjecture from the beginning is probably false: we aren’t able to use this same method to show that when $ p = c/n$ for some constant $ c$ rules out isolated vertices as $ n \to \infty$. We just got lucky in our data analysis that 5 is about the natural log of 100 (which is 4.6).
The second moment method
Now what about the other side of the coin? If $ p$ is asymptotically less than $ (\log n) / n$ do we necessarily get isolated vertices? That would really put our conjecture to rest. In this case the answer is yes, but it might not be in general. Let’s discuss.
We said that in general if $ \mathbb{E}[X] \to 0$ then the value of $ X$ has to go to zero too (that’s the first moment method). The flip side of this is: if $ \mathbb{E}[X] \to \infty$ does necessarily the value of $ X$ also tend to infinity? The answer is not always yes. Here is a gruesome example I originally heard from a book: say $ X$ is the number of people that will die in the next decade due to an asteroid hitting the earth. The probability that the event happens is quite small, but if it does happen then the number of people that will die is quite large. It is perfectly reasonable for this to drag up the expectation (as the world population grows every decade), but at least we hope a growing population doesn’t by itself increase the value of $ X$.
Mathematics is on our side here. We’re asking under what conditions on $ \mathbb{E}[X]$ does the following implication hold: $ \mathbb{E}[X] \to \infty$ implies $ \Pr[X > 0] \to 1$.
With the first moment method we used Markov’s inequality (a statement about expectation, also called the first moment). With the second moment method we’ll use a statement about the second moment (variances), and the most common is Chebyshev’s inequality. Chebyshev’s inequality states that the probability $ X$ deviates from its expectation by more than $ c$ is bounded by $ \textup{Var}[X] / c^2$. In symbols, for all $ c > 0$ we have
Now the opposite of $ X > 0$, written in terms of deviation from expectation, is $ |X – \mathbb{E}[X]| \geq \mathbb{E}[X]$. In words, in order for any number $ a$ to be zero, it has to have a distance of at least $ b$ from any number $ b$. It’s such a stupidly simple statement it’s almost confusing. So then we’re saying that
In order to make this probability go to zero, it’s enough to have $ \textup{Var}[X] = o(\mathbb{E}[X]^2)$. Again, the little-o means “grows asymptotically slower than.” So the numerator of the fraction on the RHS will grow asymptotically slower than the denominator, meaning the whole fraction tends to zero. This condition and its implication are together called the “second moment method.”
Great! So we just need to compute $ \textup{Var}[X]$ and check what conditions on $ p$ make it fit the theorem. Recall that $ \textup{Var}[X] = \mathbb{E}[X^2] – \mathbb{E}[X]^2$, and we want to upper bound this in terms of $ \mathbb{E}[X]^2$. Let’s compute $ \mathbb{E}[X]^2$ first.
Expanding $ X$ as a sum of indicator variables $ X_i$ for each vertex, we can split the square into a sum over pairs. Note that $ X_i^2 = X_i$ since they are 0-1 valued indicator variables, and $ X_iX_j$ is the indicator variable for both events happening simultaneously.
Now we divide by $ \mathbb{E}[X]^2$ to get $ n^{-1}(1-p)^{-n} + (1-p)^{-3} – 1$. Since we’re trying to see if $ p = (\log n) / n$ is a sharp threshold, the natural choice is to let $ p = o((\log n) / n)$. Indeed, using the $ 1-x < e^{-x}$ upper bound and plugging in the little-o bounds the whole quantity by
So we just showed that the property of having no isolated vertices in a random graph has a sharp threshold at $ p = (\log n) / n$. Meaning at any larger probability the graph is almost surely devoid of isolated vertices, and at any lower probability the graph almost surely has some isolated vertices.
This might seem like a miracle theorem, but there turns out to be similar theorems for lots of properties. Most of them you can also prove using basically the same method we’ve been using here. I’ll list some below. Also note they are all sharp, two-sided thresholds in the same way that the isolated vertex boundary is.
The existence of a component of size $ \omega(\log (n))$ has a threshold of $ 1/n$.
$ p = c/n$ for any $ c > 0$ is a threshold for the existence of a giant component of linear size $ \Theta(n)$. Moreover, above this threshold no other components will have size $ \omega(\log n)$.
In addition to $ (\log n) / n$ being a threshold for having no isolated vertices, it is also a threshold for connectivity.
$ p = (\log n + \log \log n + c(n)) / n$ is a sharp threshold for the existence of Hamiltonian cycles in the following sense: if $ c(n) = \omega(1)$ then there will be a Hamilton cycle almost surely, if $ c(n) \to -\infty$ there will be no Hamiltonian cycle almost surely, and if $ c(n) \to c$ the probability of a Hamiltonian cycle is $ e^{-e^{-c}}$. This was proved by Kolmos and Szemeredi in 1983. Moreover, there is an efficient algorithm to find Hamiltonian cycles in these random graphs when they exist with high probability.
Explosive Percolation
So now we know that as the probability of an edge increases, at some point the graph will spontaneously become connected; at some time that is roughly $ \log(n)$ before, the so-called “giant component” will emerge and quickly engulf the entire graph.
Here’s a different perspective on this situation originally set forth by Achlioptas, D’Souza, and Spencer in 2009. It has since become called an “Achlioptas process.”
The idea is that you are watching a random graph grow. Rather than think about random graphs as having a probability above or below some threshold, you can think of it as the number of edges growing (so the thresholds will all be multiplied by $ n$). Then you can imagine that you start with an empty graph, and at every time step someone is adding a new random edge to your graph. Fine, eventually you’ll get so many edges that a giant component emerges and you can measure when that happens.
But now imagine that instead of being given a single random new edge, you are given a choice. Say God presents you with two random edges, and you must pick which to add to your graph. Obviously you will eventually still get a giant component, but the question is how long can you prevent it from occurring? That is, how far back can we push the threshold for connectedness by cleverly selecting the new edge?
What Achlioptas and company conjectured was that you can push it back (some), but that when you push it back as far as it can go, the threshold becomes discontinuous. That is, they believed there was a constant $ \delta \geq 1/2$ such that the size of the largest component jumps from $ o(n)$ to $ \delta n$ in $ o(n)$ steps.
This turned out to be false, and Riordan and Warnke proved it. Nevertheless, the idea has been interpreted in an interesting light. People have claimed it is a useful model of disaster in the following sense. If you imagine that an edge between two vertices is a “crisis” relating two entities. Then in every step God presents you with two crises and you only have the resources to fix one. The idea is that when the entire graph is connected, you have this one big disaster where all the problems are interacting with each other. The percolation process describes how long you can “survive” while avoiding the big disaster.
There are critiques of this interpretation, though, mainly about how simplistic it is. In particular, an Achlioptas process models a crisis as an exogenous force when in reality problems are usually endogenous. You don’t expect a meteor to hit the Earth, but you do expect humans to have an impact on the environment. Also, not everybody in the network is trying to avoid errors. Some companies thrive in economic downturns by managing your toxic assets, for example. So one could reasonably argue that Achlioptas processes aren’t complex enough to model the realistic types of disasters we face.
Either way, I find it fantastic that something like a random graph (which for decades was securely in pure combinatorics away from applications) is spurring such discussion.
Next time, we’ll take one more dive into the theory of Erdős-Rényi random graphs to prove a very “meta” theorem about sharp thresholds. Then we’ll turn our attention to other models of random graphs, hopefully more realistic ones 🙂