Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters).
For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?
It’s not obvious, and there are multiple factors at work: the red points have fewer samples, but we can be more certain about the position; the blue points are less certain, but the closest non-blue point to a blue point is green; and the green points are equally plausibly “close to red” and “close to blue.” The centers of masses of the three sample sets are close to an equilateral triangle. In our example the “points” don’t overlap, but of course they could. And in particular, there should probably be a nonzero distance between two points whose sample sets have the same center of mass, as below. The distance quantifies the uncertainty.
All this is to say that it’s not obvious how to define a distance measure that is consistent with perceptual ideas of what geometry and distance should be.
Solution (Earthmover distance): Treat each sample set $ A$ corresponding to a “point” as a discrete probability distribution, so that each sample $ x \in A$ has probability mass $ p_x = 1 / |A|$. The distance between $ A$ and $ B$ is the optional solution to the following linear program.
Each $ x \in A$ corresponds to a pile of dirt of height $ p_x$, and each $ y \in B$ corresponds to a hole of depth $ p_y$. The cost of moving a unit of dirt from $ x$ to $ y$ is the Euclidean distance $ d(x, y)$ between the points (or whatever hipster metric you want to use).
Let $ z_{x, y}$ be a real variable corresponding to an amount of dirt to move from $ x \in A$ to $ y \in B$, with cost $ d(x, y)$. Then the constraints are:
- Each $ z_{x, y} \geq 0$, so dirt only moves from $ x$ to $ y$.
- Every pile $ x \in A$ must vanish, i.e. for each fixed $ x \in A$, $ \sum_{y \in B} z_{x,y} = p_x$.
- Likewise, every hole $ y \in B$ must be completely filled, i.e. $ \sum_{y \in B} z_{x,y} = p_y$.
The objective is to minimize the cost of doing this: $ \sum_{x, y \in A \times B} d(x, y) z_{x, y}$.
In python, using the ortools library (and leaving out a few docstrings and standard import statements, full code on Github):
from ortools.linear_solver import pywraplp def earthmover_distance(p1, p2): dist1 = {x: count / len(p1) for (x, count) in Counter(p1).items()} dist2 = {x: count / len(p2) for (x, count) in Counter(p2).items()} solver = pywraplp.Solver('earthmover_distance', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING) variables = dict() # for each pile in dist1, the constraint that says all the dirt must leave this pile dirt_leaving_constraints = defaultdict(lambda: 0) # for each hole in dist2, the constraint that says this hole must be filled dirt_filling_constraints = defaultdict(lambda: 0) # the objective objective = solver.Objective() objective.SetMinimization() for (x, dirt_at_x) in dist1.items(): for (y, capacity_of_y) in dist2.items(): amount_to_move_x_y = solver.NumVar(0, solver.infinity(), 'z_{%s, %s}' % (x, y)) variables[(x, y)] = amount_to_move_x_y dirt_leaving_constraints[x] += amount_to_move_x_y dirt_filling_constraints[y] += amount_to_move_x_y objective.SetCoefficient(amount_to_move_x_y, euclidean_distance(x, y)) for x, linear_combination in dirt_leaving_constraints.items(): solver.Add(linear_combination == dist1[x]) for y, linear_combination in dirt_filling_constraints.items(): solver.Add(linear_combination == dist2[y]) status = solver.Solve() if status not in [solver.OPTIMAL, solver.FEASIBLE]: raise Exception('Unable to find feasible solution') return objective.Value()
Discussion: I’ve heard about this metric many times as a way to compare probability distributions. For example, it shows up in an influential paper about fairness in machine learning, and a few other CS theory papers related to distribution testing.
One might ask: why not use other measures of dissimilarity for probability distributions (Chi-squared statistic, Kullback-Leibler divergence, etc.)? One answer is that these other measures only give useful information for pairs of distributions with the same support. An example from a talk of Justin Solomon succinctly clarifies what Earthmover distance achieves
Also, why not just model the samples using, say, a normal distribution, and then compute the distance based on the parameters of the distributions? That is possible, and in fact makes for a potentially more efficient technique, but you lose some information by doing this. Ignoring that your data might not be approximately normal (it might have some curvature), with Earthmover distance, you get point-by-point details about how each data point affects the outcome.
This kind of attention to detail can be very important in certain situations. One that I’ve been paying close attention to recently is the problem of studying gerrymandering from a mathematical perspective. Justin Solomon of MIT is a champion of the Earthmover distance (see his fascinating talk here for more, with slides) which is just one topic in a field called “optimal transport.”
This has the potential to be useful in redistricting because of the nature of the redistricting problem. As I wrote previously, discussions of redistricting are chock-full of geometry—or at least geometric-sounding language—and people are very concerned with the apparent “compactness” of a districting plan. But the underlying data used to perform redistricting isn’t very accurate. The people who build the maps don’t have precise data on voting habits, or even locations where people live. Census tracts might not be perfectly aligned, and data can just plain have errors and uncertainty in other respects. So the data that district-map-drawers care about is uncertain much like our point clouds. With a theory of geometry that accounts for uncertainty (and the Earthmover distance is the “distance” part of that), one can come up with more robust, better tools for redistricting.
Solomon’s website has a ton of resources about this, under the names of “optimal transport” and “Wasserstein metric,” and his work extends from computing distances to computing important geometric values like the barycenter, computational advantages like parallelism.
Others in the field have come up with transparency techniques to make it clearer how the Earthmover distance relates to the geometry of the underlying space. This one is particularly fun because the explanations result in a path traveled from the start to the finish, and by setting up the underlying metric in just such a way, you can watch the distribution navigate a maze to get to its target. I like to imagine tiny ants carrying all that dirt.
Finally, work of Shirdhonkar and Jacobs provide approximation algorithms that allow linear-time computation, instead of the worst-case-cubic runtime of a linear solver.
Thanks for the great post, is there a typo in the third constraint on $z_{xy}$? every hole $y \in B$ must be completely filled, i.e. \sum_{y \in B} z_{x,y} = p_x, should it by p_y at the end?
Yes! Thanks for pointing that out, just fixed it.
> Likewise, every hole $y \in B$ must be completely filled
The $y \ and B$ are showing up in escaped form on two browsers here. Everything else is getting formatted.
Thanks! Fixed both
hi. non-math comment/question. the rendering of the math is pretty bad on my linux laptop (with both webkit-based and firefox browsers), but fine on my ipad. mathjax normally renders well on my laptop. i’m curious how you produce your html? no mathjax? cheers.
It’s wordpress’s shitty support for math 🙁 they render serverside and display inline images, which some browsers will put in displaystyle
I’ve been considering migrating to a better platform, but I have resolved to finish writing my book before I do that.
> The distance between S and B is the optional solution
This should be A and B, right?
I would expect an optimal solution to this kind of problem to have some property about encoding of positions. If I apply a function f(x,y) to every point then how is the change in the distance? f is some perturbation, translation, rotation, change of scale, symmetry.
What are the properties of the earth-mover distance with respect to a family of perturbation of the data?
A good distance should be stable with respect to noise with small variance. If you have prior knoledge of the distribution of noise, then the optimal distance seems to have the property that the variance of the distance between the perturbed points is minimal, that is the variance of X = sum(d((xi+nxi,yi+nyi),(xj+nxj,yj+nyj)) is minimal where nxi = noise asociate with xi, for example nxi = random_normal(0,sigma) if the noise doesn’t depend of the coordinates of pointes.
I agree with your sentiment. For rigid transformations of the data it seems clear to me that the distance is unchanged. Scaling all the underlying point distances by the same constant
should scale the distance by
(with discrepancies depending on whether points overlap?)
I don’t any theorems about noise models off-hand, but I would be surprised if small noise caused a big change in the earthmover distance. I tried to look for a few theorems in the literature about this, but the problem is “stability” and “perturbation” result in wildly different measure-theoretic convergence theorems that I think are unrelated to this question, and probably mask more relevant work on this.
Are you going to write further articles about the earthmover’s distance or some other thing on-topic? Because this seems too interesting to leave it at one article.
This post was a break from book writing for me, so I don’t want to promise anything until after I finish my book, hopefully this summer
Noise related: fast and robust earth-movers distance, http://leibniz.cs.huji.ac.il/tr/1143.pdf, they use saturated distance which provide robust property against noise,
d_t (a,b) = min(d(a,b),t), they prove is a distance and they give a fast algorithm with applications.
Thank you for an interesting read. The EM distance was all the rage, about a year ago, in the context of training Generative Adverserial Networks – GANs (google for Wasserstein GAN or WGAN).
EM was used in training GANs for the same reason mentioned here, dealing with distributions that (might) have non-overlapping supports. Interestingly, EM was not calculated directly, but rather via it’s dual problem. I am not an expert on the mathematics, but those interested may find a through and well-written discussion here –
https://vincentherrmann.github.io/blog/wasserstein/
Isn’t a more succinct description of the Earth Mover’s Distance the absolute distance between two cumulative distribution functions?
A cumulative distribution function implies there’s an ordering of the probability space. This is not true of distributions embedded in, say, the plane.
Nice post, I have a question – is there any constraint that the distributions will have the same length? In most of the available implementation I see this is indeed the requirement , which theoretically is not clear to me. Specifically, i’m interested in computing the emd distance between 2 histograms, which may not have the same length (i.e., may not have the same number of bins).
This method should work without the two distributions having the same support. If you have two histograms D, F, then the amount of dirt corresponds to the size of each bin, and the distance between bins is the distance between whatever the x-axis is for your bins (I imagine they are numbers, so it would be something like |x-y|, where x and y are perhaps the midpoint of the bin’s upper and lower range. This may be more complicated if your bins have different sizes).
Thank you very much for this interesting post ! I would like to implement the earth-mover distance in a 3D case, is it possible to use the function you developed in this framework ?