Featured Posts

Making Hybrid Images Neural Networks
and Backpropagation
Elliptic Curves
and Cryptography
bezier-dog-image-small circle-wedge-sphere baby_programmers
Bezier Curves and Picasso Computing Homology Probably Approximately
Correct – A Formal Theory
of Learning

Earthmover Distance

Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters).

For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?


It’s not obvious, and there are multiple factors at work: the red points have fewer samples, but we can be more certain about the position; the blue points are less certain, but the closest non-blue point to a blue point is green; and the green points are equally plausibly “close to red” and “close to blue.” The centers of masses of the three sample sets are close to an equilateral triangle. In our example the “points” don’t overlap, but of course they could. And in particular, there should probably be a nonzero distance between two points whose sample sets have the same center of mass, as below. The distance quantifies the uncertainty.


All this is to say that it’s not obvious how to define a distance measure that is consistent with perceptual ideas of what geometry and distance should be.

Solution (Earthmover distance): Treat each sample set A corresponding to a “point” as a discrete probability distribution, so that each sample x \ in A has probability mass p_x = 1 / |A|. The distance between A and B is the optional solution to the following linear program.

Each x \in A corresponds to a pile of dirt of height p_x, and each y \in B corresponds to a hole of depth p_y. The cost of moving a unit of dirt from x to y is the Euclidean distance d(x, y) between the points (or whatever hipster metric you want to use).

Let z_{x, y} be a real variable corresponding to an amount of dirt to move from x \in A to y \in B, with cost d(x, y). Then the constraints are:

  • Each z_{x, y} \geq 0, so dirt only moves from x to y.
  • Every pile x \in A must vanish, i.e. for each fixed x \in A, \sum_{y \in B} z_{x,y} = p_x.
  • Likewise, every hole y \in B must be completely filled, i.e. \sum_{y \in B} z_{x,y} = p_y.

The objective is to minimize the cost of doing this: \sum_{x, y \in A \times B} d(x, y) z_{x, y}.

In python, using the ortools library (and leaving out a few docstrings and standard import statements, full code on Github):

from ortools.linear_solver import pywraplp

def earthmover_distance(p1, p2):
    dist1 = {x: count / len(p1) for (x, count) in Counter(p1).items()}
    dist2 = {x: count / len(p2) for (x, count) in Counter(p2).items()}
    solver = pywraplp.Solver('earthmover_distance', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

    variables = dict()

    # for each pile in dist1, the constraint that says all the dirt must leave this pile
    dirt_leaving_constraints = defaultdict(lambda: 0)

    # for each hole in dist2, the constraint that says this hole must be filled
    dirt_filling_constraints = defaultdict(lambda: 0)

    # the objective
    objective = solver.Objective()

    for (x, dirt_at_x) in dist1.items():
        for (y, capacity_of_y) in dist2.items():
            amount_to_move_x_y = solver.NumVar(0, solver.infinity(), 'z_{%s, %s}' % (x, y))
            variables[(x, y)] = amount_to_move_x_y
            dirt_leaving_constraints[x] += amount_to_move_x_y
            dirt_filling_constraints[y] += amount_to_move_x_y
            objective.SetCoefficient(amount_to_move_x_y, euclidean_distance(x, y))

    for x, linear_combination in dirt_leaving_constraints.items():
        solver.Add(linear_combination == dist1[x])

    for y, linear_combination in dirt_filling_constraints.items():
        solver.Add(linear_combination == dist2[y])

    status = solver.Solve()
    if status not in [solver.OPTIMAL, solver.FEASIBLE]:
        raise Exception('Unable to find feasible solution')

    return objective.Value()

Discussion: I’ve heard about this metric many times as a way to compare probability distributions. For example, it shows up in an influential paper about fairness in machine learning, and a few other CS theory papers related to distribution testing.

One might ask: why not use other measures of dissimilarity for probability distributions (Chi-squared statistic, Kullback-Leibler divergence, etc.)? One answer is that these other measures only give useful information for pairs of distributions with the same support. An example from a talk of Justin Solomon succinctly clarifies what Earthmover distance achieves

Screen Shot 2018-03-03 at 6.11.00 PM.png

Also, why not just model the samples using, say, a normal distribution, and then compute the distance based on the parameters of the distributions? That is possible, and in fact makes for a potentially more efficient technique, but you lose some information by doing this. Ignoring that your data might not be approximately normal (it might have some curvature), with Earthmover distance, you get point-by-point details about how each data point affects the outcome.

This kind of attention to detail can be very important in certain situations. One that I’ve been paying close attention to recently is the problem of studying gerrymandering from a mathematical perspective. Justin Solomon of MIT is a champion of the Earthmover distance (see his fascinating talk here for more, with slides) which is just one topic in a field called “optimal transport.”

This has the potential to be useful in redistricting because of the nature of the redistricting problem. As I wrote previously, discussions of redistricting are chock-full of geometry—or at least geometric-sounding language—and people are very concerned with the apparent “compactness” of a districting plan. But the underlying data used to perform redistricting isn’t very accurate. The people who build the maps don’t have precise data on voting habits, or even locations where people live. Census tracts might not be perfectly aligned, and data can just plain have errors and uncertainty in other respects. So the data that district-map-drawers care about is uncertain much like our point clouds. With a theory of geometry that accounts for uncertainty (and the Earthmover distance is the “distance” part of that), one can come up with more robust, better tools for redistricting.

Solomon’s website has a ton of resources about this, under the names of “optimal transport” and “Wasserstein metric,” and his work extends from computing distances to computing important geometric values like the barycenter, computational advantages like parallelism.

Others in the field have come up with transparency techniques to make it clearer how the Earthmover distance relates to the geometry of the underlying space. This one is particularly fun because the explanations result in a path traveled from the start to the finish, and by setting up the underlying metric in just such a way, you can watch the distribution navigate a maze to get to its target. I like to imagine tiny ants carrying all that dirt.

Screen Shot 2018-03-03 at 6.15.50 PM.png

Finally, work of Shirdhonkar and Jacobs provide approximation algorithms that allow linear-time computation, instead of the worst-case-cubic runtime of a linear solver.

NP-hard does not mean hard

When NP-hardness pops up on the internet, say because some silly blogger wants to write about video games, it’s often tempting to conclude that the problem being proved NP-hard is actually very hard!

“Scientists proved Super Mario is NP-hard? I always knew there was a reason I wasn’t very good at it!” Sorry, these two are unrelated. NP-hardness means hard in a narrow sense this post should hopefully make clear. After that, we’ll explore what “hard” means in a mathematical sense that you can apply beyond NP-hardness to inform your work as a programmer.

When a problem is NP-hard, that simply means that the problem is sufficiently expressive that you can use the problem to express logic. By which I mean boolean formulas using AND, OR, and NOT. In the Super Mario example, the “problem” is a bundle of (1) the controls for the player (2) the allowed tiles and characters that make up a level, and (3) the goal of getting from the start to the end. Logic formulas are encoded in the creation of a level, and solving the problem (completing the level) is the same as finding conditions to make the logical formula true.


The clause gadget for the original Super Mario Brothers, encoding an OR of three variables.

In this sense, NP-hardness doesn’t make all of Super Mario hard. The levels designed to encode logical formulas are contrived, convoluted, and contorted. They abuse the rules of the game in order to cram boolean logic into it. These are worst case levels. It’s using Mario for a completely unintended purpose, not unlike hacking. And so NP-hardness is a worst case claim.

To reiterate, NP-hardness means that Super Mario has expressive power. So expressive that it can emulate other problems we believe are hard in the worst case. And, because the goal of mathematical “hardness” is to reason about the limitations of algorithms, being able to solve Super Mario in full generality implies you can solve any hard subproblem, no matter how ridiculous the level design.

The P != NP conjecture says that there’s no polynomial time algorithm to determine whether boolean logic formulas are satisfiable, and so as a consequence Super Mario (in full generality) also has no polynomial time algorithm.

That being said, in reality Super Mario levels do not encode logical formulas! If you use the knowledge that real-world Super Mario levels are designed in the way they are (to be solvable, fun), then you can solve Super Mario with algorithms. There are many examples.

In general, the difficulty of a problem for humans is unrelated to the difficulty for algorithms. Consider multiplication of integers. This is a trivial problem for computers to solve, but humans tend to struggle with it. It’s an amazing feat to be able to multiply two 7 digit numbers in less than 5 seconds, whereas computers can multiply two thousand-digit numbers in milliseconds.

Meanwhile, protein folding is known to be an NP-hard problem, but it’s been turned into a game sufficiently easy for humans to solve that players have contributed to scientific research. Indeed, even some of the most typically cited NP-hard problems, like traveling salesman, have heuristic, practical algorithmic solutions that allow one to solve them (very close to optimally) in hours on inputs as large as every city on earth.

So the mathematical notions of hardness are quite disconnected from practical notions of hardness. This is not even to mention that some NP-hard problems can be efficiently approximated to within any desired accuracy.

Let’s dig into the math a bit more. “Hardness” is a family of ideas about comparisons between problems based on reusability of algorithmic solutions. Loosely speaking, a problem R is hard with respect to a class of problems C if an algorithm solving R can be easily transformed into an algorithm solving any problem in C. You have to say what kinds of transformations are allowed, and the transformation can be different for different target problems in C, but that’s the basic idea.

In the Super Mario example, if you want to solve logical formulas, you can transform a hypothetically perfect mario-level-playing algorithm into a logic solver by encoding the formula as a level and running the mario-level-playing algorithm on it as a black box. Add an if statement to the end to translate “level can/can’t be finished” to “formula can/can’t be satisfied,” and the transformation is complete. It’s important for NP-hardness that the transformation only takes polynomial time. Other kinds of hardness might admit more or restrict to fewer resources.

And so this is what makes Mario NP-hard, because boolean logic satisfiability is NP-hard. Any problem in NP can be solved by a boolean logic solver, and hence also by a mario-level-player. The fact that boolean logic solving is NP-hard is a difficult theorem to prove. But if we assume it’s true, you can compose the transformations to get from any NP problem to Super Mario.

As a simple example of a different kind of hardness, you can let C be the class of problems solvable using only a finite amount of memory (independent of the input). You have probably heard of this class of problems by another name, but I’ll keep you guessing until the end of the post. A C-hard problem R is one for which an algorithmic solution can be repurposed to solve any finite-memory-solvable problem.

We have to be careful: if the transformation between solutions allows us polynomial time (in the size of the input) like it did for NP-hardness, then we might have enough time in the transformation alone to solve the entire problem, removing the need for a solution to R in the first place! For this reason, we have to limit the amount of work that can be done in the transformation. We get a choice here that influences how interesting or useful the definition of hardness is, but let’s just pick one and say that the transformation can only use finite time (independent of the input).

To be fair, I actually don’t know if there are any hard problems with respect to this definition. There probably are, but chances are good that they are not members of C, and that’s where the definition of hardness gets really interesting. If you have a problem in C which is also C-hard, it’s called complete for C. And once you’ve found a complete problem, from a theoretical perspective you’re a winner. You’ve found a problem which epitomizes the difficulty of solving problems in C. And so it’s a central aim of researchers studying a complexity class to find complete problems. As they say in the business, “ABC: always be completing.”

As a more concrete and interesting example, the class P of all polynomial-time solvable problems has a complete problem. Here the transformations are a bit up in the air. They could either be logarithmic-space computations, or what’s called NC, which can be thought of as poly-logarithmic time (very fast) parallel computations. I only mention NC because it allows you to say “P-complete problems are hard to parallelize.”

Regardless of the choice, there are a number of very useful problems known to be P-complete. The first is the Circuit Value Problem, given a circuit (described by its gates and wires using any reasonable encoding) and an input to the circuit, what is the output?

Others include linear programming (optimize this linear function with respect to linear constraints), data compression (does the compressed version of a string s using Lempel–Ziv–Welch contain a string t?), and type inference for partial types. There are many more in this compendium of Greenlaw et al. Each one is expressive enough to encode any instance of the other, and any instance of any problem in P. It’s quite curious to think that gzip can solve linear programs, but that’s surely no curiouser than super mario levels encoding boolean logic.

Just as with NP-hardness, when a problem is P-hard that doesn’t automatically mean it’s easy or hard for humans, or that typical instances can’t be easily parallelized. P-hardness is also a worst case guarantee.

Studying P-completeness is helpful in the same way NP-completeness is helpful. Completeness informs you about whether you should hope to find a perfect solution or be content with approximations and heuristics (or incorporate problem context to make it easier). Knowing a problem is P-complete means you should not expect perfect efficient parallel algorithms, or perfect efficient algorithms that use severely limited space. Knowing a problem is NP-hard means you should not expect a perfect polynomial time solution. In other words, if you are forced to work with those restrictions, the game becomes one of tradeoffs. Hardness and completeness focus and expedite your work, and clarify a principled decision making process.

Until next time!

P.S. The class of problems solvable in a finite amount of memory is just the class of regular languages. The “finite memory” is the finite state machine used to solve them.

Binary Search on Graphs

Binary search is one of the most basic algorithms I know. Given a sorted list of comparable items and a target item being sought, binary search looks at the middle of the list, and compares it to the target. If the target is larger, we repeat on the smaller half of the list, and vice versa.

With each comparison the binary search algorithm cuts the search space in half. The result is a guarantee of no more than \log(n) comparisons, for a total runtime of O(\log n). Neat, efficient, useful.

There’s always another angle.

What if we tried to do binary search on a graph? Most graph search algorithms, like breadth- or depth-first search, take linear time, and they were invented by some pretty smart cookies. So if binary search on a graph is going to make any sense, it’ll have to use more information beyond what a normal search algorithm has access to.

For binary search on a list, it’s the fact that the list is sorted, and we can compare against the sought item to guide our search. But really, the key piece of information isn’t related to the comparability of the items. It’s that we can eliminate half of the search space at every step. The “compare against the target” step can be thought of a black box that replies to queries of the form, “Is this the thing I’m looking for?” with responses of the form, “Yes,” or, “No, but look over here instead.”


As long as the answers to your queries are sufficiently helpful, meaning they allow you to cut out large portions of your search space at each step, then you probably have a good algorithm on your hands. Indeed, there’s a natural model for graphs, defined in a 2015 paper of Emamjomeh-Zadeh, Kempe, and Singhal that goes as follows.

You’re given as input an undirected, weighted graph G = (V,E), with weights w_e for e \in E. You can see the entire graph, and you may ask questions of the form, “Is vertex v the target?” Responses will be one of two things:

  • Yes (you win!)
  • No, but e = (v, w) is an edge out of v on a shortest path from v to the true target.

Your goal is to find the target vertex with the minimum number of queries.

Obviously this only works if G is connected, but slight variations of everything in this post work for disconnected graphs. (The same is not true in general for directed graphs)

When the graph is a line, this “reduces” to binary search in the sense that the same basic idea of binary search works: start in the middle of the graph, and the edge you get in response to a query will tell you in which half of the graph to continue.


And if we make this example only slightly more complicated, the generalization should become obvious:


Here, we again start at the “center vertex,” and the response to our query will eliminate one of the two halves. But then how should we pick the next vertex, now that we no longer have a linear order to rely on? It should be clear, choose the “center vertex” of whichever half we end up in. This choice can be formalized into a rule that works even when there’s not such obvious symmetry, and it turns out to always be the right choice.

Definition: median of a weighted graph G with respect to a subset of vertices S \subset V is a vertex v \in V (not necessarily in S) which minimizes the sum of distances to vertices in S. More formally, it minimizes

\Phi_S(v) = \sum_{u \in S} d(v, u),

where d(u,v) is the sum of the edge weights along a shortest path from v to u.

And so generalizing binary search to this query-model on a graph results in the following algorithm, which whittles down the search space by querying the median at every step.

Algorithm: Binary search on graphs. Input is a graph G = (V,E).

  • Start with a set of candidates S = V.
  • While we haven’t found the target and |S| > 1:
    • Query the median v of S, and stop if you’ve found the target.
    • Otherwise, let e = (v, w) be the response edge, and compute the set of all vertices x \in V for which e is on a shortest path from v to x. Call this set T.
    • Replace S with S \cap T.
  • Output the only remaining vertex in S

Indeed, as we’ll see momentarily, a python implementation is about as simple. The meat of the work is in computing the median and the set T, both of which are slight variants of Dijkstra’s algorithm for computing shortest paths.

The theorem, which is straightforward and well written by Emamjomeh-Zadeh et al. (only about a half page on page 5), is that this algorithm requires only O(\log(n)) queries, just like binary search.

Before we dive into an implementation, there’s a catch. Even though we are guaranteed only \log(n) many queries, because of our Dijkstra’s algorithm implementation, we’re definitely not going to get a logarithmic time algorithm. So in what situation would this be useful?

Here’s where we use the “theory” trick of making up a fanciful problem and only later finding applications for it (which, honestly, has been quite successful in computer science). In this scenario we’re treating the query mechanism as a black box. It’s natural to imagine that the queries are expensive, and a resource we want to optimize for. As an example the authors bring up in a followup paper, the graph might be the set of clusterings of a dataset, and the query involves a human looking at the data and responding that a cluster should be split, or that two clusters should be joined. Of course, for clustering the underlying graph is too large to process, so the median-finding algorithm needs to be implicit. But the essential point is clear: sometimes the query is the most expensive part of the algorithm.

Alright, now let’s implement it! The complete code is on Github as always.

Always be implementing

We start with a slight variation of Dijkstra’s algorithm. Here we’re given as input a single “starting” vertex, and we produce as output a list of all shortest paths from the start to all possible destination vertices.

We start with a bare-bones graph data structure.

from collections import defaultdict
from collections import namedtuple

Edge = namedtuple('Edge', ('source', 'target', 'weight'))

class Graph:
    # A bare-bones implementation of a weighted, undirected graph
    def __init__(self, vertices, edges=tuple()):
        self.vertices = vertices
        self.incident_edges = defaultdict(list)

        for edge in edges:
                1 if len(edge) == 2 else edge[2]  # optional weight

    def add_edge(self, u, v, weight=1):
        self.incident_edges[u].append(Edge(u, v, weight))
        self.incident_edges[v].append(Edge(v, u, weight))

    def edge(self, u, v):
        return [e for e in self.incident_edges[u] if e.target == v][0]

And then, since most of the work in Dijkstra’s algorithm is tracking information that you build up as you search the graph, we define the “output” data structure, a dictionary of edge weights paired with back-pointers for the discovered shortest paths.

class DijkstraOutput:
    def __init__(self, graph, start):
        self.start = start
        self.graph = graph

        # the smallest distance from the start to the destination v
        self.distance_from_start = {v: math.inf for v in graph.vertices}
        self.distance_from_start[start] = 0

        # a list of predecessor edges for each destination
        # to track a list of possibly many shortest paths
        self.predecessor_edges = {v: [] for v in graph.vertices}

    def found_shorter_path(self, vertex, edge, new_distance):
        # update the solution with a newly found shorter path
        self.distance_from_start[vertex] = new_distance

        if new_distance < self.distance_from_start[vertex]:
            self.predecessor_edges[vertex] = [edge]
        else:  # tie for multiple shortest paths

    def path_to_destination_contains_edge(self, destination, edge):
        predecessors = self.predecessor_edges[destination]
        if edge in predecessors:
            return True
        return any(self.path_to_destination_contains_edge(e.source, edge)
                   for e in predecessors)

    def sum_of_distances(self, subset=None):
        subset = subset or self.graph.vertices
        return sum(self.distance_from_start[x] for x in subset)

The actual Dijkstra algorithm then just does a “breadth-first” (priority-queue-guided) search through G, updating the metadata as it finds shorter paths.

def single_source_shortest_paths(graph, start):
    Compute the shortest paths and distances from the start vertex to all
    possible destination vertices. Return an instance of DijkstraOutput.
    output = DijkstraOutput(graph, start)
    visit_queue = [(0, start)]

    while len(visit_queue) > 0:
        priority, current = heapq.heappop(visit_queue)

        for incident_edge in graph.incident_edges[current]:
            v = incident_edge.target
            weight = incident_edge.weight
            distance_from_current = output.distance_from_start[current] + weight

            if distance_from_current <= output.distance_from_start[v]:
                output.found_shorter_path(v, incident_edge, distance_from_current)
                heapq.heappush(visit_queue, (distance_from_current, v))

    return output

Finally, we implement the median-finding and T-computing subroutines:

def possible_targets(graph, start, edge):
    Given an undirected graph G = (V,E), an input vertex v in V, and an edge e
    incident to v, compute the set of vertices w such that e is on a shortest path from
    v to w.
    dijkstra_output = dijkstra.single_source_shortest_paths(graph, start)
    return set(v for v in graph.vertices
               if dijkstra_output.path_to_destination_contains_edge(v, edge))

def find_median(graph, vertices):
    Compute as output a vertex in the input graph which minimizes the sum of distances
    to the input set of vertices
    best_dijkstra_run = min(
         (single_source_shortest_paths(graph, v) for v in graph.vertices),
         key=lambda run: run.sum_of_distances(vertices)
    return best_dijkstra_run.start

And then the core algorithm

QueryResult = namedtuple('QueryResult', ('found_target', 'feedback_edge'))

def binary_search(graph, query):
    Find a target node in a graph, with queries of the form "Is x the target?"
    and responses either "You found the target!" or "Here is an edge on a shortest
    path to the target."
    candidate_nodes = set(x for x in graph.vertices)  # copy

    while len(candidate_nodes) > 1:
        median = find_median(graph, candidate_nodes)
        query_result = query(median)

        if query_result.found_target:
            return median
            edge = query_result.feedback_edge
            legal_targets = possible_targets(graph, median, edge)
            candidate_nodes = candidate_nodes.intersection(legal_targets)

    return candidate_nodes.pop()

Here’s an example of running it on the example graph we used earlier in the post:

Graph looks like this tree, with uniform weights

    a       k
     b     j
     d     l
    e       m
G = Graph(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
           'j', 'k', 'l', 'm'],
               ('a', 'b'),
               ('b', 'c'),
               ('c', 'd'),
               ('d', 'e'),
               ('c', 'f'),
               ('f', 'g'),
               ('g', 'h'),
               ('h', 'i'),
               ('i', 'j'),
               ('j', 'k'),
               ('i', 'l'),
               ('l', 'm'),

def simple_query(v):
    ans = input("is '%s' the target? [y/N] " % v)
    if ans and ans.lower()[0] == 'y':
        return QueryResult(True, None)
        print("Please input a vertex on the shortest path between"
              " '%s' and the target. The graph is: " % v)
        for w in G.incident_edges:
            print("%s: %s" % (w, G.incident_edges[w]))

        target = None
        while target not in G.vertices:
            target = input("Input neighboring vertex of '%s': " % v)

    return QueryResult(
        G.edge(v, target)

output = binary_search(G, simple_query)
print("Found target: %s" % output)

The query function just prints out a reminder of the graph and asks the user to answer the query with a yes/no and a relevant edge if the answer is no.

An example run:

is 'g' the target? [y/N] n
Please input a vertex on the shortest path between 'g' and the target. The graph is:
e: [Edge(source='e', target='d', weight=1)]
i: [Edge(source='i', target='h', weight=1), Edge(source='i', target='j', weight=1), Edge(source='i', target='l', weight=1)]
g: [Edge(source='g', target='f', weight=1), Edge(source='g', target='h', weight=1)]
l: [Edge(source='l', target='i', weight=1), Edge(source='l', target='m', weight=1)]
k: [Edge(source='k', target='j', weight=1)]
j: [Edge(source='j', target='i', weight=1), Edge(source='j', target='k', weight=1)]
c: [Edge(source='c', target='b', weight=1), Edge(source='c', target='d', weight=1), Edge(source='c', target='f', weight=1)]
f: [Edge(source='f', target='c', weight=1), Edge(source='f', target='g', weight=1)]
m: [Edge(source='m', target='l', weight=1)]
d: [Edge(source='d', target='c', weight=1), Edge(source='d', target='e', weight=1)]
h: [Edge(source='h', target='g', weight=1), Edge(source='h', target='i', weight=1)]
b: [Edge(source='b', target='a', weight=1), Edge(source='b', target='c', weight=1)]
a: [Edge(source='a', target='b', weight=1)]
Input neighboring vertex of 'g': f
is 'c' the target? [y/N] n
Please input a vertex on the shortest path between 'c' and the target. The graph is:
Input neighboring vertex of 'c': d
is 'd' the target? [y/N] n
Please input a vertex on the shortest path between 'd' and the target. The graph is:
Input neighboring vertex of 'd': e
Found target: e

A likely story

The binary search we implemented in this post is pretty minimal. In fact, the more interesting part of the work of Emamjomeh-Zadeh et al. is the part where the response to the query can be wrong with some unknown probability.

In this case, there can be many shortest paths that are valid responses to a query, in addition to all the invalid responses. In particular, this rules out the strategy of asking the same query multiple times and taking the majority response. If the error rate is 1/3, and there are two shortest paths to the target, you can get into a situation in which you see three responses equally often and can’t choose which one is the liar.

Instead, the technique Emamjomeh-Zadeh et al. use is based on the Multiplicative Weights Update Algorithm (it strikes again!). Each query gives a multiplicative increase (or decrease) on the set of nodes that are consistent targets under the assumption that query response is correct. There are a few extra details and some postprocessing to avoid unlikely outcomes, but that’s the basic idea. Implementing it would be an excellent exercise for readers interested in diving deeper into a recent research paper (or to flex their math muscles).

But even deeper, this model of “query and get advice on how to improve” is a classic  learning model first formally studied by Dana Angluin (my academic grand-advisor). In her model, one wants to design an algorithm to learn a classifier. The allowed queries are membership and equivalence queries. A membership is essentially, “What’s its label of this element?” and an equivalence query has the form, “Is this the right classifier?” If the answer is no, a mislabeled example is provided.

This is different from the usual machine learning assumption, because the learning algorithm gets to construct an example it wants to get more information about, instead of simply relying on a randomly generated subset of data. The goal is to minimize the number of queries before the target hypothesis is learned exactly. And indeed, as we saw in this post, if you have a little extra time to analyze the problem space, you can craft queries that extract quite a lot of information.

Indeed, the model we presented here for binary search on graphs is the natural analogue of an equivalence query for a search problem: instead of a mislabeled counterexample, you get a nudge in the right direction toward the target. Pretty neat!

There are a few directions we could take from here: (1) implement the Multiplicative Weights version of the algorithm, (2) apply this technique to a problem like ranking or clustering, or (3) cover theoretical learning models like membership and equivalence queries in more detail. What interests you?

Until next time!

Linear Programming and Healthy Diets — Part 2

Previously in this series:

Foods of the Father

My dad’s an interesting guy.

Every so often he picks up a health trend and/or weight loss goal that would make many people’s jaw drop. For example, we once went on a 5-day, 50-mile backpacking trip in the Grand Tetons, and my dad brought one of these per day for dinner, and had vitamin tablets for the rest of his sustenance. The rest of us planned for around 3,000 calories per day. He’s tried the “high fat” and “no fat” diets, and quite a few others. He’s concerned with losing weight, but also living longer, so he’s into caloric restriction among other things.

Recently he asked me to help him optimize his diet. He described a scheme he was performing by hand to minimize the number of calories he consumed per day while maintaining the minimum nutrients required by the FDA’s recommendations. He had a spreadsheet with the nutrients for each food, and a spreadsheet with the constraints on each nutrient. He wanted to come up with a collection of meals (or just throw all the food into a blender) that taste within reason but meet these criteria.

He was essentially solving a linear program by hand, roughly as best as one can, with a few hundred variables! After asking me whether there was “any kind of math” that could help him automate his laborious efforts, I decided to lend a hand. After all, it’s been over three years since I promised my readers I’d apply linear programming to optimize a diet (though it was optimizing for cost rather than calories).

Though it never went beyond what linear programming can handle, pretty quickly my dad’s requests specialized beyond what would interest a general reader. Perhaps this is the nature of math consulting, but it seems when you give someone what they want, they realize that’s not what they wanted.

But the basic ideas are still relevant enough. My solution is a hundred-ish lines of python to set up the input, using Google’s open source operations research tools as the core solver. Disclaimer: I work for Google but I don’t work on the team that wrote this tool. Also nothing in this post represents the views of my employer. It’s just me, and the scale of this problem is laughable for Google to care about anyway.

So this post is half tutorial showing how to use the or-tools python wrapper (it’s only somewhat documented), and half showing a realistic use case for linear programming.

However, don’t let this post dissuade you from the belief that linear programming is useful beyond dieting. People use linear programming to solve all kinds of interesting problems. Here are a few I came across in just the last few weeks:

And that’s not even to mention the ubiquitous applications in operations research (network flow, production optimization, economics) that every large company relies on. The applications seem endless!

As usual, all of the code and data we use in the making of this post is available at this blog’s Github page.

Update 2018-01-01: With this code my dad had tried a few inadvisable cooking techniques: take all the ingredients and throw them in an omelet, or blend them all together in a smoothie. Something about cooking the food alters the nutritional content, so he claims he needed to eat them more or less raw. The resulting “meals” were so unpalatable that he appears to have given up on the optimization techniques in this post. It seems the extreme end of the taste/health tradeoff is not where he wants to be. This suggests an open problem: find a good way to model (or lean from data) what foods taste good together, and in what quantities. One might be able to learn from a corpus of recipes, though I imagine that can only go so far for lightly-cooked ingredients. But with hypothetical constraints like, “penalize/prefer these foods being in the same meal”, one might be able to quantify the taste/health tradeoff in a way that makes my dad happy. Having an easy way to slide along the scale (rather than just naively optimize) would also potentially be useful.


If you remember how linear programs work, you can safely skip this section.

As a refresher, let’s outline how to model the nutrition problem as a linear program and recall the basic notation. The variables are food in 100 gram increments. So x_1 might be the amount of canned peas consumed, x_2 lobster meat, etc. All variables would have to be nonnegative, of course. The objective is to minimize the total number of calories consumed. If c_1 \geq 0 is the amount of calories in 100g of canned peas, then one would pay c_1x_1 in calories contributed by peas. If we have n different variables, then the objective function is the linear combination

\textup{cost}(\mathbf{x}) = \sum_{j=1}^n c_j x_j

We’re using boldface \mathbf{x} to represent the vector (x_1, \dots, x_n). Likewise, \mathbf{c} will represent the cost vector of foods (c_1, \dots, c_n). As we’ve seen many times, we can compactly write the sum above as an inner product \langle \mathbf{c}, \mathbf{x} \rangle.

Finally, we require that the amount of each nutrient combined in the stuff we buy meets some threshold. So for each nutrient we have a constraint. The easiest one is calories; we require the total number of calories consumed is at least (say) 2,000. So if a_j represents the number of calories in food j, we require \sum_{j=1}^n a_j x_j \geq 2000. We might also want to restrict a maximum number of calories, but in general having a diet with more calories implies higher cost, and so when the linear program minimizes cost we should expect it not to produce a diet with significantly more than 2,000 calories.

Since we have one set of nutrient information for each pair of (nutrient, food), we need to get fancier with the indexing. I’ll call a_{i,j} the amount of nutrient i in food j. Note that A = (a_{i,j}) will be a big matrix, so I’m saying that nutrients i represent the rows of the matrix and foods j represent the columns. That is, each row of the matrix represents the amount of one specific nutrient in all the foods, and each column represents the nutritional content of a single food. We’ll always use n to denote the number of foods, and m to denote the number of nutrients.

Finally, we have a lower and upper bound for each nutrient, which behind the scenes are converted into lower bounds (possibly negating the variables). This isn’t required to write the program, as we’ll see. In notation, we require that for every 1 \leq i \leq m, the nutrient constraint \sum_{j=1}^n a_{i,j} x_j \geq b_i is satisfied. If we again use vector notation for the constraints \mathbf{b}, we can write the entire set of constraints as a “matrix equation”

A \mathbf{x} \geq \mathbf{b}

And this means each entry of the vector you get from multiplying the left-hand-side is greater than the corresponding entry on the right-hand-side. So the entire linear program is summarized as follows

\displaystyle \begin{aligned} \textup{min } & \langle \mathbf{c} , \mathbf{x} \rangle  \\ \textup{such that } & A \mathbf{x}  \geq \mathbf{b} \\ & \mathbf{x}  \geq \mathbf{0} \end{aligned}

That’s the syntactical form of our linear program. Now all (!) we need to do is pick a set of foods and nutrients, and fill in the constants for A, \mathbf{c}, \mathbf{b}.

Nutrients and Foods

The easier of the two pieces of data is the nutrient constraints. The system used in the United States is called the Dietary Reference Intake system. It consists of five parts, which I’ve paraphrased from Wikipedia.

  • Estimated Average Requirements (EAR), expected to satisfy the needs of 50% of the people in an age group.
  • Recommended Dietary Allowances (RDA), the daily intake level considered sufficient to meet the requirements of 97.5% of healthy individuals (two standard deviations).
  • Adequate Intake (AI), where no RDA has been established. Meant to complement the RDA, but has less solid evidence.
  • Tolerable upper intake levels (UL), the highest level of daily consumption that have not shown harmful side effects.

While my dad come up with his own custom set of constraints, the ones I’ve posted on the github repository are essentially copy/paste from the current RDA/AI as a lower bound, with the UL as an upper bound. The values I selected are in a csv. Missing values in the upper bound column mean there is no upper bound. And sorry ladies, since it’s for my dad I chose the male values. Women have slightly different values due to different average size/weight.

Nutrient values for food are a little bit harder, because nutrient data isn’t easy to come by. There are a few databases out there, all of which are incomplete, and some of which charge for use. My dad spent a long time hunting down the nutrients (he wanted some additional special nutrients) for his top 200-odd foods.

Instead, in this post we’ll use the USDA’s free-to-use database of 8,000+ foods. It comes in a single, abbreviated, oddly-formatted text file which I’ve parsed into a csv and chosen an arbitrary subset of 800-ish foods to play around with.

Python OR Tools

Google’s ortools docs ask you to download a tarball to install their package, but I found that was unnecessary (perhaps it’s required to attach a third-party solver to their interface?). Instead, one can just pip install it.

pip3 install py3-ortools

Then in a python script, you can import the ortools library and create a simple linear program:

from ortools.linear_solver import pywraplp

solver = pywraplp.Solver('my_LP', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

x = solver.NumVar(0, 10, 'my first variable')
y = solver.NumVar(0, 10, 'my second variable')

solver.Add(x + y <= 7) solver.Add(x - 2 * y >= -2)

objective = solver.Objective()
objective.SetCoefficient(x, 3)
objective.SetCoefficient(y, 1)

status = solver.Solve()

if status not in [solver.OPTIMAL, solver.FEASIBLE]:
    raise Exception('Unable to find feasible solution')


This provides the basic idea of the library. You can use python’s operator overloading (to an extent) to make the constraints look nice in the source code.

Setting up the food LP

The main file diet_optimizer.py contains a definition for a class, which, in addition to loading the data, encapsulates all the variables and constraints.

class DietOptimizer(object):
    def __init__(self, nutrient_data_filename='nutrients.csv',

        self.food_table = # load data into a list of dicts
        self.constraints_data = # load data into a list of dicts

        self.solver = pywraplp.Solver('diet_optimizer', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

        self.objective = self.solver.Objective()
        for row in self.food_table:
            name = row['description']
            var = self.variable_dict[name]
            calories_in_food = row[calories_name]
            self.objective.SetCoefficient(var, calories_in_food)

We’ll get into the variables and constraints momentarily, but before that we can see the solve method

    def solve(self):
            Return a dictionary with 'foods' and 'nutrients' keys representing
            the solution and the nutrient amounts for the chosen diet
        status = self.solver.Solve()
        if status not in [self.solver.OPTIMAL, self.solver.FEASIBLE]:
            raise Exception('Unable to find feasible solution')

        chosen_foods = {
            food_name: var.solution_value()
            for food_name, var in self.variable_dict.items() if var.solution_value() > 1e-10

        self.chosen_foods = chosen_foods

        nutrients = {
            row['nutrient']: self.nutrients_in_diet(chosen_foods, row['nutrient'])
            for row in self.constraints_table

        return {
            'foods': chosen_foods,
            'nutrients': nutrients,

Here nutrients_in_diet is a helper function which, given a dictionary of foods and a nutrient, outputs the nutrient contents for that food. This can be used independently of the solver to evaluate the nutrient contents of a proposed diet.

Next we have the method to create the variables.

    def create_variable_dict(self):
            The variables are the amount of each food to include, denominated in units of 100g
        self.variable_dict = dict(
            (row['description'], self.solver.NumVar(0, 10, row['description']))
            for row in self.food_table

Each food must be present in a nonnegative amount, and I’ve imposed a cap of 10 (1kg) for any individual food. The reason for this is that I originally had a “water” constraint, and the linear program decided to optimize for that by asking one to drink 2L of red wine per day. I neglected to put in an alcohol nutrient (because it was not already there and I’m lazy), so instead I limited the amount of any individual food. It still seems like a reasonable constraint to impose that nobody would want to eat more than 1kg of any single food on one day.

Finally, we can construct the constraints. The core method takes a nutrient and a lower and upper bound:

    def create_constraint(self, nutrient_name, lower, upper):
            Each constraint is a lower and upper bound on the
            sum of all food variables, scaled by how much of the
            relevant nutrient is in that food.
        if not lower:
            print('Warning! Nutrient %s has no lower bound!'.format(nutrient_name))

        sum_of_foods = self.foods_for_nutrient(nutrient_name)
        constraint_lb = lower <= sum_of_foods
        self.constraint_dict[nutrient_name + ' (lower bound)'] = constraint_lb

        if not upper:
            return  # no upper bound in the data

        constraint_ub = sum_of_foods <= upper
        self.constraint_dict[nutrient_name + ' (upper bound)'] = constraint_ub

This method is mostly bookkeeping, while foods_for_nutrient does the individual nutrient lookup. Note that one is not allowed to do a double-ended inequality like self.solver.Add(lower <= sum_of_foods <= upper). If you try, ortools will ignore one end of the bound.

    def foods_for_nutrient(self, nutrient_name, scale_by=1.0):
        # a helper function that computes the scaled sum of all food variables
        # for a given nutrient
        relevant_foods = []
        for row in self.food_table:
            var = self.variable_dict[row['description']]
            nutrient_amount = row[nutrient_name]
            if nutrient_amount > 0:
                relevant_foods.append(scale_by * nutrient_amount * var)

        if len(relevant_foods) == 0:
            print('Warning! Nutrient %s has no relevant foods!'.format(nutrient_name))

        return SumArray(relevant_foods)

Here we are a bit inefficient by iterating through the entire table, instead of just those foods containing the nutrient in question. But there are only a few hundred foods in our sample database (8,000 if you use the entire SR28 database), and so the optimization isn’t necessary.

Also note that while ortools allows one to use the sum method, it does so in a naive way, because sum([a, b, c]) becomes ((a + b) + c), which is a problem because if the list is too long their library exceeds Python’s default recursion limit. Instead we construct a SumArray by hand.

Finally, though we omitted it here for simplicity, throughout the code in the Github repository you’ll see references to percent_constraints. This exists because some nutrients, like fat, are recommended to be restricted to a percentage of calories, not an absolute amount. So we define a mechanism to specify a nutrient should be handled with percents, and a mapping from grams to calories. This ends up using the scale_by parameter above, both to scale fat by 9 calories per gram, and to scale calories to be a percentage. Cf. the special function for creating percent constraints.

Finally, we have methods just for pretty-printing the optimization problem and the solution, called summarize_optimization_problem and summarize_solution, respectively.

Running the solver

Invoking the solver is trivial.

if __name__ == "__main__":
    solver = DietOptimizer()
    # solver.summarize_optimization_problem()
    solution = solver.solve()

With the example foods and constraints in the github repo, the result is:


   38.5g: CURRY POWDER

Nutrient totals

    1700.0 mg   calcium                   [1700.0, 2100.0]
     130.0 g    carbohydrate              [130.0, ]
     550.0 mg   choline                   [550.0, 3500.0]
       3.3 mg   copper                    [0.9, 10.0]
      60.5 g    dietary fiber             [38.0, ]
     549.7 μg   dietary folate            [400.0, 1000.0]
    1800.0 kcal energy                    [1800.0, 2100.0]
      32.4 mg   iron                      [18.0, 45.0]
     681.7 mg   magnesium                 [420.0, ]
       7.3 mg   manganese                 [2.3, 11.0]
      35.0 mg   niacin                    [16.0, 35.0]
      11.7 mg   pantothenic acid          [5.0, ]
    2554.3 mg   phosphorus                [1250.0, 4000.0]
      14.0 g    polyunsaturated fatty acids  [1.6, 16.0]
    4700.0 mg   potassium                 [4700.0, ]
     165.2 g    protein                   [56.0, ]
       2.8 mg   riboflavin                [1.3, ]
     220.8 μg   selenium                  [55.0, 400.0]
    1500.0 mg   sodium                    [1500.0, 2300.0]
       2.4 mg   thiamin                   [1.2, ]
      59.4 g    total fat                 [20.0, 35.0]        (29.7% of calories)
    3000.0 μg   vitamin a                 [900.0, 3000.0]
      23.0 μg   vitamin b12               [2.4, ]
       2.4 mg   vitamin b6                [1.7, 100.0]
     157.6 mg   vitamin c                 [90.0, 2000.0]
     893.0 iu   vitamin d                 [400.0, 4000.0]
      15.0 mg   vitamin e                 [15.0, 1000.0]
     349.4 μg   vitamin k                 [120.0, ]
      17.2 mg   zinc                      [11.0, 40.0]

Unfortunately, this asks for a kilo of raw alfalfa sprouts, which I definitely would not eat. The problem is that alfalfa is ridiculously nutritious. Summarizing the diet with the print_details flag set, we see they contribute a significant amount of nearly every important nutrient.

	18.8% of calcium (mg)
	16.2% of carbohydrate (g)
	26.2% of choline (mg)
	47.3% of copper (mg)
	31.4% of dietary fiber (g)
	65.5% of dietary folate (μg)
	12.8% of energy (kcal)
	29.7% of iron (mg)
	39.6% of magnesium (mg)
	25.6% of manganese (mg)
	13.7% of niacin (mg)
	48.2% of pantothenic acid (mg)
	27.4% of phosphorus (mg)
	29.3% of polyunsaturated fatty acids (g)
	16.8% of potassium (mg)
	24.2% of protein (g)
	45.1% of riboflavin (mg)
	2.7% of selenium (μg)
	4.0% of sodium (mg)
	31.9% of thiamin (mg)
	11.6% of total fat (g)
	2.7% of vitamin a (μg)
	13.9% of vitamin b6 (mg)
	52.0% of vitamin c (mg)
	1.3% of vitamin e (mg)
	87.3% of vitamin k (μg)
	53.5% of zinc (mg)

But ignoring that, we have some reasonable sounding foods: fish, sweet potato, rosemary (okay that’s a ton of rosemary), egg and wine. I bet someone could make a tasty meal from those rough ingredients.

Extensions and Exercises

No tutorial would be complete without exercises. All of these are related to the actual linear program modeling problem.

Food groups: Suppose you had an additional column for each food called “food group.” You want to create a balanced meal, so you add additional constraint for each food group requiring some food, but not too much, from each group. Furthermore, for certain foods like spices, one could add a special constraint for each spice requiring not more than, say, 20g of any given spice. Or else, as one can see, the linear program can produce diets involving obscenely large amounts of spices.

Starting from a given set of foods: Supposing you have an idea for a recipe (or couple of recipes for a day’s meals), but you want to add whatever else is needed to make it meet the nutritional standards. Modify the LP to allow for this.

Integer variations: The ortools package supports integer programming as well. All you need to do to enable this is change the solver type to CBC_MIXED_INTEGER_PROGRAMMING. The solver will run as normal, and now you can create integer-valued variables using solver.IntVar instead of NumVar. Using binary variables, one can define logical OR constraints (figure out how this must work). Define a new binary variable for each food, and define a constraint that makes this variable 0 when the food is not used, and 1 when the food is used. Then add a term to the optimization problem that penalizes having too many different foods in a daily diet.

(Harder) Sampling: Part of the motivation for this project is to come up with a number of different dishes that are all “good” with respect to this optimization problem. Perhaps there is more than one optimal solution, or perhaps there are qualitatively different diets that are close enough to optimal. However, this implementation produces a deterministic output. Find a way to introduce randomness into the program, so that you can get more than one solution.

Feel free to suggest other ideas, and extend or rewrite the model to do something completely different. The sky’s the limit!