Making Hybrid Images | Neural Networks and Backpropagation |
Elliptic Curves and Cryptography |

Bezier Curves and Picasso | Computing Homology | Probably Approximately Correct – A Formal Theory of Learning |

# Binary Search on Graphs

Binary search is one of the most basic algorithms I know. Given a sorted list of comparable items and a target item being sought, binary search looks at the middle of the list, and compares it to the target. If the target is larger, we repeat on the smaller half of the list, and vice versa.

With each comparison the binary search algorithm cuts the search space in half. The result is a guarantee of no more than comparisons, for a total runtime of . Neat, efficient, useful.

There’s always another angle.

What if we tried to do binary search on a graph? Most graph search algorithms, like breadth- or depth-first search, take linear time, and they were invented by some pretty smart cookies. So if binary search on a graph is going to make any sense, it’ll have to use more information beyond what a normal search algorithm has access to.

For binary search on a list, it’s the fact that the list is sorted, and we can compare against the sought item to guide our search. But really, the key piece of information isn’t related to the comparability of the items. It’s that we can eliminate half of the search space at every step. The “compare against the target” step can be thought of a black box that replies to queries of the form, “Is this the thing I’m looking for?” with responses of the form, “Yes,” or, “No, but look over here instead.”

As long as the answers to your queries are sufficiently helpful, meaning they allow you to cut out large portions of your search space at each step, then you probably have a good algorithm on your hands. Indeed, there’s a natural model for graphs, defined in a 2015 paper of Emamjomeh-Zadeh, Kempe, and Singhal that goes as follows.

You’re given as input an undirected, weighted graph , with weights for . You can see the entire graph, and you may ask questions of the form, “Is vertex the target?” Responses will be one of two things:

- Yes (you win!)
- No, but is an edge out of on a shortest path from to the true target.

Your goal is to find the target vertex with the minimum number of queries.

Obviously this only works if is connected, but slight variations of everything in this post work for disconnected graphs. (The same is not true in general for directed graphs)

When the graph is a line, this “reduces” to binary search in the sense that the same basic idea of binary search works: start in the middle of the graph, and the edge you get in response to a query will tell you in which half of the graph to continue.

And if we make this example only slightly more complicated, the generalization should become obvious:

Here, we again start at the “center vertex,” and the response to our query will eliminate one of the two halves. But then how should we pick the next vertex, now that we no longer have a linear order to rely on? It should be clear, choose the “center vertex” of whichever half we end up in. This choice can be formalized into a rule that works even when there’s not such obvious symmetry, and it turns out to always be the right choice.

**Definition: **A *median *of a weighted graph with respect to a subset of vertices is a vertex (not necessarily in ) which minimizes the sum of distances to vertices in . More formally, it minimizes

,

where is the sum of the edge weights along a shortest path from to .

And so generalizing binary search to this query-model on a graph results in the following algorithm, which whittles down the search space by querying the median at every step.

**Algorithm: **Binary search on graphs. Input is a graph .

- Start with a set of candidates .
- While we haven’t found the target and :
- Query the median of , and stop if you’ve found the target.
- Otherwise, let be the response edge, and compute the set of all vertices for which is on a shortest path from to . Call this set .
- Replace with .

- Output the only remaining vertex in

Indeed, as we’ll see momentarily, a python implementation is about as simple. The meat of the work is in computing the median and the set , both of which are slight variants of Dijkstra’s algorithm for computing shortest paths.

The theorem, which is straightforward and well written by Emamjomeh-Zadeh et al. (only about a half page on page 5), is that this algorithm requires only queries, just like binary search.

Before we dive into an implementation, there’s a catch. Even though we are guaranteed only many queries, because of our Dijkstra’s algorithm implementation, we’re definitely not going to get a logarithmic time algorithm. So in what situation would this be useful?

Here’s where we use the “theory” trick of making up a fanciful problem and only later finding applications for it (which, honestly, has been quite successful in computer science). In this scenario we’re treating the query mechanism as a black box. It’s natural to imagine that the queries are expensive, and a resource we want to optimize for. As an example the authors bring up in a followup paper, the graph might be the set of clusterings of a dataset, and the query involves a human looking at the data and responding that a cluster should be split, or that two clusters should be joined. Of course, for clustering the underlying graph is too large to process, so the median-finding algorithm needs to be implicit. But the essential point is clear: sometimes the query is the most expensive part of the algorithm.

Alright, now let’s implement it! The complete code is on Github as always.

## Always be implementing

We start with a slight variation of Dijkstra’s algorithm. Here we’re given as input a single “starting” vertex, and we produce as output a list of all shortest paths from the start to all possible destination vertices.

We start with a bare-bones graph data structure.

from collections import defaultdict from collections import namedtuple Edge = namedtuple('Edge', ('source', 'target', 'weight')) class Graph: # A bare-bones implementation of a weighted, undirected graph def __init__(self, vertices, edges=tuple()): self.vertices = vertices self.incident_edges = defaultdict(list) for edge in edges: self.add_edge( edge[0], edge[1], 1 if len(edge) == 2 else edge[2] # optional weight ) def add_edge(self, u, v, weight=1): self.incident_edges[u].append(Edge(u, v, weight)) self.incident_edges[v].append(Edge(v, u, weight)) def edge(self, u, v): return [e for e in self.incident_edges[u] if e.target == v][0]

And then, since most of the work in Dijkstra’s algorithm is tracking information that you build up as you search the graph, we define the “output” data structure, a dictionary of edge weights paired with back-pointers for the discovered shortest paths.

class DijkstraOutput: def __init__(self, graph, start): self.start = start self.graph = graph # the smallest distance from the start to the destination v self.distance_from_start = {v: math.inf for v in graph.vertices} self.distance_from_start[start] = 0 # a list of predecessor edges for each destination # to track a list of possibly many shortest paths self.predecessor_edges = {v: [] for v in graph.vertices} def found_shorter_path(self, vertex, edge, new_distance): # update the solution with a newly found shorter path self.distance_from_start[vertex] = new_distance if new_distance < self.distance_from_start[vertex]: self.predecessor_edges[vertex] = [edge] else: # tie for multiple shortest paths self.predecessor_edges[vertex].append(edge) def path_to_destination_contains_edge(self, destination, edge): predecessors = self.predecessor_edges[destination] if edge in predecessors: return True return any(self.path_to_destination_contains_edge(e.source, edge) for e in predecessors) def sum_of_distances(self, subset=None): subset = subset or self.graph.vertices return sum(self.distance_from_start[x] for x in subset)

The actual Dijkstra algorithm then just does a “breadth-first” (priority-queue-guided) search through , updating the metadata as it finds shorter paths.

def single_source_shortest_paths(graph, start): ''' Compute the shortest paths and distances from the start vertex to all possible destination vertices. Return an instance of DijkstraOutput. ''' output = DijkstraOutput(graph, start) visit_queue = [(0, start)] while len(visit_queue) > 0: priority, current = heapq.heappop(visit_queue) for incident_edge in graph.incident_edges[current]: v = incident_edge.target weight = incident_edge.weight distance_from_current = output.distance_from_start[current] + weight if distance_from_current <= output.distance_from_start[v]: output.found_shorter_path(v, incident_edge, distance_from_current) heapq.heappush(visit_queue, (distance_from_current, v)) return output

Finally, we implement the median-finding and -computing subroutines:

def possible_targets(graph, start, edge): ''' Given an undirected graph G = (V,E), an input vertex v in V, and an edge e incident to v, compute the set of vertices w such that e is on a shortest path from v to w. ''' dijkstra_output = dijkstra.single_source_shortest_paths(graph, start) return set(v for v in graph.vertices if dijkstra_output.path_to_destination_contains_edge(v, edge)) def find_median(graph, vertices): ''' Compute as output a vertex in the input graph which minimizes the sum of distances to the input set of vertices ''' best_dijkstra_run = min( (single_source_shortest_paths(graph, v) for v in graph.vertices), key=lambda run: run.sum_of_distances(vertices) ) return best_dijkstra_run.start

And then the core algorithm

QueryResult = namedtuple('QueryResult', ('found_target', 'feedback_edge')) def binary_search(graph, query): ''' Find a target node in a graph, with queries of the form "Is x the target?" and responses either "You found the target!" or "Here is an edge on a shortest path to the target." ''' candidate_nodes = set(x for x in graph.vertices) # copy while len(candidate_nodes) > 1: median = find_median(graph, candidate_nodes) query_result = query(median) if query_result.found_target: return median else: edge = query_result.feedback_edge legal_targets = possible_targets(graph, median, edge) candidate_nodes = candidate_nodes.intersection(legal_targets) return candidate_nodes.pop()

Here’s an example of running it on the example graph we used earlier in the post:

''' Graph looks like this tree, with uniform weights a k b j cfghi d l e m ''' G = Graph(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm'], [ ('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('c', 'f'), ('f', 'g'), ('g', 'h'), ('h', 'i'), ('i', 'j'), ('j', 'k'), ('i', 'l'), ('l', 'm'), ]) def simple_query(v): ans = input("is '%s' the target? [y/N] " % v) if ans and ans.lower()[0] == 'y': return QueryResult(True, None) else: print("Please input a vertex on the shortest path between" " '%s' and the target. The graph is: " % v) for w in G.incident_edges: print("%s: %s" % (w, G.incident_edges[w])) target = None while target not in G.vertices: target = input("Input neighboring vertex of '%s': " % v) return QueryResult( False, G.edge(v, target) ) output = binary_search(G, simple_query) print("Found target: %s" % output)

The query function just prints out a reminder of the graph and asks the user to answer the query with a yes/no and a relevant edge if the answer is no.

An example run:

is 'g' the target? [y/N] n Please input a vertex on the shortest path between 'g' and the target. The graph is: e: [Edge(source='e', target='d', weight=1)] i: [Edge(source='i', target='h', weight=1), Edge(source='i', target='j', weight=1), Edge(source='i', target='l', weight=1)] g: [Edge(source='g', target='f', weight=1), Edge(source='g', target='h', weight=1)] l: [Edge(source='l', target='i', weight=1), Edge(source='l', target='m', weight=1)] k: [Edge(source='k', target='j', weight=1)] j: [Edge(source='j', target='i', weight=1), Edge(source='j', target='k', weight=1)] c: [Edge(source='c', target='b', weight=1), Edge(source='c', target='d', weight=1), Edge(source='c', target='f', weight=1)] f: [Edge(source='f', target='c', weight=1), Edge(source='f', target='g', weight=1)] m: [Edge(source='m', target='l', weight=1)] d: [Edge(source='d', target='c', weight=1), Edge(source='d', target='e', weight=1)] h: [Edge(source='h', target='g', weight=1), Edge(source='h', target='i', weight=1)] b: [Edge(source='b', target='a', weight=1), Edge(source='b', target='c', weight=1)] a: [Edge(source='a', target='b', weight=1)] Input neighboring vertex of 'g': f is 'c' the target? [y/N] n Please input a vertex on the shortest path between 'c' and the target. The graph is: [...] Input neighboring vertex of 'c': d is 'd' the target? [y/N] n Please input a vertex on the shortest path between 'd' and the target. The graph is: [...] Input neighboring vertex of 'd': e Found target: e

## A likely story

The binary search we implemented in this post is pretty minimal. In fact, the more interesting part of the work of Emamjomeh-Zadeh et al. is the part where the response to the query can be wrong with some unknown probability.

In this case, there can be many shortest paths that are valid responses to a query, in addition to all the invalid responses. In particular, this rules out the strategy of asking the same query multiple times and taking the majority response. If the error rate is 1/3, and there are two shortest paths to the target, you can get into a situation in which you see three responses equally often and can’t choose which one is the liar.

Instead, the technique Emamjomeh-Zadeh et al. use is based on the Multiplicative Weights Update Algorithm (it strikes again!). Each query gives a multiplicative increase (or decrease) on the set of nodes that are consistent targets under the assumption that query response is correct. There are a few extra details and some postprocessing to avoid unlikely outcomes, but that’s the basic idea. Implementing it would be an excellent exercise for readers interested in diving deeper into a recent research paper (or to flex their math muscles).

But even deeper, this model of “query and get advice on how to improve” is a classic learning model first formally studied by Dana Angluin (my academic grand-advisor). In her model, one wants to design an algorithm to learn a classifier. The allowed queries are *membership* and *equivalence* queries. A membership is essentially, “What’s its label of this element?” and an equivalence query has the form, “Is this the right classifier?” If the answer is no, a mislabeled example is provided.

This is different from the usual machine learning assumption, because the learning algorithm gets to *construct* an* *example it wants to get more information about, instead of simply relying on a randomly generated subset of data. The goal is to minimize the number of queries before the target hypothesis is learned exactly. And indeed, as we saw in this post, if you have a little extra time to analyze the problem space, you can craft queries that extract quite a lot of information.

Indeed, the model we presented here for binary search on graphs is the natural analogue of an equivalence query for a search problem: instead of a mislabeled counterexample, you get a nudge in the right direction toward the target. Pretty neat!

There are a few directions we could take from here: (1) implement the Multiplicative Weights version of the algorithm, (2) apply this technique to a problem like ranking or clustering, or (3) cover theoretical learning models like membership and equivalence queries in more detail. What interests you?

Until next time!

# Linear Programming and Healthy Diets — Part 2

Previously in this series:

## Foods of the Father

My dad’s an interesting guy.

Every so often he picks up a health trend and/or weight loss goal that would make many people’s jaw drop. For example, we once went on a 5-day, 50-mile backpacking trip in the Grand Tetons, and my dad brought one of these per day for dinner, and had vitamin tablets for the rest of his sustenance. The rest of us planned for around 3,000 calories per day. He’s tried the “high fat” and “no fat” diets, and quite a few others. He’s concerned with losing weight, but also living longer, so he’s into caloric restriction among other things.

Recently he asked me to help him optimize his diet. He described a scheme he was performing by hand to minimize the number of calories he consumed per day while maintaining the minimum nutrients required by the FDA’s recommendations. He had a spreadsheet with the nutrients for each food, and a spreadsheet with the constraints on each nutrient. He wanted to come up with a collection of meals (or just throw all the food into a blender) that taste within reason but meet these criteria.

He was essentially solving a linear program by hand, roughly as best as one can, with a few hundred variables! After asking me whether there was “any kind of math” that could help him automate his laborious efforts, I decided to lend a hand. After all, it’s been over three years since I promised my readers I’d apply linear programming to optimize a diet (though it was optimizing for cost rather than calories).

Though it never went beyond what linear programming can handle, pretty quickly my dad’s requests specialized beyond what would interest a general reader. Perhaps this is the nature of math consulting, but it seems when you give someone what they want, they realize that’s not what they wanted.

But the basic ideas are still relevant enough. My solution is a hundred-ish lines of python to set up the input, using Google’s open source operations research tools as the core solver. Disclaimer: I work for Google but I don’t work on the team that wrote this tool. Also nothing in this post represents the views of my employer. It’s just me, and the scale of this problem is laughable for Google to care about anyway.

So this post is half tutorial showing how to use the or-tools python wrapper (it’s only somewhat documented), and half showing a realistic use case for linear programming.

However, don’t let this post dissuade you from the belief that linear programming is useful beyond dieting. People use linear programming to solve *all kinds* of interesting problems. Here are a few I came across in just the last few weeks:

- Factoring big matrices
- Solving zero sum games
- Optimally seating guests at a wedding (Technically they use integer programming, which or-tools can do)

And that’s not even to mention the ubiquitous applications in operations research (network flow, production optimization, economics) that every large company relies on. The applications seem endless!

As usual, all of the code and data we use in the making of this post is available at this blog’s Github page.

## Refresher

If you remember how linear programs work, you can safely skip this section.

As a refresher, let’s outline how to model the nutrition problem as a linear program and recall the basic notation. The variables are food in 100 gram increments. So might be the amount of canned peas consumed, lobster meat, etc. All variables would have to be nonnegative, of course. The objective is to minimize the total number of calories consumed. If is the amount of calories in 100g of canned peas, then one would pay in calories contributed by peas. If we have different variables, then the objective function is the linear combination

We’re using boldface to represent the vector . Likewise, will represent the cost vector of foods . As we’ve seen many times, we can compactly write the sum above as an inner product .

Finally, we require that the amount of each nutrient combined in the stuff we buy meets some threshold. So for each nutrient we have a constraint. The easiest one is calories; we require the total number of calories consumed is at least (say) 2,000. So if represents the number of calories in food , we require . We might also want to restrict a maximum number of calories, but in general having a diet with more calories implies higher cost, and so when the linear program minimizes cost we should expect it not to produce a diet with significantly more than 2,000 calories.

Since we have one set of nutrient information for each pair of (nutrient, food), we need to get fancier with the indexing. I’ll call the amount of nutrient in food . Note that will be a big matrix, so I’m saying that nutrients represent the rows of the matrix and foods represent the columns. That is, each row of the matrix represents the amount of one specific nutrient in all the foods, and each column represents the nutritional content of a single food. We’ll always use to denote the number of foods, and to denote the number of nutrients.

Finally, we have a lower and upper bound for each nutrient, which behind the scenes are converted into lower bounds (possibly negating the variables). This isn’t required to write the program, as we’ll see. In notation, we require that for every , the nutrient constraint is satisfied. If we again use vector notation for the constraints , we can write the entire set of constraints as a “matrix equation”

And this means each entry of the vector you get from multiplying the left-hand-side is greater than the corresponding entry on the right-hand-side. So the entire linear program is summarized as follows

That’s the syntactical form of our linear program. Now all (!) we need to do is pick a set of foods and nutrients, and fill in the constants for .

## Nutrients and Foods

The easier of the two pieces of data is the nutrient constraints. The system used in the United States is called the Dietary Reference Intake system. It consists of five parts, which I’ve paraphrased from Wikipedia.

**Estimated Average Requirements**(EAR), expected to satisfy the needs of 50% of the people in an age group.**Recommended Dietary Allowances**(RDA), the daily intake level considered sufficient to meet the requirements of 97.5% of healthy individuals (two standard deviations).**Adequate Intake**(AI), where no RDA has been established. Meant to complement the RDA, but has less solid evidence.**Tolerable upper intake levels**(UL), the highest level of daily consumption that have not shown harmful side effects.

While my dad come up with his own custom set of constraints, the ones I’ve posted on the github repository are essentially copy/paste from the current RDA/AI as a lower bound, with the UL as an upper bound. The values I selected are in a csv. Missing values in the upper bound column mean there is no upper bound. And sorry ladies, since it’s for my dad I chose the male values. Women have slightly different values due to different average size/weight.

Nutrient values for food are a little bit harder, because nutrient data isn’t easy to come by. There are a few databases out there, all of which are incomplete, and some of which charge for use. My dad spent a long time hunting down the nutrients (he wanted some additional special nutrients) for his top 200-odd foods.

Instead, in this post we’ll use the USDA’s free-to-use database of 8,000+ foods. It comes in a single, abbreviated, oddly-formatted text file which I’ve parsed into a csv and chosen an arbitrary subset of 800-ish foods to play around with.

## Python OR Tools

Google’s ortools docs ask you to download a tarball to install their package, but I found that was unnecessary (perhaps it’s required to attach a third-party solver to their interface?). Instead, one can just pip install it.

pip3 install py3-ortools

Then in a python script, you can import the ortools library and create a simple linear program:

from ortools.linear_solver import pywraplp solver = pywraplp.Solver('my_LP', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING) x = solver.NumVar(0, 10, 'my first variable') y = solver.NumVar(0, 10, 'my second variable') solver.Add(x + y <= 7) solver.Add(x - 2 * y >= -2) objective = solver.Objective() objective.SetCoefficient(x, 3) objective.SetCoefficient(y, 1) objective.SetMaximization() status = solver.Solve() if status not in [solver.OPTIMAL, solver.FEASIBLE]: raise Exception('Unable to find feasible solution') print(x.solution_value()) print(y.solution_value())

This provides the basic idea of the library. You can use python’s operator overloading (to an extent) to make the constraints look nice in the source code.

## Setting up the food LP

The main file `diet_optimizer.py`

contains a definition for a class, which, in addition to loading the data, encapsulates all the variables and constraints.

class DietOptimizer(object): def __init__(self, nutrient_data_filename='nutrients.csv', nutrient_constraints_filename='constraints.csv'): self.food_table = # load data into a list of dicts self.constraints_data = # load data into a list of dicts self.solver = pywraplp.Solver('diet_optimizer', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING) self.create_variable_dict() self.create_constraints() self.objective = self.solver.Objective() for row in self.food_table: name = row['description'] var = self.variable_dict[name] calories_in_food = row[calories_name] self.objective.SetCoefficient(var, calories_in_food) self.objective.SetMinimization()

We’ll get into the variables and constraints momentarily, but before that we can see the solve method

def solve(self): ''' Return a dictionary with 'foods' and 'nutrients' keys representing the solution and the nutrient amounts for the chosen diet ''' status = self.solver.Solve() if status not in [self.solver.OPTIMAL, self.solver.FEASIBLE]: raise Exception('Unable to find feasible solution') chosen_foods = { food_name: var.solution_value() for food_name, var in self.variable_dict.items() if var.solution_value() > 1e-10 } self.chosen_foods = chosen_foods nutrients = { row['nutrient']: self.nutrients_in_diet(chosen_foods, row['nutrient']) for row in self.constraints_table } return { 'foods': chosen_foods, 'nutrients': nutrients, }

Here `nutrients_in_diet`

is a helper function which, given a dictionary of foods and a nutrient, outputs the nutrient contents for that food. This can be used independently of the solver to evaluate the nutrient contents of a proposed diet.

Next we have the method to create the variables.

def create_variable_dict(self): ''' The variables are the amount of each food to include, denominated in units of 100g ''' self.variable_dict = dict( (row['description'], self.solver.NumVar(0, 10, row['description'])) for row in self.food_table )

Each food must be present in a nonnegative amount, and I’ve imposed a cap of 10 (1kg) for any individual food. The reason for this is that I originally had a “water” constraint, and the linear program decided to optimize for that by asking one to drink 2L of red wine per day. I neglected to put in an alcohol nutrient (because it was not already there and I’m lazy), so instead I limited the amount of any individual food. It still seems like a reasonable constraint to impose that nobody would want to eat more than 1kg of any single food on one day.

Finally, we can construct the constraints. The core method takes a nutrient and a lower and upper bound:

def create_constraint(self, nutrient_name, lower, upper): ''' Each constraint is a lower and upper bound on the sum of all food variables, scaled by how much of the relevant nutrient is in that food. ''' if not lower: print('Warning! Nutrient %s has no lower bound!'.format(nutrient_name)) return sum_of_foods = self.foods_for_nutrient(nutrient_name) constraint_lb = lower <= sum_of_foods self.solver.Add(constraint_lb) self.constraint_dict[nutrient_name + ' (lower bound)'] = constraint_lb if not upper: return # no upper bound in the data constraint_ub = sum_of_foods <= upper self.solver.Add(constraint_ub) self.constraint_dict[nutrient_name + ' (upper bound)'] = constraint_ub

This method is mostly bookkeeping, while `foods_for_nutrient`

does the individual nutrient lookup. Note that one is not allowed to do a double-ended inequality like `self.solver.Add(lower <= sum_of_foods <= upper)`

. If you try, ortools will ignore one end of the bound.

def foods_for_nutrient(self, nutrient_name, scale_by=1.0): # a helper function that computes the scaled sum of all food variables # for a given nutrient relevant_foods = [] for row in self.food_table: var = self.variable_dict[row['description']] nutrient_amount = row[nutrient_name] if nutrient_amount > 0: relevant_foods.append(scale_by * nutrient_amount * var) if len(relevant_foods) == 0: print('Warning! Nutrient %s has no relevant foods!'.format(nutrient_name)) return return SumArray(relevant_foods)

Here we are a bit inefficient by iterating through the entire table, instead of just those foods containing the nutrient in question. But there are only a few hundred foods in our sample database (8,000 if you use the entire SR28 database), and so the optimization isn’t necessary.

Also note that while ortools allows one to use the `sum`

method, it does so in a naive way, because `sum([a, b, c])`

becomes `((a + b) + c)`

, which is a problem because if the list is too long their library exceeds Python’s default recursion limit. Instead we construct a `SumArray`

by hand.

Finally, though we omitted it here for simplicity, throughout the code in the Github repository you’ll see references to `percent_constraints`

. This exists because some nutrients, like fat, are recommended to be restricted to a percentage of calories, not an absolute amount. So we define a mechanism to specify a nutrient should be handled with percents, and a mapping from grams to calories. This ends up using the `scale_by`

parameter above, both to scale fat by 9 calories per gram, and to scale calories to be a percentage. Cf. the special function for creating percent constraints.

Finally, we have methods just for pretty-printing the optimization problem and the solution, called `summarize_optimization_problem`

and `summarize_solution`

, respectively.

## Running the solver

Invoking the solver is trivial.

if __name__ == "__main__": solver = DietOptimizer() # solver.summarize_optimization_problem() solution = solver.solve() solver.summarize_solution(solution)

With the example foods and constraints in the github repo, the result is:

Diet: -------------------------------------------------- 298.9g: ALCOHOLIC BEV,WINE,TABLE,WHITE,MUSCAT 1000.0g: ALFALFA SEEDS,SPROUTED,RAW 38.5g: CURRY POWDER 2.1g: CUTTLEFISH,MXD SP,CKD,MOIST HEAT 31.3g: EGG,WHL,CKD,HARD-BOILED 24.0g: LOTUS ROOT,CKD,BLD,DRND,WO/SALT 296.5g: MACKEREL,JACK,CND,DRND SOL 161.0g: POMPANO,FLORIDA,CKD,DRY HEAT 87.5g: ROSEMARY,FRESH 239.1g: SWEET POTATO,CKD,BKD IN SKN,FLESH,WO/ SALT Nutrient totals -------------------------------------------------- 1700.0 mg calcium [1700.0, 2100.0] 130.0 g carbohydrate [130.0, ] 550.0 mg choline [550.0, 3500.0] 3.3 mg copper [0.9, 10.0] 60.5 g dietary fiber [38.0, ] 549.7 μg dietary folate [400.0, 1000.0] 1800.0 kcal energy [1800.0, 2100.0] 32.4 mg iron [18.0, 45.0] 681.7 mg magnesium [420.0, ] 7.3 mg manganese [2.3, 11.0] 35.0 mg niacin [16.0, 35.0] 11.7 mg pantothenic acid [5.0, ] 2554.3 mg phosphorus [1250.0, 4000.0] 14.0 g polyunsaturated fatty acids [1.6, 16.0] 4700.0 mg potassium [4700.0, ] 165.2 g protein [56.0, ] 2.8 mg riboflavin [1.3, ] 220.8 μg selenium [55.0, 400.0] 1500.0 mg sodium [1500.0, 2300.0] 2.4 mg thiamin [1.2, ] 59.4 g total fat [20.0, 35.0] (29.7% of calories) 3000.0 μg vitamin a [900.0, 3000.0] 23.0 μg vitamin b12 [2.4, ] 2.4 mg vitamin b6 [1.7, 100.0] 157.6 mg vitamin c [90.0, 2000.0] 893.0 iu vitamin d [400.0, 4000.0] 15.0 mg vitamin e [15.0, 1000.0] 349.4 μg vitamin k [120.0, ] 17.2 mg zinc [11.0, 40.0]

Unfortunately, this asks for a kilo of raw alfalfa sprouts, which I definitely would not eat. The problem is that alfalfa is ridiculously nutritious. Summarizing the diet with the `print_details`

flag set, we see they contribute a significant amount of nearly every important nutrient.

1000.0g: ALFALFA SEEDS,SPROUTED,RAW 18.8% of calcium (mg) 16.2% of carbohydrate (g) 26.2% of choline (mg) 47.3% of copper (mg) 31.4% of dietary fiber (g) 65.5% of dietary folate (μg) 12.8% of energy (kcal) 29.7% of iron (mg) 39.6% of magnesium (mg) 25.6% of manganese (mg) 13.7% of niacin (mg) 48.2% of pantothenic acid (mg) 27.4% of phosphorus (mg) 29.3% of polyunsaturated fatty acids (g) 16.8% of potassium (mg) 24.2% of protein (g) 45.1% of riboflavin (mg) 2.7% of selenium (μg) 4.0% of sodium (mg) 31.9% of thiamin (mg) 11.6% of total fat (g) 2.7% of vitamin a (μg) 13.9% of vitamin b6 (mg) 52.0% of vitamin c (mg) 1.3% of vitamin e (mg) 87.3% of vitamin k (μg) 53.5% of zinc (mg)

But ignoring that, we have some reasonable sounding foods: fish, sweet potato, rosemary (okay that’s a ton of rosemary), egg and wine. I bet someone could make a tasty meal from those rough ingredients.

## Extensions and Exercises

No tutorial would be complete without exercises. All of these are related to the actual linear program modeling problem.

**Food groups: **Suppose you had an additional column for each food called “food group.” You want to create a balanced meal, so you add additional constraint for each food group requiring some food, but not too much, from each group. Furthermore, for certain foods like spices, one could add a special constraint for each spice requiring not more than, say, 20g of any given spice. Or else, as one can see, the linear program can produce diets involving obscenely large amounts of spices.

**Starting from a given set of foods: **Supposing you have an idea for a recipe (or couple of recipes for a day’s meals), but you want to add whatever else is needed to make it meet the nutritional standards. Modify the LP to allow for this.

**Integer variations: **The ortools package supports integer programming as well. All you need to do to enable this is change the solver type to `CBC_MIXED_INTEGER_PROGRAMMING`

. The solver will run as normal, and now you can create integer-valued variables using `solver.IntVar`

instead of `NumVar`

. Using binary variables, one can define logical OR constraints (figure out how this must work). Define a new binary variable for each food, and define a constraint that makes this variable 0 when the food is not used, and 1 when the food is used. Then add a term to the optimization problem that penalizes having too many different foods in a daily diet.

**(Harder) Sampling: **Part of the motivation for this project is to come up with a number of different dishes that are all “good” with respect to this optimization problem. Perhaps there is more than one optimal solution, or perhaps there are qualitatively different diets that are close enough to optimal. However, this implementation produces a deterministic output. Find a way to introduce randomness into the program, so that you can get more than one solution.

Feel free to suggest other ideas, and extend or rewrite the model to do something completely different. The sky’s the limit!

# Notes on Math and Gerrymandering

Last week I was in Boston for the Geometry of Redistricting workshop. It was an optimistic gathering of over 500 mathematicians, computer scientists, lawyers, policy makers, teachers, and interested people of all stripes.

There was a ton of information in the talks and subsequent discussions. I’ll try to distill the main ideas and avenues for research as best I can. Unfortunately, due to how preliminary most of the technical work is, I won’t be presenting any concrete code or algorithms. Rather I’ll just give a high level sketch with as many links as I can find. If you want to skip my blabbering and watch some of the talks yourself, the morning talks were taped and publicized on YouTube.

The speakers were almost unanimous in a few central tenets, each of which I’ll expand on:

- In order for mathematics to helpful in fighting partisan gerrymandering, it needs to be in touch with the intricacies of the legal, political, and local aspects of the problem.
- The “obvious” ideas have all essentially been tried and rejected for various reasons, both mathematical and legal.
- There is a huge gray area in between what’s easy and what’s known to be hard where mathematics can help, and lots of tentative ideas are on the table.

## Grounded in People and Law

Partisan gerrymandering is the process of drawing voting district lines to help one party win more representatives. As you probably know, every 10 years the US Census Bureau tries to count everyone in the country along with basic demographic information. Then, because the US Constitution requires proportional representation in Congress based on state populations, states may gain or lose seats in Congress, and are given the power to reevaluate how they translate individual citizen votes into representatives. Part of this process is redrawing voting district lines, and when politicians take advantage of this to further a political cause, it’s called gerrymandering.

The most common kinds of gerrymandering have two forms, *packing* and *cracking,* which work in unison to give delegates of one party more representation that the global statistics of a state would suggest. *Packing* is the process of putting all of one’s opponents in the same district, so that their votes in excess of 50% are essentially wasted. In conjunction, in *cracking* one spreads out members of the opposing party in such a way that they represent a safe minority of their districts, say 45%. In this way, even a state that has an overall majority in favor of one party can result in dominant representation by the other party.

One particularly interesting example is Michigan. In 2010, state Republicans redrew district lines—as is required following every US Census. They did so in a way that is considered by many an extreme partisan gerrymander. One piece of evidence supporting this is the statistics mentioned above: while Michigan is evenly split statewide between Democrats and Republicans, Republicans took 57% of the state legislature’s House. You can see packing and cracking at work: many of the districts that voted Democratic have win-ratios in excess of 80%, while most Republican districts near those metropolitan areas eked out with leads as low as 51%. As far as I can tell, only a single Republican state senate district took significantly more than 60% of that district’s vote.

Two caveats. First: of course this is just evidence, not proof, but it should (and did) raise a red flag considering this country’s history of gerrymandering. Second: it would be unfair to single out Republicans, because Democrats are notoriously good at Gerrymandering. They’ve just been on the losing end of the battle recently.

What’s potentially worse than the fact that one party is winning is that *the dominant party gets to draw the lines.* This is a uniquely American idea that apparently horrifies visiting politicians from Europe. One obvious solution is to simply force legislatures to give up their redistricting power to independent commissions. California, along with a few other states, have done this. And other states with only one delegate, such as Montana and Delaware, don’t have this problem. But the remaining 40-some-odd states appear heavily resistant to giving up this power.

It’s obvious why incumbent politicians don’t want to relinquish their power: gerrymandering is super effective! It’s also obvious why geographic line-drawing works: people of like-minded ideology tend to live near each other. As Moon Duchin ribbed during her opening talk, “Democrats like to huddle for warmth.” More specifically, densely populated areas correlate with liberal voters. What’s more, it even correlates with density *conditioned on being in a city! *What’s more more, within a city, it correlates with how much you use public transportation! So the more data you have, the better you can gerrymander.

And, at the very least, this shows that any support or opposition for a redistricting plan needs to be heavily informed by the local, cultural, and even infrastructural details of the region in question. But even more, it shows how inherently political this process is. Politicians who don’t play the game will literally lose their jobs.

And, for the most part, courts tend to agree. In 1962 there was a landmark court case, Baker v. Carr, that crystalized this opinion (for a wonderful narrative, see the More Perfect podcast). In fact, Baker v. Carr was a case about the state’s *inaction*; the district lines in Tennessee hadn’t been redrawn since 1900, and in the mean time the population changed so that rural votes counted for about 10 urban votes. The Supreme Court ruled 6-2 against Tennessee, and laid out an opinion that said two things:

- Federal courts have the authority to hear redistricting cases on the basis of partisanship.
- However, it’s
*really*hard to tell what counts as illegal partisan gerrymandering. The court admitted that they required a “judicially discoverable and manageable standard” for resolving them.

The consequences, through this and some followup cases, were a standard essentially called “one person one vote,” which means district lines have to match up with population density. But beyond that, and some vague I-know-it-when-I-see-it notions about “compactness” and “aesthetically pleasing shapes,” the Supreme court has demurred against ruling on political gerrymandering cases for exactly this lack of a judicial standard (cf. Vieth 2004).

The supreme court doesn’t want to enter this “political thicket,” as Justice Frankfurter declared in Baker v. Carr, for authentic reasons. Foremost among them, a political issue is by definition never resolved. The losing side never thinks they had a fair shot. A justice system that hears redistricting cases (which take years to resolve) will be swamped in litigation before, during, and after every election. Some reasonably think that the only good way to end gerrymandering is for the people to hold states accountable (and/or Congress) and make them pass laws ending gerrymandering. It shouldn’t be the court’s job to make policy.

[Soapbox] In my humble and lightly-informed opinion, this is probably the right attitude. But organizing people to change such an entrenched policy requires a long, state-by-state slog over the century to win out. What’s worse is that it seems, for the average person, gerrymandering is an inherently boring topic not likely to galvanize the masses. It’s not as universal and visible as other successful movements like women’s right to vote or the Voting Rights Act, each of which took decades. Either way, via courts or state constitutional amendments, the people involved will need to lay out convincing arguments, and mathematics and computer science can certainly help. [/Soapbox]

Fast forward to today, you have Gill v. Whitford set for October 3rd, and the court is split with Justice Kennedy undecided. Kennedy, the swing vote, has had previous gerrymandering cases thrown out due to this lack of a “manageable standard,” but he’s left open the possibility that a good standard exists. Gill v. Whitford sets forth what’s called the *efficiency gap*, which may become such a standard despite its flaws.

Roughly speaking, the efficiency gap is said to measure how many votes were “wasted” on each side, via an arithmetic formula that counts both votes in excess of 50% for a winning district, and all votes cast for losing candidates. An article by Mira Bernstein and Moon Duchin (two of the four organizers of the conference) adeptly outline the deficiencies of this measure, and how it might play out in the courtroom.

## Enter math, take 1

While the judicial value of the efficiency gap remains to be seen, the mathematical parts of the workshop highlighted ways the mathematical community could be of service.

But before that, we spent some time highlighting ways that mathematics has *failed* to be of service, and the damage that does to future efforts. As one speaker put it, people came up with 60 suggested metrics for courts to use, and they were all rejected, so why would they listen to what we propose for metric number 61?

The obvious ideas one might first come up with have all been deemed to be hard or invalid. In particular, an immediate thought one would probably have is to use optimization: simply write down all the atomic units of population (census blocks), write down all the constraints, and run a solver to get the optimal district plan.

The problem is far more complicated, and one could write a long article about all the reasons why. One simple explanation is that this program is just too big. The US has 11 million census blocks (many states have around 200,000), each of which is a polygon with potentially many vertices—due to the intricacies of the geography such as rivers separating districts. As such, people have taken to representing the problem as a graph over the set of census blocks, with edges connecting adjacent nodes.

Still, almost every optimization “constraint” is soft, but it’s not clear how soft and what exceptions are permissible. For example, standard practice tries to keep towns in the same district, has to split big cities somehow, and all the while there is an unofficial mandate to keep “communities of interest” together. This last one could be people living in areas affected by local wildfires, people united by a shared language, or any one of a number of minority groups. The optimization formulation is out of touch with the guiding principles. Nobody has figured out how to represent all of the information relevant to redistricting in a way amenable to algorithmic analysis.

In this domain, rather than trying to solve a problem that is more complicated than it seems, well-modeled math must faithfully represent some set of values the court agrees with. Dr. Duchin pointed some of these out in her talk:

- Equal representation is good, as voiced by the “one person one vote” doctrine.
- Geographical division is important, i.e. bare majorities shouldn’t override the voice of communities of interest.
- Extremist agendas should not outweigh the majority.
- Elections should be competitive. The opportunity for useful alternatives should exist in areas that are equally divided.
- States should still be governable. This is why things like supermajorities are acceptable as part of the give and take of politics.

Points 2 and 3 are in direct opposition, as are 4 and 5, and each of these principles, even if you don’t agree with them, have legal standing in the US. But these principles don’t obviously translate into something like an optimization problem. How do you encode the presence of extremist agendas in an objective function? The sense I got from the workshop is that people have tried, and largely failed.

Other techniques, like evaluating the geometric shape of a district for “compactness”—a legal term long recognized by the court but with no concrete definition—run into trouble with issues like the coast of England paradox. In fact, most metrics attempting to measure non-compactness—of the kind ruled illegal in North Carolina’s 1st and 12th districts—can change by a factor of 3 based on how finely one measures geographical features. Justin Solomon covers this and many more pitfalls of the geometry of redistricting in his talk.

## Math, take 2

In addition to spending a lot of time outlining the potential pitfalls of existing techniques, the workshop presented ideas aiming to be informative.

There was much talk of metrics, but an even more fundamental problem I found appealing was that of sampling. If you’re trying to show a court that a proposed districting plan is illegally gerrymandered, it would be helpful to provide alternative plans for comparison. Even better would be a large, independent random sample of a billion plans, which you could use to draw a distribution of whatever statistic you’re interested in, and show where the proposed plan lies.

Again, extremity is not the only factor in whether a partisan gerrymander is illegal (thanks to Justice Kennedy), but it does play a role. And without a sample for comparison, “extremity” is just an opinion. Sampling was the topic of this talk by Wendy Cho and some subsequent sessions and discussions.

But the distribution of all “legal, plausibly good redistricting plans” is extremely complicated, for all of the reasons we’ve mentioned in this post. As fans of this blog will know, if you have a complicated distribution and you want to sample from it, you can use a technique called Markov Chain Monte Carlo (MCMC). Indeed, researchers have tried this (cf. the recent work of Fifield et al.) but there are some important considerations.

First, the theory of MCMC says that, if you run the algorithm for long enough, *eventually* the tail end of the sequence will be a representative sample. This is called the “mixing time,” and it’s relatively sensitive to the way the problem is set up. The theory of MCMC doesn’t say how long mixing takes, and indeed many hard problems are known to take exponential time. To complement this, problem, the existing work on using MCMC has resulted (according to experiments conveyed to me by Dr. Cho, who in fairness has a competing line of research detailed below) in almost exactly one redistricting sampled *per CPU hour*. So the second problem is that, even if you did have rapid mixing, it would take a long time and resources to come up with a good representative sample of, say, a billion possible plans.

Cho and her coauthor presented their variant of this line of research, based on genetic algorithms. In this scheme, the “optimization” part of genetic algorithms serves only to meet the constraints and desires required of all redistricting plans (keep towns together, etc.). They are able—with the aid of a supercomputer—generate many more plans and much faster. Part of the appeal of their technique is that they can turn on and off different features to optimize for at will, and dictate how much they’re trying to optimize for each feature. So they can compare a large sample of plans generated while *trying* to optimize for partisan bias to a large sample of plans generated without that objective.

On the other hand, their obstacle is a problem of purity. Genetic algorithms are about as far as you can get from a well understood theory. The particular choices required to model redistricting using genetic algorithms pretty much eliminate any hope of proving that the resulting sample is an independent random sample. This seems like a huge problem in litigation, as one could argue back and forth (as we did at the workshop) as to what features of the genetic approach made it “acceptably” random.

Indeed, maybe a sample doesn’t need to be perfectly random. Simply having alternative schemes with no discernible partisan bias might be enough for a court (who knows?). But still, I got the impression from the workshop that the more caveats one needs to explain a mathematical technique to a court, the less likely it will be accepted as legitimate. This even applies to techniques that have been accepted in mathematics for centuries.

Speaking of complicated, next we turn to Dr. Duchin’s line of work, which she succinctly describes as “measuring compactness using discrete Ricci curvature.” Ricci curvature is hard enough to describe succinctly without knowing what a smooth manifold is in technical terms. One of the many equivalent definitions is based on comparing neighborhoods of points. This lends itself nicely to a discrete analogue for graphs, an idea first put forth by Yann Ollivier. Cf. his papers on the topic, though I found them quite difficult because they lean heavily on intuition from Riemannian geometry, which I’m not very familiar with. However, the idea involves a neat concept called earth mover’s distance (which has three or four other names) which I want to write about in more detail soon. In particular, it’s part of this field of *optimal transport* which studies how to modify geometric notions (particularly measurements) to probability distributions. It’s like a “fuzzy geometry” where you don’t know exactly where things are, or exactly what shape they are, but you want to measure distance and volume keeping uncertainty in mind. Seems super useful for a field like redistricting, where the data is dirty and the lines are not absolutely certain.

Ricci curvature lends itself nicely to talking about things like “compactness” (which conveniently is a recognized concept in the courts). In particular, curvature tells you about the geometry of a manifold in ways that seem useful, but I’m not exactly sure how that connects to redistricting. Duchin et al. have been working to make that connection, and they claim it’s a more rigorous way to describe the compactness of a district. Apparently, they can translate some of the intuition from the study of geometric group theory to this problem. I eagerly await some written thing I can scour for details 🙂

Finally, the workshop had much discussion about open source GIS (geographical information systems) projects oriented around bringing redistricting planning to the public. They mentioned some really cool projects like Azavea’s DistrictBuilder, which has influenced district planning all across the country, to my surprise including my childhood home of Contra Costa County, CA. They organized a GIS hackathon for the last few days of the workshop, and they promised to mention the results of that hackathon soon.

## Next steps

For those who didn’t participate in the special sessions (which were invite only), one of which included training to be an expert witness in gerrymandering cases, there was little in the way of obvious next steps. The workshop was super informative, but at this point things are very tentative and there are a lot of open directions for research and software projects.

If you’re interested in getting involved, consider coming to one of the next workshops in the series. The upcoming October workshop in Wisconsin is open for registration, which should be particularly juicy since it will occur right after the Supreme court hears arguments on Whitford v. Gill. And there are subsequent workshops in November (Durham, North Carolina), February (Austin, Texas), and March (San Francisco, CA). I’ll likely be at the San Francisco one.

I think those with GIS experience, or those willing to learn GIS, would be quite inspired by the breadth of open problems this workshop provides. And moreover, at least a few companies that do geographic data analysis (for social causes or otherwise) were actively recruiting, so there’s that.9

Until next time!

# Boolean Logic in Polynomials

**Problem: **Express a boolean logic formula using polynomials. I.e., if an input variable is set to , that is interpreted as false, while is interpreted as true. The output of the polynomial should be 0 or 1 according to whether the formula is true or false as a whole.

**Solution:** You can do this using a single polynomial.

Illustrating with an example: the formula is also known as

not((a or b) and (not c or d))

The trick is to use multiplication for “and” and for “not.” So would be , and would be . Indeed, if you have two binary variables and then is 1 precisely when both are 1, and zero when either variable is zero. Likewise, if is zero and zero if is one.

Combine this with deMorgan’s rule to get any formula. translates to . For our example above,

Which expands to

If you plug in you get True in the original formula (because “not c or d” is False), and likewise the polynomial is

You can verify the rest work yourself, using the following table as a guide:

0, 0, 0, 0 -> 1 0, 0, 0, 1 -> 1 0, 0, 1, 0 -> 1 0, 0, 1, 1 -> 1 0, 1, 0, 0 -> 0 0, 1, 0, 1 -> 0 0, 1, 1, 0 -> 1 0, 1, 1, 1 -> 0 1, 0, 0, 0 -> 0 1, 0, 0, 1 -> 0 1, 0, 1, 0 -> 1 1, 0, 1, 1 -> 0 1, 1, 0, 0 -> 0 1, 1, 0, 1 -> 0 1, 1, 1, 0 -> 1 1, 1, 1, 1 -> 0

**Discussion: **This trick is used all over CS theory to embed boolean logic within polynomials, and it makes the name “boolean algebra” obvious, because it’s just a subset of normal algebra.

Moreover, since boolean satisfiability—the problem of algorithmically determining if a boolean formula has a satisfying assignment (a choice of variables evaluating to true)—is NP-hard, this can be used to show certain problems relating to multivariable polynomials is also hard. For example, finding roots of multivariable polynomials (even if you knew nothing about algebraic geometry) is hard because you’d run into NP-hardness by simply considering the subset of polynomials coming from boolean formulas.

Here’s a more interesting example, related to the kinds of optimization problems that show up in modern machine learning. Say you want to optimize a polynomial subject to a set of quadratic equality constraints. This is NP-hard. Here’s why.

Let be a boolean formula, and its corresponding polynomial. First, each variable used in the polynomial can be restricted to binary values via the constraint .

You can even show NP-hardness if the target function to optimize is only quadratic. As an exercise, one can express the subset sum problem as a quadratic programming problem using similar choices for the constraints. According to this writeup you even express subset sum as a quadratic program with linear constraints.

The moral of the story is simply that multivariable polynomials can encode arbitrary boolean logic.