Notes on Math and Gerrymandering

Last week I was in Boston for the Geometry of Redistricting workshop. It was an optimistic gathering of over 500 mathematicians, computer scientists, lawyers, policy makers, teachers, and interested people of all stripes.

There was a ton of information in the talks and subsequent discussions. I’ll try to distill the main ideas and avenues for research as best I can. Unfortunately, due to how preliminary most of the technical work is, I won’t be presenting any concrete code or algorithms. Rather I’ll just give a high level sketch with as many links as I can find. If you want to skip my blabbering and watch some of the talks yourself, the morning talks were taped and publicized on YouTube.

The speakers were almost unanimous in a few central tenets, each of which I’ll expand on:

  • In order for mathematics to be helpful in fighting partisan gerrymandering, it needs to be in touch with the intricacies of the legal, political, and local aspects of the problem.
  • The “obvious” ideas have all essentially been tried and rejected for various reasons, both mathematical and legal.
  • There is a huge gray area in between what’s easy and what’s known to be hard where mathematics can help, and lots of tentative ideas are on the table.

Grounded in People and Law

Partisan gerrymandering is the process of drawing voting district lines to help one party win more representatives. As you probably know, every 10 years the US Census Bureau tries to count everyone in the country along with basic demographic information. Then, because the US Constitution requires proportional representation in Congress based on state populations, states may gain or lose seats in Congress, and are given the power to reevaluate how they translate individual citizen votes into representatives. Part of this process is redrawing voting district lines, and when politicians take advantage of this to further a political cause, it’s called gerrymandering.

The most common kinds of gerrymandering take two forms, packing and cracking, which work in unison to give delegates of one party more representation than the global statistics of a state would suggest. Packing is the process of putting all of one’s opponents in the same district, so that their votes in excess of 50% are essentially wasted. In conjunction, in cracking one spreads out members of the opposing party in such a way that they represent a safe minority of their districts, say 45%. In this way, even a state that has an overall majority in favor of one party can result in dominant representation by the other party.
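To make the arithmetic concrete, here is a minimal sketch with made-up numbers (the five districts and their vote counts are hypothetical, not data from any real state). Party A holds 55% of the statewide vote, but is packed into two districts and cracked at 45% across the other three.

```python
# Hypothetical state: 5 districts of 100 voters each, two parties A and B.
# Party A's voters are packed into two districts and cracked across three.
DISTRICT_SIZE = 100
a_votes = [70, 70, 45, 45, 45]  # illustrative numbers only

# Statewide share of the vote for party A.
statewide_share_a = sum(a_votes) / (DISTRICT_SIZE * len(a_votes))

# A district is won by whoever has more than half of its votes.
seats_a = sum(1 for votes in a_votes if votes > DISTRICT_SIZE / 2)
seats_b = len(a_votes) - seats_a

print(f"A has {statewide_share_a:.0%} of the vote but wins {seats_a} of {len(a_votes)} seats")
```

With these numbers the statewide majority party ends up with a minority of the seats, which is exactly the effect described above.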

One particularly interesting example is Michigan. In 2010, state Republicans redrew district lines—as is required following every US Census. They did so in a way that is considered by many an extreme partisan gerrymander. One piece of evidence supporting this is the statistics mentioned above: while Michigan is evenly split statewide between Democrats and Republicans, Republicans took 57% of the state legislature’s House. You can see packing and cracking at work: many of the districts that voted Democratic have win-ratios in excess of 80%, while most Republican districts near those metropolitan areas eked out with leads as low as 51%. As far as I can tell, only a single Republican state senate district took significantly more than 60% of that district’s vote.

Two caveats. First: of course this is just evidence, not proof, but it should (and did) raise a red flag considering this country’s history of gerrymandering. Second: it would be unfair to single out Republicans, because Democrats are notoriously good at gerrymandering. They’ve just been on the losing end of the battle recently.

What’s potentially worse than the fact that one party is winning is that the dominant party gets to draw the lines. This is a uniquely American idea that apparently horrifies visiting politicians from Europe. One obvious solution is to simply force legislatures to give up their redistricting power to independent commissions. California, along with a few other states, has done this. And states with only one delegate, such as Montana and Delaware, don’t have this problem. But the remaining 40-some-odd states appear heavily resistant to giving up this power.

Independent commissions for US House redistricting. Green states have independent commissions, yellow states have incumbents draw district lines, gray states have only one representative.

It’s obvious why incumbent politicians don’t want to relinquish their power: gerrymandering is super effective! It’s also obvious why geographic line-drawing works: people of like-minded ideology tend to live near each other. As Moon Duchin ribbed during her opening talk, “Democrats like to huddle for warmth.” More specifically, densely populated areas correlate with liberal voters. What’s more, it even correlates with density conditioned on being in a city! What’s more more, within a city, it correlates with how much you use public transportation! So the more data you have, the better you can gerrymander.

And, at the very least, this shows that any support or opposition for a redistricting plan needs to be heavily informed by the local, cultural, and even infrastructural details of the region in question. But even more, it shows how inherently political this process is. Politicians who don’t play the game will literally lose their jobs.

And, for the most part, courts tend to agree. In 1962 there was a landmark court case, Baker v. Carr, that crystallized this opinion (for a wonderful narrative, see the More Perfect podcast). In fact, Baker v. Carr was a case about the state’s inaction; the district lines in Tennessee hadn’t been redrawn since 1900, and in the meantime the population changed so that one rural vote counted for about 10 urban votes. The Supreme Court ruled 6-2 against Tennessee, and laid out an opinion that said two things:

  1. Federal courts have the authority to hear redistricting cases on the basis of partisanship.
  2. However, it’s really hard to tell what counts as illegal partisan gerrymandering. The court admitted that they required a “judicially discoverable and manageable standard” for resolving them.

The consequences, through this and some followup cases, were a standard essentially called “one person one vote,” which means districts within a state must have roughly equal populations. But beyond that, and some vague I-know-it-when-I-see-it notions about “compactness” and “aesthetically pleasing shapes,” the Supreme Court has demurred from ruling on political gerrymandering cases for exactly this lack of a judicial standard (cf. Vieth 2004).

North Carolina’s famously gerrymandered 1st and 12th house districts, rejected by the courts on account of ugliness and “tortured” shape. Source: wunc.org

The Supreme Court doesn’t want to enter this “political thicket,” as Justice Frankfurter declared in Baker v. Carr, for legitimate reasons. Foremost among them, a political issue is by definition never resolved. The losing side never thinks they had a fair shot. A justice system that hears redistricting cases (which take years to resolve) will be swamped in litigation before, during, and after every election. Some reasonably think that the only good way to end gerrymandering is for the people to hold states (and/or Congress) accountable and make them pass laws ending gerrymandering. It shouldn’t be the court’s job to make policy.

[Soapbox] In my humble and lightly-informed opinion, this is probably the right attitude. But organizing people to change such an entrenched policy requires a long, state-by-state slog over the century to win out. What’s worse is that it seems, for the average person, gerrymandering is an inherently boring topic not likely to galvanize the masses. It’s not as universal and visible as other successful movements like women’s right to vote or the Voting Rights Act, each of which took decades. Either way, via courts or state constitutional amendments, the people involved will need to lay out convincing arguments, and mathematics and computer science can certainly help. [/Soapbox]

Fast forward to today, you have Gill v. Whitford set for October 3rd, and the court is split with Justice Kennedy undecided. Kennedy, the swing vote, has had previous gerrymandering cases thrown out due to this lack of a “manageable standard,” but he’s left open the possibility that a good standard exists. Gill v. Whitford sets forth what’s called the efficiency gap, which may become such a standard despite its flaws.

Roughly speaking, the efficiency gap is said to measure how many votes were “wasted” on each side, via an arithmetic formula that counts both votes in excess of 50% for a winning district, and all votes cast for losing candidates. An article by Mira Bernstein and Moon Duchin (two of the four organizers of the conference) adeptly outlines the deficiencies of this measure, and how it might play out in the courtroom.
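As a rough sketch of that arithmetic, the snippet below computes wasted votes for a hypothetical two-party, five-district state. The vote counts, the function name `efficiency_gap`, and the sign convention (positive means party A wasted more) are my own assumptions for illustration, and ties are ignored.

```python
def efficiency_gap(districts):
    """Return (wasted_A - wasted_B) / total votes.

    `districts` is a list of (votes_for_A, votes_for_B) pairs. A winner's
    wasted votes are those beyond half of the district's votes; all of a
    loser's votes are wasted. Ties are not handled in this sketch.
    """
    wasted_a = wasted_b = total = 0
    for a, b in districts:
        half = (a + b) / 2
        total += a + b
        if a > b:
            wasted_a += a - half  # surplus votes for the winner
            wasted_b += b         # every losing vote is wasted
        else:
            wasted_a += a
            wasted_b += b - half
    return (wasted_a - wasted_b) / total


# Hypothetical plan: A is packed into two districts and cracked in three.
plan = [(70, 30), (70, 30), (45, 55), (45, 55), (45, 55)]
gap = efficiency_gap(plan)
print(f"efficiency gap: {gap:+.2f}")
```

Here party A wastes 175 votes and party B wastes 75, out of 500 cast, so the gap is 100/500.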

Enter math, take 1

While the judicial value of the efficiency gap remains to be seen, the mathematical parts of the workshop highlighted ways the mathematical community could be of service.

But before that, we spent some time highlighting ways that mathematics has failed to be of service, and the damage that does to future efforts. As one speaker put it, people came up with 60 suggested metrics for courts to use, and they were all rejected, so why would they listen to what we propose for metric number 61?

The obvious ideas one might first come up with have all been deemed to be hard or invalid. In particular, an immediate thought one would probably have is to use optimization: simply write down all the atomic units of population (census blocks), write down all the constraints, and run a solver to get the optimal district plan.

The problem is far more complicated, and one could write a long article about all the reasons why. One simple explanation is that this program is just too big. The US has 11 million census blocks (many states have around 200,000), each of which is a polygon with potentially many vertices—due to the intricacies of the geography such as rivers separating districts. As such, people have taken to representing the problem as a graph over the set of census blocks, with edges connecting adjacent nodes.
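A minimal sketch of that graph representation, under the simplifying assumption that two blocks are adjacent exactly when their polygons share an identical side (real census geometry is messier, and the block names and coordinates here are invented):

```python
from itertools import combinations


def sides(polygon):
    """Return the polygon's sides as unordered pairs of vertices."""
    n = len(polygon)
    return {frozenset((polygon[i], polygon[(i + 1) % n])) for i in range(n)}


def adjacency_graph(blocks):
    """Build {block: set of adjacent blocks}; adjacent means sharing a side."""
    block_sides = {name: sides(poly) for name, poly in blocks.items()}
    graph = {name: set() for name in blocks}
    for a, b in combinations(blocks, 2):
        if block_sides[a] & block_sides[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Four hypothetical unit-square "census blocks" arranged in a 2x2 grid.
blocks = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(0, 1), (1, 1), (1, 2), (0, 2)],
    "D": [(1, 1), (2, 1), (2, 2), (1, 2)],
}
graph = adjacency_graph(blocks)
print(graph)
```

Blocks that touch only at a corner (A and D above) are not connected, which matches the usual convention that a district joined at a single point does not count as contiguous.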

Still, almost every optimization “constraint” is soft, but it’s not clear how soft and what exceptions are permissible. For example, standard practice tries to keep towns in the same district, has to split big cities somehow, and all the while there is an unofficial mandate to keep “communities of interest” together. This last one could be people living in areas affected by local wildfires, people united by a shared language, or any one of a number of minority groups. The optimization formulation is out of touch with the guiding principles. Nobody has figured out how to represent all of the information relevant to redistricting in a way amenable to algorithmic analysis.

In this domain, rather than trying to solve a problem that is more complicated than it seems, well-modeled math must faithfully represent some set of values the court agrees with. Dr. Duchin pointed some of these out in her talk:

  1. Equal representation is good, as voiced by the “one person one vote” doctrine.
  2. Geographical division is important, i.e. bare majorities shouldn’t override the voice of communities of interest.
  3. Extremist agendas should not outweigh the majority.
  4. Elections should be competitive. The opportunity for useful alternatives should exist in areas that are equally divided.
  5. States should still be governable. This is why things like supermajorities are acceptable as part of the give and take of politics.

Points 2 and 3 are in direct opposition, as are 4 and 5, and each of these principles, even if you don’t agree with them, has legal standing in the US. But these principles don’t obviously translate into something like an optimization problem. How do you encode the presence of extremist agendas in an objective function? The sense I got from the workshop is that people have tried, and largely failed.

Other techniques, like evaluating the geometric shape of a district for “compactness”—a legal term long recognized by the court but with no concrete definition—run into trouble with issues like the coast of England paradox. In fact, most metrics attempting to measure non-compactness—of the kind ruled illegal in North Carolina’s 1st and 12th districts—can change by a factor of 3 based on how finely one measures geographical features. Justin Solomon covers this and many more pitfalls of the geometry of redistricting in his talk.

Maryland’s district 1, with compactness measured at different resolutions. Taken from Dr. Solomon’s slides.
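To see how resolution can move a compactness score, here is a small sketch. It uses the Polsby-Popper score (4π · area / perimeter²), which is my choice of a commonly cited metric and not necessarily the one in Dr. Solomon’s slides, and it stands in for “measuring more finely” by adding Koch-style bumps to the boundary of a triangle. Both choices are assumptions made purely for illustration.

```python
import math


def area_and_perimeter(points):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    area = perimeter = 0.0
    n = len(points)
    for i in range(n):
        (x1, y1), (x2, y2) = points[i], points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area) / 2, perimeter


def polsby_popper(points):
    """4*pi*area / perimeter^2; equals 1 for a circle, smaller for wigglier shapes."""
    area, perimeter = area_and_perimeter(points)
    return 4 * math.pi * area / perimeter ** 2


def refine(points):
    """Replace the middle third of every side with a triangular bump."""
    cos60, sin60 = 0.5, math.sqrt(3) / 2
    out = []
    n = len(points)
    for i in range(n):
        (x1, y1), (x2, y2) = points[i], points[(i + 1) % n]
        dx, dy = (x2 - x1) / 3, (y2 - y1) / 3
        a = (x1 + dx, y1 + dy)
        b = (x1 + 2 * dx, y1 + 2 * dy)
        # Rotate the middle third by -60 degrees to find the tip of the bump
        # (this points outward when the vertices are listed counterclockwise).
        tip = (a[0] + dx * cos60 + dy * sin60, a[1] - dx * sin60 + dy * cos60)
        out.extend([(x1, y1), a, tip, b])
    return out


shape = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]  # counterclockwise triangle
scores = []
for level in range(4):
    scores.append(polsby_popper(shape))
    print(f"refinement {level}: {len(shape)} vertices, score {scores[-1]:.3f}")
    shape = refine(shape)
```

The area barely changes between refinements while the perimeter keeps growing, so the score keeps dropping even though it is the “same” region at a finer level of detail.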

Math, take 2

In addition to spending a lot of time outlining the potential pitfalls of existing techniques, the workshop presented ideas aiming to be informative.

There was much talk of metrics, but an even more fundamental problem I found appealing was that of sampling. If you’re trying to show a court that a proposed districting plan is illegally gerrymandered, it would be helpful to provide alternative plans for comparison. Even better would be a large, independent random sample of a billion plans, which you could use to draw a distribution of whatever statistic you’re interested in, and show where the proposed plan lies.

Again, extremity is not the only factor in whether a partisan gerrymander is illegal (thanks to Justice Kennedy), but it does play a role. And without a sample for comparison, “extremity” is just an opinion. Sampling was the topic of this talk by Wendy Cho and some subsequent sessions and discussions.

But the distribution of all “legal, plausibly good redistricting plans” is extremely complicated, for all of the reasons we’ve mentioned in this post. As fans of this blog will know, if you have a complicated distribution and you want to sample from it, you can use a technique called Markov Chain Monte Carlo (MCMC). Indeed, researchers have tried this (cf. the recent work of Fifield et al.) but there are some important considerations.

First, the theory of MCMC says that, if you run the algorithm for long enough, eventually the tail end of the sequence will be a representative sample. The time this takes is called the “mixing time,” and it’s relatively sensitive to the way the problem is set up. The theory of MCMC doesn’t say how long mixing takes, and indeed for many hard problems it is known to take exponential time. To compound this problem, the existing work on using MCMC has resulted (according to experiments conveyed to me by Dr. Cho, who in fairness has a competing line of research detailed below) in almost exactly one redistricting plan sampled per CPU hour. So the second problem is that, even if you did have rapid mixing, it would take a lot of time and resources to come up with a good representative sample of, say, a billion possible plans.
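For a feel of what such a chain looks like, here is a toy sketch. It is not the algorithm of Fifield et al.: it is a bare-bones “flip one unit” walk on a 4x4 grid with two districts, where a proposal is rejected if it disconnects a district or unbalances the sizes too much. The grid, the two-district setup, and the `max_imbalance` parameter are all assumptions for illustration, and nothing here says anything about mixing time.

```python
import random
from collections import deque


def grid_graph(n):
    """Adjacency lists for an n-by-n grid of units."""
    adj = {(r, c): [] for r in range(n) for c in range(n)}
    for (r, c) in list(adj):
        for nbr in ((r, c + 1), (r + 1, c)):
            if nbr in adj:
                adj[(r, c)].append(nbr)
                adj[nbr].append((r, c))
    return adj


def is_connected(adj, members):
    """Breadth-first search restricted to `members`."""
    members = set(members)
    if not members:
        return False
    start = next(iter(members))
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in members and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == members


def is_valid(adj, plan, max_imbalance):
    """A plan is valid if both districts are contiguous and similar in size."""
    d0 = [v for v in plan if plan[v] == 0]
    d1 = [v for v in plan if plan[v] == 1]
    if abs(len(d0) - len(d1)) > max_imbalance:
        return False
    return is_connected(adj, d0) and is_connected(adj, d1)


def flip_chain(adj, plan, steps, max_imbalance, rng):
    """Yield one plan per step; invalid proposals leave the plan unchanged."""
    plan = dict(plan)
    units = list(plan)
    for _ in range(steps):
        v = rng.choice(units)
        plan[v] = 1 - plan[v]          # propose moving v to the other district
        if not is_valid(adj, plan, max_imbalance):
            plan[v] = 1 - plan[v]      # reject: undo the flip
        yield dict(plan)


adj = grid_graph(4)
start = {(r, c): 0 if c < 2 else 1 for (r, c) in adj}  # left half vs. right half
samples = list(flip_chain(adj, start, steps=500, max_imbalance=2, rng=random.Random(0)))
distinct = {tuple(sorted(p.items())) for p in samples}
print(f"{len(samples)} steps visited {len(distinct)} distinct plans")
```

Because the proposal is symmetric and every valid proposal is accepted, the walk targets the uniform distribution over the valid plans it can reach. How long it takes to get close to that distribution is exactly the mixing-time question raised above.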

Cho and her coauthor presented their variant of this line of research, based on genetic algorithms. In this scheme, the “optimization” part of genetic algorithms serves only to meet the constraints and desires required of all redistricting plans (keep towns together, etc.). They are able, with the aid of a supercomputer, to generate many more plans and much faster. Part of the appeal of their technique is that they can turn on and off different features to optimize for at will, and dictate how much they’re trying to optimize for each feature. So they can compare a large sample of plans generated while trying to optimize for partisan bias to a large sample of plans generated without that objective.

An example of a sample of districting plans plotted in terms of some statistics. From Cho-Liu 2016

On the other hand, their obstacle is a problem of purity. Genetic algorithms are about as far as you can get from a well understood theory. The particular choices required to model redistricting using genetic algorithms pretty much eliminate any hope of proving that the resulting sample is an independent random sample. This seems like a huge problem in litigation, as one could argue back and forth (as we did at the workshop) as to what features of the genetic approach made it “acceptably” random.

Indeed, maybe a sample doesn’t need to be perfectly random. Simply having alternative schemes with no discernible partisan bias might be enough for a court (who knows?). But still, I got the impression from the workshop that the more caveats one needs to explain a mathematical technique to a court, the less likely it will be accepted as legitimate. This even applies to techniques that have been accepted in mathematics for centuries.

Speaking of complicated, next we turn to Dr. Duchin’s line of work, which she succinctly describes as “measuring compactness using discrete Ricci curvature.” Ricci curvature is hard enough to describe succinctly without knowing what a smooth manifold is in technical terms. One of the many equivalent definitions is based on comparing neighborhoods of points. This lends itself nicely to a discrete analogue for graphs, an idea first put forth by Yann Ollivier. Cf. his papers on the topic, though I found them quite difficult because they lean heavily on intuition from Riemannian geometry, which I’m not very familiar with. However, the idea involves a neat concept called earth mover’s distance (which has three or four other names) which I want to write about in more detail soon. In particular, it’s part of the field of optimal transport, which studies how to adapt geometric notions (particularly measurements) to probability distributions. It’s like a “fuzzy geometry” where you don’t know exactly where things are, or exactly what shape they are, but you want to measure distance and volume keeping uncertainty in mind. Seems super useful for a field like redistricting, where the data is dirty and the lines are not absolutely certain.
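As a taste of earth mover’s distance, here is a sketch of the simplest case I know: two piles of “dirt” on a line of equally spaced bins, where moving one unit of mass across one bin costs one. This one-dimensional shortcut is an assumption for illustration. The curvature work above compares distributions on graphs, which needs a more general optimal transport solver.

```python
from itertools import accumulate


def earth_movers_distance_1d(p, q):
    """Earth mover's distance between two histograms on bins 0, 1, ..., n-1.

    In one dimension the optimal plan just sweeps left to right: the running
    surplus of p over q after each bin is the mass that must cross to the
    next bin, so the total cost is the sum of the absolute running surpluses.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    running_surplus = accumulate(pi - qi for pi, qi in zip(p, q))
    return sum(abs(s) for s in running_surplus)


far = earth_movers_distance_1d([1, 0, 0], [0, 0, 1])          # one unit moved two bins
near = earth_movers_distance_1d([0.5, 0.5, 0], [0, 0.5, 0.5])  # shift everything by one bin
print(far, near)
```

Unlike a bin-by-bin comparison, this distance notices how far the mass has to travel, which is what makes it a geometric notion rather than a purely statistical one.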

Ricci curvature lends itself nicely to talking about things like “compactness” (which conveniently is a recognized concept in the courts). In particular, curvature tells you about the geometry of a manifold in ways that seem useful, but I’m not exactly sure how that connects to redistricting. Duchin et al. have been working to make that connection, and they claim it’s a more rigorous way to describe the compactness of a district. Apparently, they can translate some of the intuition from the study of geometric group theory to this problem. I eagerly await some written thing I can scour for details 🙂

Finally, the workshop had much discussion about open source GIS (geographical information systems) projects oriented around bringing redistricting planning to the public. They mentioned some really cool projects like Azavea’s DistrictBuilder, which has influenced district planning all across the country, to my surprise including my childhood home of Contra Costa County, CA. They organized a GIS hackathon for the last few days of the workshop, and they promised to mention the results of that hackathon soon.

Next steps

For those who didn’t participate in the special sessions (which were invite only), one of which included training to be an expert witness in gerrymandering cases, there was little in the way of obvious next steps. The workshop was super informative, but at this point things are very tentative and there are a lot of open directions for research and software projects.

If you’re interested in getting involved, consider coming to one of the next workshops in the series. The upcoming October workshop in Wisconsin is open for registration, which should be particularly juicy since it will occur right after the Supreme Court hears arguments on Gill v. Whitford. And there are subsequent workshops in November (Durham, North Carolina), February (Austin, Texas), and March (San Francisco, CA). I’ll likely be at the San Francisco one.

I think those with GIS experience, or those willing to learn GIS, would be quite inspired by the breadth of open problems this workshop provides. And moreover, at least a few companies that do geographic data analysis (for social causes or otherwise) were actively recruiting, so there’s that.

Until next time!

Boolean Logic in Polynomials

Problem: Express a boolean logic formula using polynomials. I.e., if an input variable x is set to 0, that is interpreted as false, while x=1 is interpreted as true. The output of the polynomial should be 1 or 0 according to whether the formula is true or false as a whole.

Solution: You can do this using a single polynomial.

Illustrating with an example: the formula is \neg[(a \vee b) \wedge (\neg c \vee d)] also known as

not((a or b) and (not c or d))

The trick is to use multiplication for “and” and 1-x for “not.” So a \wedge b would be ab, and \neg z would be 1-z. Indeed, if you have two binary variables x and y then xy is 1 precisely when both are 1, and zero when either variable is zero. Likewise, 1-x = 1 if x is zero and zero if x is one.

Combine this with De Morgan’s rule to get any formula. a \vee b = \neg(\neg a \wedge \neg b) translates to 1 - (1-a)(1-b). For our example above,

\displaystyle f(a, b, c, d) = 1 - (1 - (1-a)(1-b))(1 - c(1-d))

Which expands to

\displaystyle 1 - a - b + ab + (1-d)(ac + bc - abc)

If you plug in a = 1, b = 0, c = 1, d = 0 you get True in the original formula (because “not c or d” is False), and likewise the polynomial is

\displaystyle 1 - 1 - 0 + 0 + (1-0)(1 + 0 - 0) = 1

You can verify the rest work yourself, using the following table as a guide:

0, 0, 0, 0 -> 1
0, 0, 0, 1 -> 1
0, 0, 1, 0 -> 1
0, 0, 1, 1 -> 1
0, 1, 0, 0 -> 0
0, 1, 0, 1 -> 0
0, 1, 1, 0 -> 1
0, 1, 1, 1 -> 0
1, 0, 0, 0 -> 0
1, 0, 0, 1 -> 0
1, 0, 1, 0 -> 1
1, 0, 1, 1 -> 0
1, 1, 0, 0 -> 0
1, 1, 0, 1 -> 0
1, 1, 1, 0 -> 1
1, 1, 1, 1 -> 0
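If you would rather let a computer do the checking, the following sketch compares the expanded polynomial against the original formula on all 16 inputs. The function names are mine.

```python
from itertools import product


def formula(a, b, c, d):
    """The original boolean formula: not((a or b) and (not c or d))."""
    return not ((a or b) and ((not c) or d))


def polynomial(a, b, c, d):
    """The expanded polynomial from above."""
    return 1 - a - b + a * b + (1 - d) * (a * c + b * c - a * b * c)


table = {}
for a, b, c, d in product((0, 1), repeat=4):
    table[(a, b, c, d)] = polynomial(a, b, c, d)
    # The polynomial must agree with the formula on every binary input.
    assert table[(a, b, c, d)] == int(formula(a, b, c, d))
    print(f"{a}, {b}, {c}, {d} -> {table[(a, b, c, d)]}")
```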

Discussion: This trick is used all over CS theory to embed boolean logic within polynomials, and it makes the name “boolean algebra” obvious, because it’s just a subset of normal algebra.

Moreover, since boolean satisfiability, the problem of algorithmically determining if a boolean formula has a satisfying assignment (a choice of variables evaluating to true), is NP-hard, this can be used to show certain problems relating to multivariable polynomials are also hard. For example, finding roots of multivariable polynomials (even if you knew nothing about algebraic geometry) is hard because you’d run into NP-hardness by simply considering the subset of polynomials coming from boolean formulas.

Here’s a more interesting example, related to the kinds of optimization problems that show up in modern machine learning. Say you want to optimize a polynomial f(x) subject to a set of quadratic equality constraints. This is NP-hard. Here’s why.

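Let \varphi be a boolean formula, and f_\varphi its corresponding polynomial. First, each variable x_i used in the polynomial can be restricted to binary values via the constraint x_i(x_i - 1) = 0. Second, on binary inputs f_\varphi only takes the values 0 and 1, so the maximum of f_\varphi subject to these constraints is 1 if and only if \varphi has a satisfying assignment. An efficient algorithm for the optimization problem would therefore decide satisfiability.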

You can even show NP-hardness if the target function to optimize is only quadratic. As an exercise, one can express the subset sum problem as a quadratic programming problem using similar choices for the constraints. According to this writeup you can even express subset sum as a quadratic program with linear constraints.
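Here is a small sketch of the correspondence between satisfiability and constrained polynomial optimization. The helper names (`var`, `p_not`, `p_and`, `p_or`) are hypothetical, and the “solver” is a brute-force search over the points satisfying x_i(x_i - 1) = 0, so it only illustrates the reduction and says nothing about efficiency.

```python
from itertools import product


def var(i):
    """The polynomial x_i, represented as a function of the point x."""
    return lambda x: x[i]


def p_not(p):
    return lambda x: 1 - p(x)


def p_and(p, q):
    return lambda x: p(x) * q(x)


def p_or(p, q):
    # De Morgan: p or q = not(not p and not q)
    return p_not(p_and(p_not(p), p_not(q)))


def max_over_binary_points(poly, num_vars):
    """Maximize poly over the points satisfying x_i * (x_i - 1) = 0 for all i."""
    return max(poly(x) for x in product((0, 1), repeat=num_vars))


satisfiable = p_or(var(0), p_not(var(1)))     # x0 or not x1
unsatisfiable = p_and(var(0), p_not(var(0)))  # x0 and not x0

print(max_over_binary_points(satisfiable, 2))    # the maximum is 1
print(max_over_binary_points(unsatisfiable, 1))  # the maximum is 0
```

The constrained maximum is 1 exactly when the formula has a satisfying assignment, so anything that solves the optimization problem in general also solves satisfiability.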

The moral of the story is simply that multivariable polynomials can encode arbitrary boolean logic.

Mathematical Genealogy

As a fun side project to distract me from my abysmal progress on my book, I decided to play around with the math genealogy graph!

For those who don’t know, since 1996, mathematicians, starting with the labor of Harry Coonce et al, have been managing a database of all mathematicians. More specifically, they’ve been keeping track of who everyone’s thesis advisors and subsequent students were. The result is a directed graph (with a current estimated 200k nodes) that details the scientific lineage of mathematicians.

Anyone can view the database online and explore the graph by hand. In it are legends like Gauss, Euler, and Noether, along with the sizes of their descendant subtrees. Here’s little ol’ me.

It’s fun to look at who is in your math genealogy, and I’ve spent more than a few minutes clicking until I get to the top of a tree (since a person can have multiple advisors, finding the top is time consuming), like the sort of random walk that inspired Google’s PageRank and Wikipedia link clicking games.

Inspired by a personalized demo by Colin Wright, I decided it would be fun to scrape the website, get a snapshot of the database, and then visualize and play with the graph. So I did.

Here’s a github repository with the raw data and scraping script. It includes a full json dump of what I scraped as of a few days ago. It’s only ~60MB.

Then, using a combination of tools, I built a rudimentary visualizer. Go play with it!

A subset of my mathematical family tree.

A few notes:

  1. It takes about 15 seconds to load before you can start playing. During this time, it loads a compressed version of the database into memory (starting from a mere 5MB). Then it converts the data into a more useful format, builds a rudimentary search index of the names, and displays the ancestors for Gauss.
  2. The search index is the main bloat of the program, requiring about a gigabyte of memory to represent. Note that because I’m too lazy to set up a proper server and elasticsearch index, everything in this demo is in Javascript running in your browser. Here’s the github repo for that code.
  3. You can drag and zoom the graph.
  4. There was a fun little bit of graph algorithms involved in this project, such as finding the closest common ancestor of two nodes. This is happening in a general digraph, not necessarily a tree, so there are some extra considerations. I isolated all the graph algorithms to one file.
  5. People with even relatively few descendants generate really wide graphs. This is because each node in the directed graph is assigned to a layer, and the potentially 100+ grandchildren of a single node will be laid out in the same layer. I haven’t figured out how to constrain the width of the rendered graph (anyone used dagre/dagre-d3?), nor did I try very hard.
  6. The dagre layout package used here is a port of the graphviz library. It uses linear programming and the simplex algorithm to determine an optimal layout that penalizes crossed edges and edges that span multiple layers, among other things. Linear programming strikes again! For more details on this, see this paper outlining the algorithm.
  7. The scraping algorithm was my first time using Python 3’s asyncio features. The concepts of asynchronous programming are not strange to me, but somehow the syntax of this module is.
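To illustrate the closest-common-ancestor point from the notes above, here is a minimal sketch in Python (the demo itself is Javascript, and this is not the code from the repository). It assumes one reasonable definition of “closest”: the common ancestor minimizing the total number of advisor steps from the two people, with a person counting as their own ancestor. The names in the toy graph are invented.

```python
from collections import deque


def ancestor_distances(advisors, start):
    """Breadth-first search up the advisor edges; returns {ancestor: steps}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for advisor in advisors.get(person, ()):
            if advisor not in dist:  # a person may be reachable by several paths
                dist[advisor] = dist[person] + 1
                queue.append(advisor)
    return dist


def closest_common_ancestor(advisors, a, b):
    """Common ancestor minimizing dist(a) + dist(b), or None if there is none."""
    dist_a = ancestor_distances(advisors, a)
    dist_b = ancestor_distances(advisors, b)
    common = dist_a.keys() & dist_b.keys()
    if not common:
        return None
    # Break ties by name so the result is deterministic.
    return min(common, key=lambda v: (dist_a[v] + dist_b[v], v))


# Hypothetical genealogy: person -> list of advisors (multiple advisors allowed).
advisors = {
    "dana": ["carol", "bob"],
    "erin": ["carol"],
    "carol": ["alice"],
    "bob": ["alice"],
    "frank": [],
}
print(closest_common_ancestor(advisors, "dana", "erin"))
```

Because people can have multiple advisors, the search tracks shortest distances rather than a single parent pointer, which is the main difference from the usual tree version of the problem.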

Feature requests, bugs, or ideas? Open an issue on Github or feel free to contribute a pull request! Enjoy.

Duality for the SVM

This post is a sequel to Formulating the Support Vector Machine Optimization Problem.

The Karush-Kuhn-Tucker theorem

Generic optimization problems are hard to solve efficiently. However, optimization problems whose objective and constraints have special structure often succumb to analytic simplifications. For example, if you want to optimize a linear function subject to linear equality constraints, one can compute the Lagrangian of the system and find the zeros of its gradient. More generally, optimizing a linear function subject to linear equality and inequality constraints can be solved using various so-called “linear programming” techniques, such as the simplex algorithm.

However, when the objective is not linear, as is the case with SVM, things get harder. Likewise, if the constraints don’t form a convex set you’re (usually) out of luck from the standpoint of analysis. You have to revert to numerical techniques and cross your fingers. Note that the set of points satisfying a collection of linear inequalities forms a convex set, provided they can all be satisfied.

We are in luck. The SVM problem can be expressed as a so-called “convex quadratic” optimization problem, meaning the objective is a quadratic function and the constraints form a convex set (are linear inequalities and equalities). There is a neat theorem that addresses such problems, and it’s the “convex quadratic” generalization of the Lagrangian method. The result is due to Karush, Kuhn, and Tucker (dubbed the KKT theorem), but we will state a more specific case that is directly applicable to SVM.

Theorem [Karush 1939, Kuhn-Tucker 1951]: Suppose you have an optimization problem in \mathbb{R}^n of the following form:

\displaystyle \min f(x), \text{ subject to } g_i(x) \leq 0, i = 1, \dots, m

Where f is a differentiable function of the input variables x and g_1, \dots, g_m are affine (degree-1 polynomials). Suppose z is a local minimum of f. Then there exist constants (called KKT or Lagrange multipliers) \alpha_1, \dots, \alpha_m such that the following are true. Note the parenthetical labels contain many intentionally undefined terms.

  1. - \nabla f(z) = \sum_{i=1}^m \alpha_i \nabla g_i(z) (gradient of Lagrangian is zero)
  2. g_i(z) \leq 0 for all i = 1, \dots, m (primal constraints are satisfied)
  3. \alpha_i \geq 0 for all i = 1, \dots, m (dual constraints are satisfied)
  4. \alpha_i g_i(z) = 0 for all i = 1, \dots, m (complementary slackness conditions)
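To see the four conditions in action, here is a sketch on a tiny problem of my own choosing (not the SVM problem): minimize f(x, y) = (x-2)^2 + (y-2)^2 subject to g_1 = x + y - 2 \leq 0 and g_2 = -x \leq 0. The minimizer is z = (1, 1), the closest point of the feasible region to (2, 2), and the multipliers \alpha = (2, 0) were worked out by hand.

```python
def grad_f(z):
    """Gradient of f(x, y) = (x - 2)^2 + (y - 2)^2."""
    x, y = z
    return (2 * (x - 2), 2 * (y - 2))


# Each affine constraint g_i(z) <= 0 is stored with its (constant) gradient.
constraints = [
    (lambda z: z[0] + z[1] - 2, (1, 1)),  # g_1: x + y <= 2
    (lambda z: -z[0], (-1, 0)),           # g_2: x >= 0
]


def kkt_report(z, alphas, tol=1e-9):
    """Check the four KKT conditions at the point z with multipliers alphas."""
    gf = grad_f(z)
    combo = [
        sum(a * grad[k] for a, (_, grad) in zip(alphas, constraints))
        for k in range(len(z))
    ]
    return {
        "stationarity": all(abs(-gf[k] - combo[k]) <= tol for k in range(len(z))),
        "primal feasibility": all(g(z) <= tol for g, _ in constraints),
        "dual feasibility": all(a >= -tol for a in alphas),
        "complementary slackness": all(
            abs(a * g(z)) <= tol for a, (g, _) in zip(alphas, constraints)
        ),
    }


report = kkt_report(z=(1, 1), alphas=(2, 0))
wrong_multipliers = kkt_report(z=(1, 1), alphas=(1, 0))
print(report)
print(wrong_multipliers)
```

Note how the second constraint is slack at z (g_2(z) = -1), so its multiplier is forced to zero, while the first constraint is tight and carries a positive multiplier.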

We’ll discuss momentarily how to interpret these conditions, but first a few asides. A large chunk of the work in SVMs is converting the original, geometric problem statement, that of maximizing the margin of a linear separator, into a form suitable for this theorem. We did that last time. However, the conditions of this theorem also provide the structure for a more analytic algorithm, the Sequential Minimal Optimization algorithm, which allows us to avoid numerical methods. We’ll see how this works explicitly next time when we implement SMO.

You may recall that for the basic Lagrangian, each constraint in the optimization problem corresponds to one Lagrange multiplier, and hence one term of the Lagrangian. Here it’s largely the same: each constraint in the SVM problem (and hence each training point) corresponds to a KKT multiplier. But in addition, each KKT multiplier corresponds to a constraint for a new optimization problem that this theorem implicitly defines (called the dual problem). So the pseudocode of the Sequential Minimal Optimization algorithm is to start with some arbitrary separating hyperplane w, and find any training point x_j that corresponds to a violated constraint. Fix w so it works for x_j, and repeat until you can’t find any more violated constraints.

Now to interpret those four conditions. The difficulty in this part of the discussion is in the notion of primal/dual problems. The “original” optimization problem is often called the “primal” problem. While a “primal problem” can be either a minimization or a maximization (and there is a corresponding KKT theorem for each) we’ll use the one of the form:

\displaystyle \min f(x), \text{ subject to } g_i(x) \leq 0, i = 1, \dots, m

Next we define a corresponding “dual” optimization problem, which is a maximization problem whose objective and constraints are related to the primal in a standard, but tedious-to-write-down way. In general, this dual maximization problem has the guarantee that its optimal solution (a max) is a lower bound on the optimal solution for the primal (a min). This can be useful in many settings. In the most pleasant settings, including SVM, you get an even stronger guarantee, that the optimal solutions for the primal and dual problems have equal objective value. That is, the bound that the dual objective provides on the primal optimum is tight. In that case, the primal and dual are two equivalent perspectives on the same problem. Solving the dual provides a solution to the primal, and vice versa.

The KKT theorem implicitly defines a dual problem, which can only possibly be clear from the statement of the theorem if you’re intimately familiar with duals and Lagrangians already. This dual problem has variables \alpha = (\alpha_1, \dots, \alpha_m), one entry for each constraint of the primal. For KKT, the dual constraints are simply non-negativity of the variables

\displaystyle \alpha_j \geq 0 \text{ for all } j

And the objective for the dual is this nasty beast

\displaystyle d(\alpha) = \inf_{x} L(x, \alpha)

where L(x, \alpha) is the generalized Lagrangian (which is simpler in this writeup because the primal has no equality constraints), defined as:

\displaystyle L(x, \alpha) = f(x) + \sum_{i=1}^m \alpha_i g_i(x)
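To make the primal/dual relationship concrete, here is a minimal sketch on a toy problem that has nothing to do with SVMs: minimize f(x) = x^2 subject to g(x) = 1 - x \leq 0. The problem, the grids, and all function names are assumptions chosen for illustration, and the infimum over x is only approximated by a minimum over a finite grid.

```python
# Toy primal (an assumed example): minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.

def f(x):
    return x * x

def g(x):
    return 1.0 - x

def lagrangian(x, alpha):
    # Generalized Lagrangian L(x, alpha) = f(x) + alpha * g(x), with one multiplier.
    return f(x) + alpha * g(x)

# Finite grids standing in for "all x" and "all alpha >= 0".
xs = [i / 100 for i in range(-500, 501)]   # x in [-5, 5], step 0.01
alphas = [k / 10 for k in range(0, 41)]    # alpha in [0, 4], step 0.1

def dual_objective(alpha):
    # d(alpha) = inf_x L(x, alpha), approximated by a minimum over the grid.
    return min(lagrangian(x, alpha) for x in xs)

# Primal optimum: smallest f(x) among the feasible grid points.
primal_x = min((x for x in xs if g(x) <= 0), key=f)
primal_value = f(primal_x)

# Dual optimum: largest d(alpha) among the non-negative alphas on the grid.
dual_alpha = max(alphas, key=dual_objective)
dual_value = dual_objective(dual_alpha)

print("primal optimum:", primal_x, primal_value)
print("dual optimum:  ", dual_alpha, dual_value)
print("alpha * g(x) at the optimum:", dual_alpha * g(primal_x))
```

Every value of d(\alpha) on the grid is at most the primal optimum, which is the lower-bound guarantee in action. On this particular toy problem the two optimal values agree, so the bound is tight, and the product \alpha g(x) at the optimum is zero.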

While a proper discussion of primality and duality could fill a book, we’ll have to leave it at that. If you want to journey deeper into this rabbit hole, these notes give a great introduction from the perspective of the classical Lagrangian, without any scarring.

But we can begin to see why the KKT conditions are the way they are. The first requires that the generalized Lagrangian have gradient zero. Just like with classical Lagrangians, this is the condition one expects at a local minimum of the primal objective subject to its constraints. The second requires the constraints of the primal problem to be satisfied. The third does the same for the dual constraints. The fourth is the interesting one, because it says that at an optimal solution, the primal and dual constraints are intertwined.

4. \alpha_i g_i(z) = 0 for all i = 1, \dots, m (complementary slackness conditions)

More specifically, these “complementary slackness” conditions require that for each i, either the dual constraint \alpha_i \geq 0 is actually tight (\alpha_i = 0), or else the primal constraint g_i is tight. At least one of the two must be exactly at the limit (equal to zero, not strictly less than). The “product equals zero means one factor is zero” trick comes in handy here to express an OR, despite haunting generations of elementary algebra students. In terms of the SVM problem, complementary slackness translates to the fact that, for the optimal separating hyperplane w, if a data point doesn’t have functional margin exactly 1, then that data point isn’t a support vector. Indeed, when \alpha_i = 0 we’ll see in the next section how that affects the corresponding training point x_i.

The nitty gritty for SVM

Now that we’ve recast the SVM into a form suitable for the KKT theorem, let’s compute the dual and understand how these dual constraints are related to the optimal solution of the primal SVM problem.

The primal problem statement is

\displaystyle \min_{w} \frac{1}{2} \| w \|^2

Subject to the constraints that all m training points x_1, \dots, x_m with training labels y_1, \dots, y_m satisfy

\displaystyle (\langle w, x_i \rangle + b) \cdot y_i \geq 1

Which we can rewrite as

\displaystyle 1 - (\langle w, x_i \rangle + b) \cdot y_i \leq 0

The generalized Lagrangian is

\displaystyle \begin{aligned}  L(w, b, \alpha) &= \frac{1}{2} \| w \|^2 + \sum_{j=1}^m \alpha_j(1-y_j \cdot (\langle w, x_j \rangle + b)) \\ &= \frac{1}{2} \| w \|^2 + \sum_{j=1}^m \alpha_j - \sum_{j=1}^m \alpha_j y_j \cdot \langle w, x_j \rangle - \sum_{j=1}^m \alpha_j y_j b \end{aligned}

We can compute each component of the gradient \nabla L, indexed by the variables w_i, b, and \alpha_j. First, since this simplifies the Lagrangian a bit, we compute \frac{\partial L}{\partial b}.

\displaystyle \frac{\partial L}{\partial b} = -\sum_{j=1}^m y_j \alpha_j

The condition that the gradient is zero implies this entry is zero, i.e. \sum_{j=1}^m y_j \alpha_j = 0. In particular, and this will be a helpful reminder for next post, we could add this constraint to the dual problem formulation without changing the optimal solution, allowing us to remove the term b \sum_{j=1}^m y_j \alpha_j from the Lagrangian since it’s zero. We will use this reminder again when we implement the Sequential Minimal Optimization algorithm next time.

Next, the individual components w_i of w.

\displaystyle \frac{\partial L}{\partial w_i} = w_i - \sum_{j=1}^m \alpha_j y_j x_{j,i}

Note that x_{j,i} is the i-th component of the j-th training point x_j, since this is the only part of the expression w \cdot x_j that involves w_i.

Setting all these equal to zero means we require w = \sum_{j=1}^m \alpha_j y_j x_j. This is interesting! The optimality criterion, that the gradient of the Lagrangian must be zero, actually shows us how to write the optimal solution w in terms of the Lagrange multipliers \alpha_j and the training data/labels. It also hints at the fact that, because of this complementary slackness condition, many of the \alpha_i will turn out to be zero, and hence the optimal solution can be written as a sparse sum of the training examples.
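As a small sanity check of this formula, here is a sketch on a made-up three-point dataset in the plane. Both the data and the multiplier values are assumptions: the multipliers were worked out by hand for this particular dataset, not produced by any solver, and the function name is hypothetical.

```python
# Hypothetical toy data: two points that end up on the margin and one far from it.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]

# Hand-derived KKT multipliers for this dataset (an assumption of the sketch).
alpha = [0.25, 0.25, 0.0]

def weights_from_multipliers(alpha, X, y):
    # w = sum_j alpha_j * y_j * x_j, computed one coordinate at a time.
    dim = len(X[0])
    return [sum(a * label * x[i] for a, label, x in zip(alpha, y, X))
            for i in range(dim)]

w = weights_from_multipliers(alpha, X, y)
nonzero = [j for j, a in enumerate(alpha) if a > 0]

print("w =", w)
print("training points with nonzero multiplier:", nonzero)
```

Only the first two points contribute to w. The third point has multiplier zero, so it could be deleted from the sum (and from the dataset) without changing the hyperplane.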

And, now that we have written w in terms of the \alpha_j, we can eliminate w in the formula for the Lagrangian and get a dual optimization objective only in terms of the \alpha_j. Substituting (and combining the resulting two double sums whose coefficients are \frac{1}{2} and -1), we get

\displaystyle L(\alpha) = \sum_{j=1}^m \alpha_j - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
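For readers who want the intermediate step, the substitution can be sketched as follows. Plugging w = \sum_{j=1}^m \alpha_j y_j x_j into the squared norm and into the sum involving \langle w, x_j \rangle produces the same double sum both times:

\displaystyle \| w \|^2 = \langle w, w \rangle = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle, \qquad \sum_{j=1}^m \alpha_j y_j \langle w, x_j \rangle = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle

The term involving b vanishes because \sum_{j=1}^m y_j \alpha_j = 0, so what remains is

\displaystyle \begin{aligned} L(\alpha) &= \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{j=1}^m \alpha_j - \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\ &= \sum_{j=1}^m \alpha_j - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \end{aligned}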

Again foreshadowing, the fact that this form only depends on the inner products of the training points will allow us to replace the standard (linear) inner product for a nonlinear “inner-product-like” function, called a kernel, that will allow us to introduce nonlinearity into the decision boundary.

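Now back to differentiating the Lagrangian. For the remaining entries of the gradient, where the variable is a KKT multiplier, the partial derivative is exactly the corresponding primal constraint function, so requiring it to be non-positive coincides with the requirement that the constraints of the primal are satisfied: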

\displaystyle \frac{\partial L}{\partial \alpha_j} = 1 - y_j (\langle w, x_j \rangle + b) \leq 0

Next, the KKT theorem says that one needs feasibility of the dual:

\displaystyle \alpha_j \geq 0 \text{ for all } j

And finally the complementary slackness conditions,

\displaystyle \alpha_j (1 - y_j (\langle w, x_j \rangle + b)) = 0 \text{ for all } j = 1, \dots, m
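To see all of these conditions at once, here is a minimal sketch that checks them numerically for a candidate solution on a made-up dataset. The data, the candidate (w, b, \alpha), and the helper names are all assumptions for illustration. The candidate was derived by hand for this dataset.

```python
# Hypothetical toy data and a hand-derived candidate solution (assumptions).
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]
w, b = (0.5, 0.5), 0.0
alpha = [0.25, 0.25, 0.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kkt_report(w, b, alpha, X, y, tol=1e-9):
    # g_j = 1 - y_j (<w, x_j> + b) is the j-th primal constraint function.
    g = [1.0 - yj * (dot(w, xj) + b) for xj, yj in zip(X, y)]
    # Partial derivatives of the Lagrangian with respect to each w_i and to b.
    grad_w = [w[i] - sum(a * yj * xj[i] for a, yj, xj in zip(alpha, y, X))
              for i in range(len(w))]
    grad_b = -sum(a * yj for a, yj in zip(alpha, y))
    return {
        "gradient_zero": all(abs(c) <= tol for c in grad_w) and abs(grad_b) <= tol,
        "primal_feasible": all(gj <= tol for gj in g),
        "dual_feasible": all(a >= -tol for a in alpha),
        "complementary_slackness": all(abs(a * gj) <= tol
                                       for a, gj in zip(alpha, g)),
    }

report = kkt_report(w, b, alpha, X, y)
for name, ok in report.items():
    print(name, ok)
```

The third training point has functional margin 3 rather than 1, so complementary slackness forces its multiplier to be zero. Giving it a positive multiplier, say 0.1, makes the complementary slackness check fail.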

To be completely clear, the dual problem for the SVM is to maximize the infimum of the generalized Lagrangian over the primal variables:

\displaystyle \max_{\alpha} (\inf_{w, b} L(w, b, \alpha))

subject to the non-negativity constraints:

\displaystyle \alpha_i \geq 0

And the one (superfluous reminder) equality constraint

\displaystyle \sum_{j=1}^m y_j \alpha_j = 0

All of the equality constraints above (the gradient of the Lagrangian being zero, complementary slackness, and this reminder constraint) are consequences of the KKT theorem.
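As a rough illustration of the dual problem in its final form, the sketch below evaluates the dual objective (written in terms of inner products only) at two feasible points of a made-up dataset, and compares it with the primal value \frac{1}{2} \| w \|^2. The data, the hand-derived multipliers, and the function names are assumptions. This is not an optimizer, only an evaluation of the objective.

```python
# Hypothetical toy data (an assumption of this sketch).
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha, X, y):
    # sum_j alpha_j - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j>
    m = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(m) for j in range(m))
    return sum(alpha) - 0.5 * quad

def is_dual_feasible(alpha, y, tol=1e-9):
    # Non-negativity plus the equality constraint sum_j y_j alpha_j = 0.
    balanced = abs(sum(a * yj for a, yj in zip(alpha, y))) <= tol
    return all(a >= 0 for a in alpha) and balanced

# Primal value 1/2 ||w||^2 at the hand-derived hyperplane w = (1/2, 1/2).
primal_value = 0.5 * dot((0.5, 0.5), (0.5, 0.5))

candidates = {
    "hand-derived optimum": [0.25, 0.25, 0.0],
    "another feasible point": [0.1, 0.1, 0.0],
}
for name, alpha in candidates.items():
    print(name, is_dual_feasible(alpha, y), dual_objective(alpha, X, y))
print("primal value:", primal_value)
```

The hand-derived multipliers reach the primal value, while the other feasible point gives a strictly smaller dual objective, consistent with the dual being a lower bound that is tight at the optimum.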

At this point, we’re ready to derive and implement the Sequential Minimal Optimization Algorithm and run it on some data. We’ll do that next time.