AMS Network Science Mathematical Research Community

I don’t usually write promotional posts because I don’t enjoy reading them as much as I enjoy reading the technical posts. But I know that a lot of early graduate students and undergraduates read my blog, and this would be of interest to many of them.

I just got back from Utah yesterday where I attended a 5-day workshop run by the American Mathematical Society, called the Network Science Mathematical Research Community (MRC).

The point of the program is to bring graduate students and early career folks together from all over the country to start new collaborations. The AMS runs multiple MRC sessions every year, and this year the topics ranged from network science to quantum physics. We had a group of about 20 people, including statisticians, applied mathematicians, computer scientists, and a handful of pure combinatorialists. We self-organized into groups of four, and spent pretty much all day for the next four days eating great food, thinking about problems, proving theorems, enjoying the view, and discussing our ideas with the three extremely smart, successful, and amicable organizers. There were also career panels every evening that were, in my opinion, better than the average career panel.

The network science group (you can see me peeking out from the back, just left of center).

Anyway, it was a really fun and valuable experience, and the AMS pays for everything and a bag of chips (if by chips you mean more travel money to meet up with your collaborators and a ticket to the AMS Joint Mathematics Meeting the following January). I’m excited to blog about the work that comes out of this, as network science is right up there with the coolest of topics in math and programming.

So if you’re eligible, keep an eye out for next year’s program.

Three Years Old, and an Idea for a Podcast

Happy birthday, Math ∩ Programming!

Today marks the end of the third year I’ve been writing Math ∩ Programming, and I’m excited to keep it going as I start my research career. In the last year I’ve started a secondary writing blog for some smaller, less technical bits, mostly to get thoughts out of my head. And while I could use this anniversary post to preview future Math ∩ Programming posts or review old favorites, I’ll instead share an idea that has been bouncing around my head for a few weeks. I’d love to hear your feedback in the comments.

I listen to podcasts and radio shows a lot, mostly storytelling and interviews. And they’re always bringing on these fancy-sounding people who write books on the New York Times Bestsellers list and who often have very interesting things to say. When discussing science they can often convey the ideas to the clueless listener, usually because it’s experimental science that’s naturally easy to understand (state the setup, state the results, hypothesize about the implications). But almost universally there’s nothing substantive about math. All the mathematical content is popular math, how beautiful is \pi and such; math education, which I love to read and talk about but is common; or math history, which I’m not as interested in. And when there is some breakthrough, like Grigori Perelman solving a Millennium Prize problem, the focus is entirely on the person and not the achievement. This isn’t specific to podcasts, but to news in general. I just happen to prefer my news in podcast form.

And so, aside from the myriad of excellent technical blogs by active researchers, what is there really that conveys the excitement I experience in theoretical computer science? There are publications like the ACM SIGACT monthly newsletter, which has a ton of book reviews and a handful of technical columns. Unfortunately it’s hidden behind a paywall, which effectively excludes anyone not already embedded deep in academia. That being said, it often has really interesting pieces, like a poll by Bill Gasarch (2002, 2012) of researchers and their opinions on P vs NP. It’s really interesting to see just how much people differ on their desire to see other parts of mathematics incorporated into its resolution.

So if you don’t want to pay the ACM for a monthly newsletter, what can you do? Many of these ideas and opinions don’t exist in textbooks, and textbooks can be dry and bad at conveying why things are interesting or exciting. There are abstruse technical papers that you can hardly parse without first finishing a graduate degree. And then there are talks, which vary in quality almost as much as the prose in technical papers does.

I recently came across a paper by Ryan Williams, a prominent researcher in circuit complexity. Roughly, when you study circuit complexity you try to understand which problems provably require big circuits to solve, and you study those proof techniques. It sounds boring but it’s interesting for three reasons: it’s extremely hard, there are many “embarrassing” open problems, and many of these problems imply wonderful things like P \neq NP. I actually get really excited by circuit complexity.

Anyway, this paper was titled “A Casual Tour Around a Circuit Complexity Bound,” in which Ryan reflects on the path which led him to one of the biggest breakthroughs in the last five years in circuit complexity. His writing is more or less informal (it was published in the SIGACT newsletter, though I had to access it through arXiv), and it focuses heavily on the big picture. It struck me as mostly being about how to think about circuit complexity. This kind of thing is truly invaluable for a graduate student and, I imagine, for anyone trying to learn more about circuit complexity. Honestly, I’d love to see more of this in academic literature. Often papers are expanded from relatively simple principles into a mess of technical details, and reversing this process is slow and difficult.

But even besides these huge breakthroughs there are often really great ways to explain new problems and solutions. For example, this paper of Andrew Drucker, titled “High-Confidence Predictions under Adversarial Uncertainty,” starts with a really easy to understand setup:

A frog wants to cross the road at some fixed location, to get to a nice pond. But she is concerned about cars. It takes her a minute to cross the road, and if a car passes during that time, she will be squashed. However, this is no ordinary frog. She is extremely patient, and happy to wait any finite number of steps to cross the road. What’s more, she can observe and remember how many cars have passed, as well as when they passed. She can follow any algorithm to determine when to cross the road based on what she has seen so far, although her senses aren’t keen enough to detect a car before it arrives…

[Even if we assume the cars arrive according to a fixed probability distribution,] the frog may not have a detailed idea of how the cars are generated. It may be that the frog merely knows or conjectures some constraint obeyed by the car-stream. We then ask whether there exists a strategy which gets the frog safely across the road (at least, with sufficiently high probability), for any car-stream obeying the constraint.

This kind of story is better than coffee at keeping people awake during talks!

And so, I have been thinking a lot about what a podcast about theoretical computer science might entail. I imagine it going something like this: every episode is a half-hour conversation with a prominent researcher. The discussion would cover something about past work, something about future ideas of what’s important and a high level idea of the burgeoning techniques, and overarching questions about how one approaches research. Computer science is particularly interesting because most graduate students know enough to start working on open problems in their first year (so the topics are more accessible than, say, algebraic geometry), and because basically all of the theorems with names are named after people still active in the research community. Moreover the format of a podcast would require the interviewees to phrase their research in a way that doesn’t require a chalkboard or notation.

The hope is to popularize theoretical computer science by assuming some modest level of technical proficiency and to give access to anyone who wants to listen; say, an advanced undergraduate, an early graduate student, or a redditor who likes to argue about the Turing test. Moreover, if I were to actually run such a podcast, I could fill the listener in with additional details in a preface. The podcast could be a resource for undergraduates who want to explore the landscape of research topics before applying to graduate school, for graduate students who want to learn to think like a researcher or hear the variety of views out there, for researchers to advertise their favorite topics and get people to read/cite their papers, and for me to meet and interact with all of these great people. My advisor seems to know everyone and their lemmas, and more and more I’ve been finding myself wanting this as well (I’m just so terrible with names!). I’ve had enough conversations with researchers to know they have heaps of interesting things to say. And I travel enough to conferences and workshops. I imagine it wouldn’t be too hard to orchestrate a 30-minute conversation with one or two people. I’d just have to work up the courage to ask :)

What do you think of the idea? Would you listen to a theory podcast? Do you have a good idea for a catchy name? Are you a theory researcher who would like to have a conversation with me the next time we’re at a conference together? *fingers crossed* I’d love to hear from you.

Linear Programming and the Most Affordable Healthy Diet — Part 1

Optimization is by far one of the richest ways to apply computer science and mathematics to the real world. Everybody is looking to optimize something: companies want to maximize profits, factories want to maximize efficiency, investors want to minimize risk, the list just goes on and on. The mathematical tools for optimization are also some of the richest mathematical techniques. They form the cornerstone of an entire industry known as operations research, and advances in this field literally change the world.

The mathematical field is called combinatorial optimization, and the name comes from the goal of finding optimal solutions more efficiently than an exhaustive search through every possibility. This post will introduce the most central problem in all of combinatorial optimization, known as the linear program. Even better, we know how to efficiently solve linear programs, so in future posts we’ll write a program that computes the most affordable diet while meeting the recommended health standard.

Generalizing a Specific Linear Program

Most optimization problems have two parts: an objective function, the thing we want to maximize or minimize, and constraints, rules we must abide by to ensure we get a valid solution. As a simple example you may want to minimize the amount of time you spend doing your taxes (objective function), but you certainly can’t spend a negative amount of time on them (a constraint).

The following more complicated example is the centerpiece of this post. Most people want to minimize the amount of money spent on food. At the same time, one needs to maintain a certain level of nutrition. For males ages 19-30, the United States National Institutes of Health recommends 3.7 liters of water per day, 1,000 milligrams of calcium per day, 90 milligrams of vitamin C per day, etc.

We can set up this nutrition problem mathematically, just using a few toy variables. Say we had the option to buy some combination of oranges, milk, and broccoli. Some rough estimates [1] give the following content/costs of these foods. For 0.272 USD you can get 100 grams of orange, containing a total of 40mg of calcium, 53.2mg of vitamin C, and 87g of water. For 0.100 USD you can get 100 grams of whole milk, containing 276mg of calcium, 0mg of vitamin C, and 87g of water. Finally, for 0.381 USD you can get 100 grams of broccoli containing 47mg of calcium, 89.2mg of vitamin C, and 91g of water. Here’s a table summarizing this information:

Nutritional content and prices for 100g of three foods

Food         calcium(mg)     vitamin C(mg)      water(g)   price(USD/100g)
Broccoli     47              89.2               91         0.381
Whole milk   276             0                  87         0.100
Oranges      40              53.2               87         0.272

Some observations: broccoli is the most expensive but has the most water and vitamin C, whole milk doesn’t have any vitamin C but gets a ton of calcium for really cheap, and oranges are somewhere in between. So you could probably tinker with the quantities and figure out what the cheapest healthy diet is. The problem is what happens when we incorporate hundreds or thousands of food items and tens of nutrient recommendations. This simple example is just to help us build up a nice formality.

So let’s continue doing that. If we denote by b the number of 100g units of broccoli we decide to buy, and m the amount of milk and r the amount of oranges, then we can write the daily cost of food as

\displaystyle \text{cost}(b,m,r) = 0.381 b + 0.1 m + 0.272 r

In the interest of being compact (and again, building toward the general linear programming formulation) we can extract the price information into a single cost vector c = (0.381, 0.1, 0.272), and likewise write our variables as a vector x = (b,m,r). We’re implicitly fixing an ordering on the variables that is maintained throughout the problem, but the choice of ordering doesn’t matter. Now the cost function is just the inner product (dot product) of the cost vector and the variable vector \left \langle c,x \right \rangle. For some reason lots of people like to write this as c^Tx, where c^T denotes the transpose of a matrix, and we imagine that c and x are matrices of size 3 \times 1. I’ll stick to using the inner product bracket notation.

Now for each type of food we get a specific amount of each nutrient, and the sum of those nutrients needs to be bigger than the minimum recommendation. For example, we want at least 1,000 mg of calcium per day, so we require that 1000 \leq 47b + 276m + 40r. Likewise, we can write out a table of the constraints by looking at the columns of our table above.

\displaystyle \begin{matrix} 91b & + & 87m & + & 87r & \geq & 3700 & \text{(water)}\\ 47b & + & 276m & + & 40r & \geq & 1000 & \text{(calcium)} \\ 89.2b & + & 0m & + & 53.2r & \geq & 90 & \text{(vitamin C)} \end{matrix}

In the same way that we extracted the cost data into a vector to separate it from the variables, we can extract all of the nutrient data into a matrix A, and the recommended minimums into a vector v. Traditionally the letter b is used for the minimums vector, but for now we’re using b for broccoli.

A = \begin{pmatrix} 91 & 87 & 87 \\ 47 & 276 & 40 \\ 89.2 & 0 & 53.2 \end{pmatrix}

v = \begin{pmatrix} 3700 \\ 1000 \\ 90 \end{pmatrix}

And now the constraint is that Ax \geq v, where the \geq means “greater than or equal to in every coordinate.” So now we can write down the more general form of the problem for our specific matrices and vectors. That is, our problem is to minimize \left \langle c,x \right \rangle subject to the constraint that Ax \geq v. This is often written in offset form to contrast it with variations we’ll see in a bit:

\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to the constraint } Ax \geq v

In general there’s no reason you can’t have a “negative” amount of one variable, but in this problem you can’t buy negative broccoli, so we’ll add constraints to ensure each variable is nonnegative. So our final form is

\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to } Ax \geq v \\ \text{and } x \geq 0

In general, if you have an m \times n matrix A, a “minimums” vector v \in \mathbb{R}^m, and a cost vector c \in \mathbb{R}^n, the problem of finding the vector x that minimizes the cost function while meeting the constraints is called a linear programming problem or simply a linear program.

To satiate the reader’s burning curiosity, the solution for our calcium/vitamin C problem is roughly x = (1.01, 41.47, 0). That is, you should have about 100g of broccoli and about 4.1kg of milk (roughly 4 liters), and skip the oranges entirely. The daily cost is about 4.53 USD. If this seems awkwardly large, it’s because there are cheaper ways to get water than milk.
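
If you want to double check these numbers yourself (and you have numpy and scipy handy), here’s a minimal sketch using scipy.optimize.linprog. This is just a spot check of the claim above; writing and understanding a solver of our own is the subject of the next posts.

import numpy as np
from scipy.optimize import linprog

c = np.array([0.381, 0.100, 0.272])   # cost per 100g: broccoli, milk, oranges
A = np.array([[91, 87, 87],           # water (g) per 100g of each food
              [47, 276, 40],          # calcium (mg)
              [89.2, 0, 53.2]])       # vitamin C (mg)
v = np.array([3700, 1000, 90])        # daily minimums

# linprog minimizes c @ x subject to A_ub @ x <= b_ub, so we negate Ax >= v.
# The default bounds already enforce x >= 0.
result = linprog(c, A_ub=-A, b_ub=-v)
print(result.x)    # roughly [1.01, 41.47, 0]
print(result.fun)  # roughly 4.53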

[Image: 100g of broccoli (image source: 100-grams.blogspot.com)]

[1] Water content of fruits and veggies, food costs in March 2014 in the Midwest, and basic known facts about the water density/nutritional content of various foods.

Duality

Now that we’ve seen the general form of a linear program and a cute example, we can ask the real meaty question: is there an efficient algorithm that solves arbitrary linear programs? Despite how widely applicable these problems seem, the answer is yes!

But before we can describe the algorithm we need to know more about linear programs. For example, say you have some vector x which satisfies your constraints. How can you tell if it’s optimal? Without such a test we’d have no way to know when to terminate our algorithm. Another problem is that we’ve phrased the problem in terms of minimization, but what about problems where we want to maximize things? Can we use the same algorithm that finds minima to find maxima as well?

Both of these problems are neatly answered by the theory of duality. In mathematics in general, the best way to understand what people mean by “duality” is that one mathematical object uniquely determines two different perspectives, each useful in its own way. And typically a duality theorem provides one with an efficient way to transform one perspective into the other, and relate the information you get from both perspectives. A theory of duality is considered beautiful because it gives you truly deep insight into the mathematical object you care about.

In linear programming, duality is between maximization and minimization. In particular, every maximization problem has a unique “dual” minimization problem, and vice versa. The really interesting thing is that the variables you’re trying to optimize in one form correspond to the constraints in the other form! Here’s how one might discover such a beautiful correspondence. We’ll use a made-up example with small numbers to make things easy.

So you have this optimization problem

\displaystyle \begin{matrix}  \text{minimize} & 4x_1+3x_2+9x_3 & \\  \text{subject to} & x_1+x_2+x_3 & \geq 6 \\  & 2x_1+x_3 & \geq 2 \\  & x_2+x_3 & \geq 1 & \\ & x_1,x_2,x_3 & \geq 0 \end{matrix}

Just for giggles let’s write out what A, c, and v are.

\displaystyle A = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, c = (4,3,9), v = (6,2,1)

Say you want to come up with a lower bound on the optimal solution to your problem. That is, you want to know that you can’t make 4x_1 + 3x_2 + 9x_3 smaller than some number m. The constraints can help us derive such lower bounds. In particular, every variable has to be nonnegative, so we know that 4x_1 + 3x_2 + 9x_3 \geq x_1 + x_2 + x_3 \geq 6, and so 6 is a lower bound on our optimum. Likewise,

\displaystyle \begin{aligned}4x_1+3x_2+9x_3 & \geq 4x_1+2x_3+3x_2+3x_3 \\ &=2(2x_1 + x_3)+3(x_2+x_3) \\ & \geq 2 \cdot 2 + 3 \cdot 1 \\ &=7\end{aligned}

and that’s an even better lower bound than 6. We could try to write this approach down in general: find some numbers y_1, y_2, y_3 that we’ll use for each constraint to form

\displaystyle y_1(\text{constraint 1}) + y_2(\text{constraint 2}) + y_3(\text{constraint 3})

To make it a valid lower bound we need to ensure that the coefficients of each of the x_i are at most the coefficients in the objective function (i.e. that the coefficient of x_1 ends up at most 4). And to make it the best lower bound possible we want to maximize what the right-hand side of the inequality would be: 6y_1 + 2y_2 + y_3. If you write out these inequalities and the constraints you get our “lower bound” problem written as

\displaystyle \begin{matrix} \text{maximize} & 6y_1 + 2y_2 + y_3 & \\ \text{subject to} & y_1 + 2y_2 & \leq 4 \\ & y_1 + y_3 & \leq 3 \\ & y_1+y_2 + y_3 & \leq 9 \\ & y_1,y_2,y_3 & \geq 0 \end{matrix}

And wouldn’t you know, the matrix providing the constraints is A^T, and the vectors c and v have switched places.

\displaystyle A^T = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}

This is no coincidence. All linear programs can be transformed in this way, and it would be a useful exercise for the reader to turn the above maximization problem back into a minimization problem by the same technique (computing linear combinations of the constraints to make upper bounds). You’ll be surprised to find that you get back to the original minimization problem! This is part of what makes it “duality,” because the dual of the dual is the original thing again. Often, when we fix the “original” problem, we call it the primal form to distinguish it from the dual form. Usually the primal problem is the one that is easy to interpret.

(Note: because we’re done with broccoli for now, we’re going to use b to denote the constraint vector that used to be v.)

Now say you’re given the data of a linear program for minimization, that is the vectors c, b and matrix A for the problem, “minimize \left \langle c, x \right \rangle subject to Ax \geq b; x \geq 0.” We can make a general definition: the dual linear program is the maximization problem “maximize \left \langle b, y \right \rangle subject to A^T y \leq c, y \geq 0.” Here y is the new set of variables and the superscript T denotes the transpose of the matrix. The constraint for the dual is often written y^T A \leq c^T, again identifying vectors with single-column matrices, but I find the swamp of transposes pointless and annoying (why do things need to be columns?).
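
To make the transformation concrete, here’s a tiny numpy sketch on the toy example above. The dual helper is made up for illustration (it’s not a library function); it just swaps the roles of c and b and transposes A. We also check that the certificate y = (0, 2, 3) from the lower bound derivation is feasible for the dual and gives the bound of 7.

import numpy as np

def dual(c, A, b):
    # The dual of "minimize <c,x> s.t. Ax >= b, x >= 0"
    # is "maximize <b,y> s.t. A^T y <= c, y >= 0".
    return b, A.T, c

c = np.array([4, 3, 9])
A = np.array([[1, 1, 1],
              [2, 0, 1],
              [0, 1, 1]])
b = np.array([6, 2, 1])

# Taking the dual twice gives back the original data.
c2, A2, b2 = dual(*dual(c, A, b))
assert np.array_equal(c2, c) and np.array_equal(A2, A) and np.array_equal(b2, b)

# The weights y = (0, 2, 3) used in the lower bound derivation are dual-feasible,
# so <b, y> = 7 is a valid lower bound on the primal optimum.
y = np.array([0, 2, 3])
assert np.all(A.T @ y <= c) and np.all(y >= 0)
print(b @ y)  # 7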

Now we can actually prove that the objective function for the dual provides a bound on the objective function for the original problem. It’s obvious from the work we’ve done, which is why it’s called the weak duality theorem.

Weak Duality Theorem: Let c, A, b be the data of a linear program in the primal form (the minimization problem) whose objective function is \left \langle c, x \right \rangle. Recall that the objective function of the dual (maximization) problem is \left \langle b, y \right \rangle. If x,y are feasible solutions (satisfy the constraints of their respective problems), then

\left \langle b, y \right \rangle \leq \left \langle c, x \right \rangle

In other words, the maximum of the dual is a lower bound on the minimum of the primal problem and vice versa. Moreover, any feasible solution for one provides a bound on the other.

Proof. The proof is pleasingly simple. Just inspect the quantity \left \langle A^T y, x \right \rangle = \left \langle y, Ax \right \rangle. The constraints from the definitions of the primal and dual give us that

\left \langle y, b \right \rangle \leq \left \langle y, Ax \right \rangle = \left \langle A^Ty, x \right \rangle \leq \left \langle c,x \right \rangle

The inequalities follow from the linear algebra fact that if the u in \left \langle u,v \right \rangle is nonnegative, then you can only increase the size of the product by increasing the components of v. This is why we need the nonnegativity constraints.
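
As a quick sanity check of this chain of inequalities (a numpy sketch on the toy example from before), take any primal-feasible x and dual-feasible y and evaluate the three quantities:

import numpy as np

c = np.array([4, 3, 9]); b = np.array([6, 2, 1])
A = np.array([[1, 1, 1], [2, 0, 1], [0, 1, 1]])

x = np.array([2, 4, 0])   # primal-feasible: Ax = (6, 4, 4) >= b and x >= 0
y = np.array([1, 1, 1])   # dual-feasible: A^T y = (3, 2, 3) <= c and y >= 0

print(y @ b, y @ (A @ x), c @ x)   # 9 <= 14 <= 20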

In fact, the world is much more pleasing. There is a theorem that says the two optimums are equal!

Strong Duality Theorem: If there are any feasible solutions x,y to the primal (minimization) problem and the dual (maximization) problem, respectively, then the two problems also have optimal solutions x^*, y^*, and two candidate solutions x^*, y^* are optimal if and only if they produce equal objective values \left \langle c, x^* \right \rangle = \left \langle y^*, b \right \rangle.

The proof of this theorem is a bit more convoluted than the weak duality theorem, and the key technique is a lemma of Farkas and its variations. See the second half of these notes for a full proof. The nice thing is that this theorem gives us a way to tell if an algorithm to solve linear programs is done: maintain a pair of feasible solutions to the primal and dual problems, improve them by some rule, and stop when the two solutions give equal objective values. The hard part, then, is finding a principled and guaranteed way to improve a given pair of solutions.

On the other hand, you can also prove the strong duality theorem by inventing an algorithm that provably terminates. We’ll see such an algorithm, known as the simplex algorithm, in the next post. Sneak peek: it’s a lot like Gaussian elimination. Then we’ll use the algorithm (or an equivalent industry-strength version) to solve a much bigger nutrition problem.

In fact, you can do a bit better than the strong duality theorem, in terms of coming up with a stopping condition for a linear programming algorithm. You can observe that an optimal solution implies further constraints on the relationship between the primal and the dual problems. In particular, these are called the complementary slackness conditions, and they essentially say that if an optimal solution to the primal has a positive variable then the corresponding constraint in the dual problem must be tight (is an equality) to get an optimal solution to the dual. The contrapositive says that if some constraint is slack, or a strict inequality, then either the corresponding variable is zero or else the solution is not optimal. More formally,

Theorem (Complementary Slackness Conditions): Let A, c, b be the data of the primal form of a linear program, “minimize \left \langle c, x \right \rangle subject to Ax \geq b, x \geq 0.” Then x^*, y^* are optimal solutions to the primal and dual problems if and only if all of the following conditions hold.

  • x^*, y^* are both feasible for their respective problems.
  • Whenever x^*_i > 0 the corresponding constraint A^T_i y^* = c_i is an equality.
  • Whenever y^*_j > 0 the corresponding constraint A_j x^* = b_j is an equality.

Here we denote by M_i the i-th row of the matrix M and by v_i the i-th entry of a vector v. Another way to write the conditions using vectors instead of English is

\left \langle x^*, A^T y^* - c \right \rangle = 0
\left \langle y^*, Ax^* - b \right \rangle = 0

The proof follows from the duality theorems, and just involves pushing around some vector algebra. See section 6.2 of these notes.

One can interpret complementary slackness in linear programs in a lot of different ways. For us, it will simply be a termination condition for an algorithm: one can efficiently check all of these conditions for the nonzero variables and stop if they’re all satisfied or if we find a variable that violates a slackness condition. Indeed, in more mature optimization analyses, the slackness condition that is most egregiously violated can provide evidence for where a candidate solution can best be improved. For a more intricate and detailed story about how to interpret the complementary slackness conditions, see Section 4 of these notes by Joel Sobel.
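
As a quick numerical illustration (a sketch assuming scipy is available), we can solve the toy primal and dual from before, watch strong duality hold, and check the slackness conditions directly. The tolerance is just there to absorb floating point noise.

import numpy as np
from scipy.optimize import linprog

c = np.array([4, 3, 9])
A = np.array([[1, 1, 1],
              [2, 0, 1],
              [0, 1, 1]])
b = np.array([6, 2, 1])

# Primal: minimize <c,x> s.t. Ax >= b, x >= 0 (negate to fit linprog's A_ub x <= b_ub).
primal = linprog(c, A_ub=-A, b_ub=-b)
# Dual: maximize <b,y> s.t. A^T y <= c, y >= 0 (negate the objective to maximize).
dual = linprog(-b, A_ub=A.T, b_ub=c)

x, y = primal.x, dual.x
print(c @ x, b @ y)  # equal at the optimum, by strong duality

tol = 1e-7
# Whenever x_i > 0 the i-th dual constraint is tight, and
# whenever y_j > 0 the j-th primal constraint is tight.
assert all(abs((A.T @ y)[i] - c[i]) < tol for i in range(len(x)) if x[i] > tol)
assert all(abs((A @ x)[j] - b[j]) < tol for j in range(len(y)) if y[j] > tol)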

Finally, before we close we should note there are geometric ways to think about linear programming. I have my preferred visualization in my head, but I have yet to find a suitable animation on the web that replicates it. Here’s one example in two dimensions. The set of constraints defines a convex geometric region in the plane:

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

Now the optimization function f(x) = \left \langle c,x \right \rangle is also a linear function, and if you fix some output value y = f(x) this defines a line in the plane. As y changes, the line moves along its normal vector (that is, all these fixed lines are parallel). Now to geometrically optimize the target function, we can imagine starting with the line f(x) = 0, and sliding it along its normal vector in the direction that keeps it in the feasible region. We can keep sliding it in this direction, and the maximum of the function is just the last instant that this line intersects the feasible region. If none of the constraints are parallel to the family of lines defined by f, then this is guaranteed to occur at a vertex of the feasible region. Otherwise, there will be a family of optima lying anywhere on the line segment of last intersection.

In higher dimensions, the only change is that the lines become affine subspaces of dimension n-1. That means in three dimensions you’re sliding planes, in four dimensions you’re sliding 3-dimensional hyperplanes, etc. The facts about the last intersection being a vertex or a “line segment” still hold. So as we’ll see next time, successful algorithms for linear programming in practice take advantage of this observation by efficiently traversing the vertices of this convex region. We’ll see this in much more detail in the next post.

Until then!

A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)

The wedge \wedge denotes a logical AND and the vee \vee denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an x_i or the negation \overline{x_i}, and connectives \wedge and \vee, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form C_1 \vee C_2 \vee C_3 where each C_i is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” means that each clause has exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the \vee‘s in a 3-term DNF formula using the distributive law, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph G, we’ll construct a set S_G of labeled truth assignments, and we’ll define the distribution D to be the uniform distribution over those truth assignments used in S_G. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in S_G exactly how we labeled them, and we set the allowed error \varepsilon to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of G). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of G).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and each edge should correspond to a negative example that no single term can satisfy.

For a graph G with n nodes v_1, \dots, v_n and a set of m undirected edges E, we construct a set of examples with positive labels S^+ and a set with negative labels S^-. The examples are truth assignments to n variables, which we label x_1, \dots, x_n, and we identify a truth assignment with the \left \{ 0,1 \right \}-valued vector (x_1, x_2, \dots, x_n) in the usual way (true is 1, false is 0).

The positive examples S^+ are simple: for each v_i add the truth assignment x_i = F and x_j = T for j \neq i. I.e., the binary vector is (1, \dots, 1,0,1, \dots, 1), and the zero is in the i-th position.

The negative examples S^- come from the edges. For each edge (v_i, v_j) \in E, we add the example with a zero in the i-th and j-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

[Figure: PAC-reduction — an example graph with its labeled positive and negative examples]
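
Here’s a short sketch of the construction in code. The graph is a made-up example for illustration (a path on four vertices plus one extra edge), and the helper name is hypothetical.

def examples_from_graph(n, edges):
    # Positive example for vertex i: a zero in position i, ones everywhere else.
    positives = [tuple(0 if j == i else 1 for j in range(n)) for i in range(n)]
    # Negative example for edge (i, j): zeros in positions i and j, ones elsewhere.
    negatives = [tuple(0 if k in (i, j) else 1 for k in range(n)) for (i, j) in edges]
    return positives, negatives

# A toy graph on 4 vertices: the path 0-1-2-3 plus the edge 0-2.
S_plus, S_minus = examples_from_graph(4, [(0, 1), (1, 2), (2, 3), (0, 2)])
print(S_plus)   # [(0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)]
print(S_minus)  # [(0, 0, 1, 1), (1, 0, 0, 1), (1, 1, 0, 0), (0, 1, 0, 1)]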

Claim: G is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula \varphi.

Again, consistent just means that \varphi is satisfied by every truth assignment in S^+ and unsatisfied by every example in S^-. Since we chose our distribution to be uniform over S^+ \cup S^-, we don’t care what \varphi does elsewhere.

Indeed, if G is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let T_R be the AND of all the literals x_i for which vertex v_i is not red. Then for each red vertex v_i, the corresponding example in S^+ satisfies T_R: that example has ones everywhere except in the i-th position, and since v_i is red the literal x_i does not appear in T_R. Similarly, no example in S^- will make T_R true, because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is (v_1,v_2). Then the corresponding negative example is (0,0,1). Unless both v_1 and v_2 are colored red, one of x_1, x_2 will have to be ANDed as part of T_R. But the example has a zero for both x_1 and x_2, so T_R would not be satisfied.

Do the same thing for blue and yellow, and OR the three terms together to get T_R \vee T_B \vee T_Y. Since the argument is symmetric for the other colors, every example in S^+ satisfies the term for its vertex’s color and no example in S^- satisfies any term, so we get a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF \varphi. We need to construct a 3-coloring of G. It goes in largely the same way: label the clauses \varphi = T_R \vee T_B \vee T_Y for Red, Blue, and Yellow, and then color a vertex v_i the color of a clause that is satisfied by the corresponding example in S^+. There must be some clause that does this because \varphi is consistent with S^+, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge (v_i, v_j), and both v_i and v_j are colored, say, blue. Look at the clause T_B: since v_i and v_j are both blue, the positive examples corresponding to those vertices (with a 0 in the single index and 1’s everywhere else) both make T_B true. Since those two positive examples differ in both their i-th and j-th positions, T_B can’t have any of the literals x_i, \overline{x_i}, x_j, \overline{x_j}. But then the negative example for the edge would satisfy T_B because it has 1’s everywhere except i,j! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.
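
Continuing the sketch from above: given a valid 3-coloring of the toy graph, we can build the three terms and verify the resulting formula is consistent with the examples. The helpers are again made up for illustration; each term is represented by the set of indices i whose literal x_i appears in the AND.

def terms_from_coloring(coloring):
    # T_color = AND of x_i over all vertices v_i that are NOT that color.
    return [[i for i, col in enumerate(coloring) if col != c] for c in set(coloring)]

def satisfies(example, terms):
    # A 3-term DNF is satisfied if some term has all of its literals equal to 1.
    return any(all(example[i] == 1 for i in term) for term in terms)

# The toy graph from before (path 0-1-2-3 plus edge 0-2) and its examples.
S_plus = [(0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)]
S_minus = [(0, 0, 1, 1), (1, 0, 0, 1), (1, 1, 0, 0), (0, 1, 0, 1)]

terms = terms_from_coloring(['red', 'blue', 'yellow', 'red'])  # a valid 3-coloring
assert all(satisfies(e, terms) for e in S_plus)
assert not any(satisfies(e, terms) for e in S_minus)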

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph G; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick \delta < 1/2 and \varepsilon \leq 1/(2|S^+ \cup S^-|) in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least 1 / |S^+ \cup S^-|, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables by only a polynomial factor to handle 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.
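
For conjunctions, the “smallest consistent hypothesis” idea is the classical elimination algorithm: start with the AND of all 2n literals and delete any literal that a positive example contradicts. Here is a minimal sketch (the variable encoding and names are just for illustration); making this into a full PAC-learning proof is the exercise mentioned above.

def learn_conjunction(n, positive_examples):
    # Represent a literal as (i, s): (i, 1) means x_i and (i, 0) means NOT x_i.
    # Start with all 2n literals and delete those contradicted by a positive example.
    literals = {(i, s) for i in range(n) for s in (0, 1)}
    for example in positive_examples:
        literals -= {(i, 1 - example[i]) for i in range(n)}
    return literals

def satisfies(example, literals):
    return all(example[i] == s for (i, s) in literals)

# Example: positive examples drawn from the target conjunction "x_0 AND NOT x_2".
h = learn_conjunction(3, [(1, 0, 0), (1, 1, 0)])
print(h)                        # the literals (0, 1) and (2, 0), i.e. x_0 AND NOT x_2
print(satisfies((0, 1, 0), h))  # False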

The important thing to note is that a concept class C (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within C. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning C suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class \mathsf{C} over a set X is efficiently PAC-learnable using the hypothesis class \mathsf{H} if there exists an algorithm A(\varepsilon, \delta) with access to a query function for \mathsf{C} and runtime O(\text{poly}(1/\varepsilon, 1/\delta)), such that for all c \in \mathsf{C}, all distributions D over X, and all 0 < \delta , \varepsilon < 1/2, the probability that A produces a hypothesis h \in \mathsf{H} with error at most \varepsilon is at least 1-\delta.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!