Optimization Models for Subset Cover

In a recent newsletter article I complained about how researchers mislead about the applicability of their work. I gave SAT solvers as an example. People provided interesting examples in response, but what was new to me was the concept of SMT (Satisfiability Modulo Theories), an extension to SAT. SMT seems to have more practical uses than vanilla SAT (see the newsletter for details).

I wanted to take some time to explore SMT solvers, and I landed on Z3, an open-source SMT solver from Microsoft. In particular, I wanted to compare it to ILP (Integer Linear Programing) solvers, which I know relatively well. I picked a problem that I thought would work better for SAT-ish solvers than for ILPs: subset covering (explained in the next section). If ILP still wins against Z3, then that would be not so great for the claim that SMT is a production strength solver.

All the code used for this post is on Github.

Subset covering

A subset covering is a kind of combinatorial design, which can be explained in terms of magic rings.

An adventurer stumbles upon a chest full of magic rings. Each ring has a magical property, but some pairs of rings, when worn together on the same hand, produce a combined special magical effect distinct to that pair.

The adventurer would like to try all pairs of rings to catalogue the magical interactions. With only five fingers, how can we minimize the time spent trying on rings?

Mathematically, the rings can be described as a set $ X$ of size $ n$. We want to choose a family $ F$ of subsets of $ X$, with each subset having size 5 (five fingers), such that each subset of $ X$ of size 2 (pairs of rings) is contained in some subset of $ F$. And we want $ F$ to be as small as possible.

Subset covering is not a “production worthy” problem. Rather, I could imagine it’s useful in some production settings, but I haven’t heard of one where it is actually used. I can imagine, for instance, that a cluster of machines has some bug occurring seemingly at random for some point-to-point RPCs, and in tracking down the problem, you want to deploy a test change to subsets of servers to observe the bug occurring. Something like an experiment design problem.

If you generalize the “5” in “5 fingers” to an arbitrary positive integer $ k$, and the “2” in “2 rings” to $ l < k$, then we have the general subset covering problem. Define $ M(n, k, l)$ to be the minimal number of subsets of size $ k$ needed to cover all subsets of size $ l$. This problem was studied by Erdős, with a conjecture subsequently proved by Vojtěch Rödl, that asymptotically $ M(n,k,l)$ grows like $ \binom{n}{l} / \binom{k}{l}$. Additional work by Joel Spencer showed that a greedy algorithm is essentially optimal.

However, all of the constructive algorithms in these proofs involve enumerating all $ \binom{n}{k}$ subsets of $ X$. This wouldn’t scale very well. You can alternatively try a “random” method, incurring a typically $ \log(r)$ factor of additional sets required to cover a $ 1 – 1/r$ fraction of the needed subsets. This is practical, but imperfect.

To the best of my knowledge, there is no exact algorithm, that both achieves the minimum and is efficient in avoiding constructing all $ \binom{n}{k}$ subsets. So let’s try using an SMT solver. I’ll be using the Python library for Z3.

Baseline: brute force Z3

For a baseline, let’s start with a simple Z3 model that enumerates all the possible subsets that could be chosen. This leads to an exceedingly simple model to compare the complex models against.

Define boolean variables $ \textup{Choice}_S$ which is true if and only if the subset $ S$ is chosen (I call this a “choice set”). Define boolean variables $ \textup{Hit}_T$ which is true if the subset $ T$ (I call this a “hit set”) is contained in a chosen choice set. Then the subset cover problem can be defined by two sets of implications.

First, if $ \textup{Choice}_S$ is true, then so must all $ \textup{Hit}_T$ for $ T \subset S$. E.g., for $ S = \{ 1, 2, 3 \}$ and $ l=2$, we get

$ \displaystyle \begin{aligned} \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(1,2)} \\ \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(1,3)} \\ \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(2,3)} \end{aligned}$

In Python this looks like the following (note this program has some previously created lookups and data structures containing the variables)

for choice_set in choice_sets:
  for hit_set_key in combinations(choice_set.elements, hit_set_size):
    hit_set = hit_set_lookup[hit_set_key]
    implications.append(
      z3.Implies(choice_set.variable, hit_set.variable))

Second, if $ \textup{Hit}_T$ is true, it must be that some $ \textup{Choice}_S$ is true for some $ S$ containing $ T$ as a subset. For example,

$ \displaystyle \begin{aligned} \textup{Hit}_{(1,2)} &\implies \\ & \textup{Choice}_{(1,2,3,4)} \textup{ OR} \\ & \textup{Choice}_{(1,2,3,5)} \textup{ OR} \\ & \textup{Choice}_{(1,2,4,5)} \textup{ OR } \cdots \\ \end{aligned}$

In code,

for hit_set in hit_sets.values():
  relevant_choice_set_vars = [
    choice_set.variable
    for choice_set in hit_set_to_choice_set_lookup[hit_set]
  ]
  implications.append(
    z3.Implies(
      hit_set.variable, 
      z3.Or(*relevant_choice_set_vars)))

Next, in this experiment we’re allowing the caller to specify the number of choice sets to try, and the solver should either return SAT or UNSAT. From that, we can use a binary search to find the optimal number of sets to pick. Thus, we have to limit the number of $ \textup{Choice}_S$ that are allowed to be true and false. Z3 supports boolean cardinality constraints, apparently with a special solver to handle problems that have them. Otherwise, the process of encoding cardinality constraints as SAT formulas is not trivial (and the subject of active research). But the code is simple enough:

args = [cs.variable for cs in choice_sets] + [parameters.num_choice_sets]
choice_sets_at_most = z3.AtMost(*args)
choice_sets_at_least = z3.AtLeast(*args)

Finally, we must assert that every $ \textup{Hit}_T$ is true.

solver = z3.Solver()
for hit_set in hit_sets.values():
  solver.add(hit_set.variable)

for impl in implications:
  solver.add(impl)

solver.add(choice_sets_at_most)
solver.add(choice_sets_at_least)

Running it for $ n=7, k=3, l=2$, and seven choice sets (which is optimal), we get

>>> SubsetCoverZ3BruteForce().solve(
  SubsetCoverParameters(
    num_elements=7,
    choice_set_size=3,
    hit_set_size=2,
    num_choice_sets=7))
[(0, 1, 3), (0, 2, 4), (0, 5, 6), (1, 2, 6), (1, 4, 5), (2, 3, 5), (3, 4, 6)]
SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.018305063247680664)

Interestingly, Z3 refuses to solve marginally larger instances. For instance, I tried the following and Z3 times out around $ n=12, k=6$ (about 8k choice sets):

from math import comb

for n in range(8, 16):
  k = int(n / 2)
  l = 3
  max_num_sets = int(2 * comb(n, l) / comb(k, l))
  params = SubsetCoverParameters(
    num_elements=n,
    choice_set_size=k,
    hit_set_size=l,                                   
    num_choice_sets=max_num_sets)

    print_table(
      params, 
      SubsetCoverZ3BruteForce().solve(params), 
      header=(n==8))

After taking a long time to generate the larger models, Z3 exceeds my 15 minute time limit, suggesting exponential growth:

status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED   0.0271              8             4                3             28
SolveStatus.SOLVED   0.0346              9             4                3             42
SolveStatus.SOLVED   0.0735              10            5                3             24
SolveStatus.SOLVED   0.1725              11            5                3             33
SolveStatus.SOLVED   386.7376            12            6                3             22
SolveStatus.UNKNOWN  900.1419            13            6                3             28
SolveStatus.UNKNOWN  900.0160            14            7                3             20
SolveStatus.UNKNOWN  900.0794            15            7                3             26

An ILP model

Next we’ll see an ILP model for the sample problem. Note there are two reasons I expect the ILP model to fall short. First, the best solver I have access to is SCIP, which, despite being quite good is, in my experience, about an order of magnitude slower than commercial alternatives like Gurobi. Second, I think this sort of problem seems to not be very well suited to ILPs. It would take quite a bit longer to explain why (maybe another post, if you’re interested), but in short well-formed ILPs have easily found feasible solutions (this one does not), and the LP-relaxation of the problem should be as tight as possible. I don’t think my formulation is very tight, but it’s possible there is a better formulation.

Anyway, the primary difference in my ILP model from brute force is that the number of choice sets is fixed in advance, and the members of the choice sets are model variables. This allows us to avoid enumerating all choice sets in the model.

In particular, $ \textup{Member}_{S,i} \in \{ 0, 1 \}$ is a binary variable that is 1 if and only if element $ i$ is assigned to be in set $ S$. And $ \textup{IsHit}_{T, S} \in \{0, 1\}$ is 1 if and only if the hit set $ T$ is a subset of $ S$. Here “$ S$” is an index over the subsets, rather than the set itself, because we don’t know what elements are in $ S$ while building the model.

For the constraints, each choice set $ S$ must have size $ k$:

$ \displaystyle \sum_{i \in X} \textup{Member}_{S, i} = k$

Each hit set $ T$ must be hit by at least one choice set:

$ \displaystyle \sum_{S} \textup{IsHit}_{T, S} \geq 1$

Now the tricky constraint. If a hit set $ T$ is hit by a specific choice set $ S$ (i.e., $ \textup{IsHit}_{T, S} = 1$) then all the elements in $ T$ must also be members of $ S$.

$ \displaystyle \sum_{i \in T} \textup{Member}_{S, i} \geq l \cdot \textup{IsHit}_{T, S}$

This one works by the fact that the left-hand side (LHS) is bounded from below by 0 and bounded from above by $ l = |T|$. Then $ \textup{IsHit}_{T, S}$ acts as a switch. If it is 0, then the constraint is vacuous since the LHS is always non-negative. If $ \textup{IsHit}_{T, S} = 1$, then the right-hand side (RHS) is $ l = |T|$ and this forces all variables on the LHS to be 1 to achieve it.

Because we fixed the number of choice sets as a parameter, the objective is 1, and all we’re doing is looking for a feasible solution. The full code is here.

On the same simple example as the brute force

>>> SubsetCoverILP().solve(
  SubsetCoverParameters(
    num_elements=7,
    choice_set_size=3,
    hit_set_size=2,
    num_choice_sets=7))
[(0, 1, 3), (0, 2, 6), (0, 4, 5), (1, 2, 4), (1, 5, 6), (2, 3, 5), (3, 4, 6)]
SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.1065816879272461)

It finds the same solution in about 10x the runtime as the brute force Z3 model, though still well under one second.

On the “scaling” example, it fares much worse. With a timeout of 15 minutes, it solves $ n=8,$ decently fast, $ n=9,12$ slowly, and times out on the rest.

status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED   1.9969              8             4                3             28
SolveStatus.SOLVED   306.4089            9             4                3             42
SolveStatus.UNKNOWN  899.8842            10            5                3             24
SolveStatus.UNKNOWN  899.4849            11            5                3             33
SolveStatus.SOLVED   406.9502            12            6                3             22
SolveStatus.UNKNOWN  902.7807            13            6                3             28
SolveStatus.UNKNOWN  900.0826            14            7                3             20
SolveStatus.UNKNOWN  900.0731            15            7                3             26

A Z3 Boolean Cardinality Model

The next model uses Z3. It keeps the concept of Member and Hit variables, but they are boolean instead of integer. It also replaces the linear constraints with implications. The constraint that forces a Hit set’s variable to be true when some Choice set contains all its elements is (for each $ S, T$)

$ \displaystyle \left ( \bigwedge_{i \in T} \textup{Member}_{S, i} \right ) \implies \textup{IsHit}_T$

Conversely, A Hit set’s variable being true implies its members are in some choice set.

$ \displaystyle \textup{IsHit}_T \implies \bigvee_{S} \bigwedge_{i \in T} \textup{Member}_{S, i}$

Finally, we again use boolean cardinality constraints AtMost and AtLeast so that each choice set has the right size.

The results are much better than the ILP: it solves all of the instances in under 3 seconds

status              solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED  0.0874              8             4                3             28
SolveStatus.SOLVED  0.1861              9             4                3             42
SolveStatus.SOLVED  0.1393              10            5                3             24
SolveStatus.SOLVED  0.2845              11            5                3             33
SolveStatus.SOLVED  0.2032              12            6                3             22
SolveStatus.SOLVED  1.3661              13            6                3             28
SolveStatus.SOLVED  0.8639              14            7                3             20
SolveStatus.SOLVED  2.4877              15            7                3             26

A Z3 integer model

Z3 supports implications on integer equation equalities, so we can try a model that leverages this by essentially converting the boolean model to one where the variables are 0-1 integers, and the constraints are implications on equality of integer formulas (all of the form “variable = 1”).

I expect this to perform worse than the boolean model, even though the formulation is almost identical. The details of the model are here, and it’s so similar to the boolean model above that it needs no extra explanation.

The runtime is much worse, but surprisingly it still does better than the ILP model.

status              solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED  2.1129              8             4                3             28
SolveStatus.SOLVED  14.8728             9             4                3             42
SolveStatus.SOLVED  7.6247              10            5                3             24
SolveStatus.SOLVED  25.0607             11            5                3             33
SolveStatus.SOLVED  30.5626             12            6                3             22
SolveStatus.SOLVED  63.2780             13            6                3             28
SolveStatus.SOLVED  57.0777             14            7                3             20
SolveStatus.SOLVED  394.5060            15            7                3             26

Harder instances

So far all the instances we’ve been giving the solvers are “easy” in a sense. In particular, we’ve guaranteed there’s a feasible solution, and it’s easy to find. We’re giving roughly twice as many sets as are needed. There are two ways to make this problem harder. One is to test on unsatisfiable instances, which can be harder because the solver has to prove it can’t work. Another is to test on satisfiable instances that are hard to find, such as those satisfiable instances where the true optimal number of choice sets is given as the input parameter. The hardest unsatisfiable instances are also the ones where the number of choice sets allowed is one less than optimal.

Let’s test those situations. Since $ M(7, 3, 2) = 7$, we can try with 7 choice sets and 6 choice sets.

For 7 choice sets (the optimal value), all the solvers do relatively well

method                    status  solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SubsetCoverILP            SOLVED  0.0843              7             3                2             7
SubsetCoverZ3Integer      SOLVED  0.0938              7             3                2             7
SubsetCoverZ3BruteForce   SOLVED  0.0197              7             3                2             7
SubsetCoverZ3Cardinality  SOLVED  0.0208              7             3                2             7

For 6, the ILP struggles to prove it’s infeasible, and the others do comparatively much better (at least 17x better).

method                    status      solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SubsetCoverILP            INFEASIBLE  120.8593            7             3                2             6
SubsetCoverZ3Integer      INFEASIBLE  3.0792              7             3                2             6
SubsetCoverZ3BruteForce   INFEASIBLE  0.3384              7             3                2             6
SubsetCoverZ3Cardinality  INFEASIBLE  7.5781              7             3                2             6

This seems like hard evidence that Z3 is better than ILPs for this problem (and it is), but note that the same test on $ n=8$ fails for all models. They can all quickly prove $ 8 < M(8, 3, 2) \leq 11$, but time out after twenty minutes when trying to determine if $ M(8, 3, 2) \in \{ 9, 10 \}$. Note that $ k=3, l=2$ is the least complex choice for the other parameters, so it seems like there’s not much hope to find $ M(n, k, l)$ for any seriously large parameters, like, say, $ k=6$.

Thoughts

These experiments suggest what SMT solvers can offer above and beyond ILP solvers. Disjunctions and implications are notoriously hard to model in an ILP. You often need to add additional special variables, or do tricks like the one I did that only work in some situations and which can mess with the efficiency of the solver. With SMT, implications are trivial to model, and natively supported by the solver.

Aside from reading everything I could find on Z3, there seems to be little advice on modeling to help the solver run faster. There is a ton of literature for this in ILP solvers, but if any readers see obvious problems with my SMT models, please chime in! I’d love to hear from you. Even without that, I am pretty impressed by how fast the solves finish for this subset cover problem (which this experiment has shown me is apparently a very hard problem).

However, there’s an elephant in the room. These models are all satisfiability/feasibility checks on a given solution. What is not tested here is optimization, in the sense of having the number of choice sets used be minimized directly by the solver. In a few experiments on even simpler models, z3 optimization is quite slow. And while I know how I’d model the ILP version of the optimization problem, given that it’s quite slow to find a feasible instance when the optimal number of sets is given as a parameter, it seems unlikely that it will be fast when asked to optimize. I will have to try that another time to be sure.

Also, I’d like to test the ILP models on Gurobi, but I don’t have a personal license. There’s also the possibility that I can come up with a much better ILP formulation, say, with a tighter LP relaxation. But these will have to wait for another time.

In the end, this experiment has given me some more food for thought, and concrete first-hand experience, on the use of SMT solvers.

Boolean Logic in Polynomials

Problem: Express a boolean logic formula using polynomials. I.e., if an input variable $ x$ is set to $ 0$, that is interpreted as false, while $ x=1$ is interpreted as true. The output of the polynomial should be 0 or 1 according to whether the formula is true or false as a whole.

Solution: You can do this using a single polynomial.

Illustrating with an example: the formula is $ \neg[(a \vee b) \wedge (\neg c \vee d)]$ also known as

not((a or b) and (not c or d))

The trick is to use multiplication for “and” and $ 1-x$ for “not.” So $ a \wedge b$ would be $ x_1 x_2$, and $ \neg z$ would be $ 1-z$. Indeed, if you have two binary variables $ x$ and $ y$ then $ xy$ is 1 precisely when both are 1, and zero when either variable is zero. Likewise, $ 1-x = 1$ if $ x$ is zero and zero if $ x$ is one.

Combine this with deMorgan’s rule to get any formula. $ a \vee b = \neg(\neg a \wedge \neg b)$ translates to $ 1 – (1-a)(1-b)$. For our example above,

$ \displaystyle f(x_1, x_2, x_3, x_4) = 1 – (1 – (1-a)(1-b))(1 – c(1-d))$

Which expands to

$ \displaystyle 1 – a – b + ab + (1-d)(ac + bc – abc)$

If you plug in $ a = 1, b = 0, c = 1, d = 0$ you get True in the original formula (because “not c or d” is False), and likewise the polynomial is

$ \displaystyle 1 – 1 – 0 + 0 + (1-0)(1 + 0 – 0) = 1$

You can verify the rest work yourself, using the following table as a guide:

0, 0, 0, 0 -&gt; 1
0, 0, 0, 1 -&gt; 1
0, 0, 1, 0 -&gt; 1
0, 0, 1, 1 -&gt; 1
0, 1, 0, 0 -&gt; 0
0, 1, 0, 1 -&gt; 0
0, 1, 1, 0 -&gt; 1
0, 1, 1, 1 -&gt; 0
1, 0, 0, 0 -&gt; 0
1, 0, 0, 1 -&gt; 0
1, 0, 1, 0 -&gt; 1
1, 0, 1, 1 -&gt; 0
1, 1, 0, 0 -&gt; 0
1, 1, 0, 1 -&gt; 0
1, 1, 1, 0 -&gt; 1
1, 1, 1, 1 -&gt; 0

Discussion: This trick is used all over CS theory to embed boolean logic within polynomials, and it makes the name “boolean algebra” obvious, because it’s just a subset of normal algebra.

Moreover, since boolean satisfiability—the problem of algorithmically determining if a boolean formula has a satisfying assignment (a choice of variables evaluating to true)—is NP-hard, this can be used to show certain problems relating to multivariable polynomials is also hard. For example, finding roots of multivariable polynomials (even if you knew nothing about algebraic geometry) is hard because you’d run into NP-hardness by simply considering the subset of polynomials coming from boolean formulas.

Here’s a more interesting example, related to the kinds of optimization problems that show up in modern machine learning. Say you want to optimize a polynomial $ f(x)$ subject to a set of quadratic equality constraints. This is NP-hard. Here’s why.

Let $ \varphi$ be a boolean formula, and $ f_\varphi$ its corresponding polynomial. First, each variable $ x_i$ used in the polynomial can be restricted to binary values via the constraint $ x_i(x_i – 1) = 0$.

You can even show NP-hardness if the target function to optimize is only quadratic. As an exercise, one can express the subset sum problem as a quadratic programming problem using similar choices for the constraints. According to this writeup you even express subset sum as a quadratic program with linear constraints.

The moral of the story is simply that multivariable polynomials can encode arbitrary boolean logic.

An Update on “Coloring Resilient Graphs”

A while back I announced a preprint of a paper on coloring graphs with certain resilience properties. I’m pleased to announce that it’s been accepted to the Mathematical Foundations of Computer Science 2014, which is being held in Budapest this year. Since we first published the preprint we’ve actually proved some additional results about resilience, and so I’ll expand some of the details here. I think it makes for a nicer overall picture, and in my opinion it gives a little more justification that resilient coloring is interesting, at least in contrast to other resilience problems.

Resilient SAT

Recall that a “resilient” yes-instance of a combinatorial problem is one which remains a yes-instance when you add or remove some constraints. The way we formalized this for SAT was by fixing variables to arbitrary values. Then the question is how resilient does an instance need to be in order to actually find a certificate for it? In more detail,

Definition: $ r$-resilient $ k$-SAT formulas are satisfiable formulas in $ k$-CNF form (conjunctions of clauses, where each clause is a disjunction of three literals) such that for all choices of $ r$ variables, every way to fix those variables yields a satisfiable formula.

For example, the following 3-CNF formula is 1-resilient:

$ \displaystyle (a \vee b \vee c) \wedge (a \vee \overline{b} \vee \overline{c}) \wedge (\overline{a} \vee \overline{b} \vee c)$

The idea is that resilience may impose enough structure on a SAT formula that it becomes easy to tell if it’s satisfiable at all. Unfortunately for SAT (though this is definitely not the case for coloring), there are only two possibilities. Either the instances are so resilient that they never existed in the first place (they’re vacuously trivial), or the instances are NP-hard. The first case is easy: there are no $ k$-resilient $ k$-SAT formulas. Indeed, if you’re allowed to fix $ k$ variables to arbitrary values, then you can just pick a clause and set all its variables to false. So no formula can ever remain satisfiable under that condition.

The second case is when the resilience is strictly less than the clause size, i.e. $ r$-resilient $ k$-SAT for $ 0 \leq r < k$. In this case the problem of finding a satisfying assignment is NP-hard. We’ll show this via a sequence of reductions which start at 3-SAT, and they’ll involve two steps: increasing the clause size and resilience, and decreasing the clause size and resilience. The trick is in balancing which parts are increased and decreased. I call the first step the “blowing up” lemma, and the second part the “shrinking down” lemma.

Blowing Up and Shrinking Down

Here’s the intuition behind the blowing up lemma. If you give me a regular (unresilient) 3-SAT formula $ \varphi$, what I can do is make a copy of $ \varphi$ with a new set of variables and OR the two things together. Call this $ \varphi^1 \vee \varphi^2$. This is clearly logically equivalent to the original formula; if you give me a satisfying assignment for the ORed thing, I can just see which of the two clauses are satisfied and use that sub-assignment for $ \varphi$, and conversely if you can satisfy $ \varphi$ it doesn’t matter what truth values you choose for the new set of variables. And further you can transform the ORed formula into a 6-SAT formula in polynomial time. Just apply deMorgan’s rules for distributing OR across AND.

Now the choice of a new set of variables allows us to give some resilient. If you fix one variable to the value of your choice, I can always just work with the other set of variables. Your manipulation doesn’t change the satisfiability of the ORed formula, because I’ve added all of this redundancy. So we took a 3-SAT formula and turned it into a 1-resilient 6-SAT formula.

The idea generalizes to the blowing up lemma, which says that you can measure the effects of a blowup no matter what you start with. More formally, if $ s$ is the number of copies of variables you make, $ k$ is the clause size of the starting formula $ \varphi$, and $ r$ is the resilience of $ \varphi$, then blowing up gives you an $ [(r+1)s – 1]$-resilient $ (sk)$-SAT formula. The argument is almost identical to the example above the resilience is more general. Specifically, if you fix fewer than $ (r+1)s$ variables, then the pigeonhole principle guarantees that one of the $ s$ copies of variables has at most $ r$ fixed values, and we can just work with that set of variables (i.e., this small part of the big ORed formula is satisfiable if $ \varphi$ was $ r$-resilient).

The shrinking down lemma is another trick that is similar to the reduction from $ k$-SAT to 3-SAT. There you take a clause like $ v \vee w \vee x \vee y \vee z$ and add new variables $ z_i$ to break up the clause in to clauses of size 3 as follows:

$ \displaystyle (v \vee w \vee z_1) \wedge (\neg z_1 \vee x \vee z_2) \wedge (\neg z_2 \vee y \vee z)$

These are equivalent because your choice of truth values for the $ z_i$ tell me which of these sub-clauses to look for a true literal of the old variables. I.e. if you choose $ z_1 = T, z_2 = F$ then you have to pick either $ y$ or $ z$ to be true. And it’s clear that if you’re willing to double the number of variables (a linear blowup) you can always get a $ k$-clause down to an AND of 3-clauses.

So the shrinking down reduction does the same thing, except we only split clauses in half. For a clause $ C$, call $ C[:k/2]$ the first half of a clause and $ C[k/2:]$ the second half (you can see how my Python training corrupts my notation preference). Then to shrink a clause $ C_i$ down from size $ k$ to size $ \lceil k/2 \rceil + 1$ (1 for the new variable), add a variable $ z_i$ and break $ C_i$ into

$ \displaystyle (C_i[:k/2] \vee z_i) \wedge (\neg z_i \vee C[k/2:])$

and just AND these together for all clauses. Call the original formula $ \varphi$ and the transformed one $ \psi$. The formulas are logically equivalent for the same reason that the $ k$-to-3-SAT reduction works, and it’s already in the right CNF form. So resilience is all we have to measure. The claim is that the resilience is $ q = \min(r, \lfloor k/2 \rfloor)$, where $ r$ is the resilience of $ \varphi$.

The reason for this is that if all the fixed variables are old variables (not $ z_i$), then nothing changes and the resilience of the original $ \phi$ keeps us safe. And each $ z_i$ we fix has no effect except to force us to satisfy a variable in one of the two halves. So there is this implication that if you fix a $ z_i$ you have to also fix a regular variable. Because we can’t guarantee anything if we fix more than $ r$ regular variables, we’d have to stop before fixing $ r$ of the $ z_i$. And because these new clauses have size $ k/2 + 1$, we can’t do this more than $ k/2$ times or else we risk ruining an entire clause. So this give the definition of $ q$. So this proves the shrinking down lemma.

Resilient SAT is always hard

The blowing up and shrinking down lemmas can be used to show that $ r$-resilient $ k$-SAT is NP-hard for all $ r < k$. What we do is reduce from 3-SAT to an $ r$-resilient $ k$-SAT instance in such a way that the 3-SAT formula is satisfiable if and only if the transformed formula is resiliently satisfiable.

What makes these two lemmas work together is that shrinking down shrinks the clause size just barely less than the resilience, and blowing up increases resilience just barely more than it increases clause size. So we can combine these together to climb from 3-SAT up to some high resilience and satisfiability, and then iteratively shrink down until we hit our target.

One might worry that it will take an exponential number of reductions (or a few reductions of exponential size) to get from 3-SAT to the $ (r,k)$ of our choice, but we have a construction that does it in at most four steps, with only a linear initial blowup from 3-SAT to $ r$-resilient $ 3(r+1)$-SAT. Then, to deal with the odd ceilings and floors in the shrinking down lemma, you have to find a suitable larger $ k$ to reduce to (by padding with useless variables, which cannot make the problem easier). And you choose this $ k$ so that you only need at most two applications of shrinking down to get to $ (k-1)$-resilient $ k$-SAT. Our preprint has the gory details (which has an inelegant part that is not worth writing here), but in the end you show that $ (k-1)$-resilient $ k$-SAT is hard, and since that’s the maximal amount of resilience before the problem becomes vacuously trivial, all smaller resilience values are also hard.

So how does this relate to coloring?

I’m happy about this result not just because it answers an open question I’m honestly curious about, but also because it shows that resilient coloring is more interesting. Basically this proves that satisfiability is so hard that no amount of resilience can make it easier in the worst case. But coloring has a gradient of difficulty. Once you get to order $ k^2$ resilience for $ k$-colorable graphs, the coloring problem can be solved efficiently by a greedy algorithm (and it’s not a vacuously empty class of graphs). Another thing on the side is that we use the hardness of resilient SAT to get the hardness results we have for coloring.

If you really want to stretch the implications, you might argue that this says something like “coloring is somewhat easier than SAT,” because we found a quantifiable axis along which SAT remains difficult while coloring crumbles. The caveat is that fixing colors of vertices is not exactly comparable to fixing values of truth assignments (since we are fixing lots of instances by fixing a variable), but at least it’s something concrete.

Coloring is still mostly open, and recently I’ve been going to talks where people are discussing startlingly similar ideas for things like Hamiltonian cycles. So that makes me happy.

Until next time!

A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$ \displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $ \wedge$ denotes a logical AND and the vee $ \vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $ x_i$ or the negation $ \overline{x_i}$, and connectives $ \wedge$ and $ \vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $ C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $ \vee$’s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $ G$, we’ll construct a set $ S_G$ of labeled truth assignments, and we’ll define the distribution $ D$ to be the uniform distribution over those truth assignments used in $ S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $ S_G$ exactly how we labeled them, and we set the allowed error $ \varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $ G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $ G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $ G$ with $ n$ nodes $ v_1, \dots, v_n$ and a set of $ m$ undirected edges $ E$, we construct a set of examples with positive labels $ S^+$ and one with negative examples $ S^-$. The examples are truth assignments to $ n$ variables, which we label $ x_1, \dots, x_n$, and we identify a truth assignment to the $ \left \{ 0,1 \right \}$-valued vector $ (x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $ S^+$ are simple: for each $ v_i$ add a truth assignment $ x_i = T, x_j = F$ for $ j \neq i$. I.e., the binary vector is $ (1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $ i$-th position.

The negative examples $ S^-$ come from the edges. For each edge $ (v_i, v_j) \in E$, we add the example with a zero in the $ i$-th and $ j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

PAC-reduction

Claim: $ G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $ \varphi$.

Again, consistent just means that $ \varphi$ is satisfied by every truth assignment in $ S^+$ and unsatisfied by every example in $ S^-$. Since we chose our distribution to be uniform over $ S^+ \cup S^-$, we don’t care what $ \varphi$ does elsewhere.

Indeed, if $ G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $ T_R$ be the AND of all the literals $ x_i$ for which vertex $ v_i$ is not red. For each such $ i$, the corresponding example in $ S^+$ will satisfy $ T_R$, because we put a zero in the $ i$-th position and ones everywhere else. Similarly, no example in $ S^-$ will make $ T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $ (v_1,v_2)$. Then the corresponding negative example is $ (0,0,1)$. Unless both $ v_1$ and $ v_2$ are colored red, one of $ x_1, x_2$ will have to be ANDed as part of $ T_R$. But the example has a zero for both $ x_1$ and $ x_2$, so $ T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $ T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $ \varphi$. We need to construct a three coloring of $ G$. It goes in largely the same way: label the clauses $ \varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $ v_i$ the color of the clause that is satisfied by the corresponding example in $ S^+$. There must be some clause that does this because $ \varphi$ is consistent with $ S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $ (v_i, v_j)$, and both $ v_i$ and $ v_j$ are colored, say, blue. Look at the clause $ T_B$: since $ v_i$ and $ v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $ T_B$ true. Since those two positive examples differ in both their $ i$-th and $ j$-th positions, $ T_B$ can’t have any of the literals $ x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $ T_B$ because it has 1’s everywhere except $ i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $ G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $ \delta < 1/2$ and $ \varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $ 1 / |S^+ \cup S^-|$, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $ C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $ C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $ C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $ \mathsf{C}$ over a set $ X$ is efficiently PAC-learnable using the hypothesis class $ \mathsf{H}$ if there exists an algorithm $ A(\varepsilon, \delta)$ with access to a query function for $ \mathsf{C}$ and runtime $ O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $ c \in \mathsf{C}$, all distributions $ D$ over $ X$, and all $ 0 < \delta , \varepsilon < 1/2$, the probability that $ A$ produces a hypothesis $ h \in \mathsf{H}$ with error at most $ \varepsilon$ is at least $ 1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!