Boolean Logic in Polynomials

Problem: Express a boolean logic formula using polynomials. I.e., if an input variable $ x$ is set to $ 0$, that is interpreted as false, while $ x=1$ is interpreted as true. The output of the polynomial should be 0 or 1 according to whether the formula is true or false as a whole.

Solution: You can do this using a single polynomial.

Illustrating with an example: the formula is $ \neg[(a \vee b) \wedge (\neg c \vee d)]$ also known as

not((a or b) and (not c or d))

The trick is to use multiplication for “and” and $ 1-x$ for “not.” So $ a \wedge b$ would be $ x_1 x_2$, and $ \neg z$ would be $ 1-z$. Indeed, if you have two binary variables $ x$ and $ y$ then $ xy$ is 1 precisely when both are 1, and zero when either variable is zero. Likewise, $ 1-x = 1$ if $ x$ is zero and zero if $ x$ is one.

Combine this with deMorgan’s rule to get any formula. $ a \vee b = \neg(\neg a \wedge \neg b)$ translates to $ 1 – (1-a)(1-b)$. For our example above,

$ \displaystyle f(x_1, x_2, x_3, x_4) = 1 – (1 – (1-a)(1-b))(1 – c(1-d))$

Which expands to

$ \displaystyle 1 – a – b + ab + (1-d)(ac + bc – abc)$

If you plug in $ a = 1, b = 0, c = 1, d = 0$ you get True in the original formula (because “not c or d” is False), and likewise the polynomial is

$ \displaystyle 1 – 1 – 0 + 0 + (1-0)(1 + 0 – 0) = 1$

You can verify the rest work yourself, using the following table as a guide:

0, 0, 0, 0 -> 1
0, 0, 0, 1 -> 1
0, 0, 1, 0 -> 1
0, 0, 1, 1 -> 1
0, 1, 0, 0 -> 0
0, 1, 0, 1 -> 0
0, 1, 1, 0 -> 1
0, 1, 1, 1 -> 0
1, 0, 0, 0 -> 0
1, 0, 0, 1 -> 0
1, 0, 1, 0 -> 1
1, 0, 1, 1 -> 0
1, 1, 0, 0 -> 0
1, 1, 0, 1 -> 0
1, 1, 1, 0 -> 1
1, 1, 1, 1 -> 0

Discussion: This trick is used all over CS theory to embed boolean logic within polynomials, and it makes the name “boolean algebra” obvious, because it’s just a subset of normal algebra.

Moreover, since boolean satisfiability—the problem of algorithmically determining if a boolean formula has a satisfying assignment (a choice of variables evaluating to true)—is NP-hard, this can be used to show certain problems relating to multivariable polynomials is also hard. For example, finding roots of multivariable polynomials (even if you knew nothing about algebraic geometry) is hard because you’d run into NP-hardness by simply considering the subset of polynomials coming from boolean formulas.

Here’s a more interesting example, related to the kinds of optimization problems that show up in modern machine learning. Say you want to optimize a polynomial $ f(x)$ subject to a set of quadratic equality constraints. This is NP-hard. Here’s why.

Let $ \varphi$ be a boolean formula, and $ f_\varphi$ its corresponding polynomial. First, each variable $ x_i$ used in the polynomial can be restricted to binary values via the constraint $ x_i(x_i – 1) = 0$.

You can even show NP-hardness if the target function to optimize is only quadratic. As an exercise, one can express the subset sum problem as a quadratic programming problem using similar choices for the constraints. According to this writeup you even express subset sum as a quadratic program with linear constraints.

The moral of the story is simply that multivariable polynomials can encode arbitrary boolean logic.

Zero-One Laws for Random Graphs

Last time we saw a number of properties of graphs, such as connectivity, where the probability that an Erdős–Rényi random graph $ G(n,p)$ satisfies the property is asymptotically either zero or one. And this zero or one depends on whether the parameter $ p$ is above or below a universal threshold (that depends only on $ n$ and the property in question).

To remind the reader, the Erdős–Rényi random “graph” $ G(n,p)$ is a distribution over graphs that you draw from by including each edge independently with probability $ p$. Last time we saw that the existence of an isolated vertex has a sharp threshold at $ (\log n) / n$, meaning if $ p$ is asymptotically smaller than the threshold there will certainly be isolated vertices, and if $ p$ is larger there will certainly be no isolated vertices. We also gave a laundry list of other properties with such thresholds.

One might want to study this phenomenon in general. Even if we might not be able to find all the thresholds we want for a given property, can we classify which properties have thresholds and which do not?

The answer turns out to be mostly yes! For large classes of properties, there are proofs that say things like, “either this property holds with probability tending to one, or it holds with probability tending to zero.” These are called “zero-one laws,” and they’re sort of meta theorems. We’ll see one such theorem in this post relating to constant edge-probabilities in random graphs, and we’ll remark on another at the end.

Sentences about graphs in first order logic

A zero-one law generally works by defining a class of properties, and then applying a generic first/second moment-type argument to every property in the class.

So first we define what kinds of properties we’ll discuss. We’ll pick a large class: anything that can be expressed in first-order logic in the language of graphs. That is, any finite logical statement that uses existential and universal quantifiers over variables, and whose only relation (test) is whether an edge exists between two vertices. We’ll call this test $ e(x,y)$. So you write some sentence $ P$ in this language, and you take a graph $ G$, and you can ask $ P(G) = 1$, whether the graph satisfies the sentence.

This seems like a really large class of properties, and it is, but let’s think carefully about what kinds of properties can be expressed this way. Clearly the existence of a triangle can be written this way, it’s just the sentence

$ \exists x,y,z : e(x,y) \wedge e(y,z) \wedge e(x,z)$

I’m using $ \wedge$ for AND, and $ \vee$ for OR, and $ \neg$ for NOT. Similarly, one can express the existence of a clique of size $ k$, or the existence of an independent set of size $ k$, or a path of a fixed length, or whether there is a vertex of maximal degree $ n-1$.

Here’s a question: can we write a formula which will be true for a graph if and only if it’s connected? Well such a formula seems like it would have to know about how many vertices there are in the graph, so it could say something like “for all $ x,y$ there is a path from $ x$ to $ y$.” It seems like you’d need a family of such formulas that grows with $ n$ to make anything work. But this isn’t a proof; the question remains whether there is some other tricky way to encode connectivity.

But as it turns out, connectivity is not a formula you can express in propositional logic. We won’t prove it here, but we will note at the end of the article that connectivity is in a different class of properties that you can prove has a similar zero-one law.

The zero-one law for first order logic

So the theorem about first-order expressible sentences is as follows.

Theorem: Let $ P$ be a property of graphs that can be expressed in the first order language of graphs (with the $ e(x,y)$ relation). Then for any constant $ p$, the probability that $ P$ holds in $ G(n,p)$ has a limit of zero or one as $ n \to \infty$.

Proof. We’ll prove the simpler case of $ p=1/2$, but the general case is analogous. Given such a graph $ G$ drawn from $ G(n,p)$, what we’ll do is define a countably infinite family of propositional formulas $ \varphi_{k,l}$, and argue that they form a sort of “basis” for all first-order sentences about graphs.

First let’s describe the $ \varphi_{k,l}$. For any $ k,l \in \mathbb{N}$, the sentence will assert that for every set of $ k$ vertices and every set of $ l$ vertices, there is some other vertex connected to the first $ k$ but not the last $ l$.

$ \displaystyle \varphi_{k,l} : \forall x_1, \dots, x_k, y_1, \dots, y_l \exists z : \\ e(z,x_1) \wedge \dots \wedge e(z,x_k) \wedge \neg e(z,y_1) \wedge \dots \wedge \neg e(z,y_l)$.

In other words, these formulas encapsulate every possible incidence pattern for a single vertex. It is a strange set of formulas, but they have a very nice property we’re about to get to. So for a fixed $ \varphi_{k,l}$, what is the probability that it’s false on $ n$ vertices? We want to give an upper bound and hence show that the formula is true with probability approaching 1. That is, we want to show that all the $ \varphi_{k,l}$ are true with probability tending to 1.

Computing the probability: we have $ \binom{n}{k} \binom{n-k}{l}$ possibilities to choose these sets, and the probability that some other fixed vertex $ z$ has the good connections is $ 2^{-(k+l)}$ so the probability $ z$ is not good is $ 1 – 2^{-(k+l)}$, and taking a product over all choices of $ z$ gives the probability that there is some bad vertex $ z$ with an exponent of $ (n – (k + l))$. Combining all this together gives an upper bound of $ \varphi_{k,l}$ being false of:

$ \displaystyle \binom{n}{k}\binom{n-k}{l} (1-2^{-k-1})^{n-k-l}$

And $ k, l$ are constant, so the left two terms are polynomials while the rightmost term is an exponentially small function, and this implies that the whole expression tends to zero, as desired.

Break from proof.

A bit of model theory

So what we’ve proved so far is that the probability of every formula of the form $ \varphi_{k,l}$ being satisfied in $ G(n,1/2)$ tends to 1.

Now look at the set of all such formulas

$ \displaystyle \Phi = \{ \varphi_{k,l} : k,l \in \mathbb{N} \}$

We ask: is there any graph which satisfies all of these formulas? Certainly it cannot be finite, because a finite graph would not be able to satisfy formulas with sufficiently large values of $ l, k > n$. But indeed, there is a countably infinite graph that works. It’s called the Rado graph, pictured below.

rado

The Rado graph has some really interesting properties, such as that it contains every finite and countably infinite graph as induced subgraphs. Basically this means, as far as countably infinite graphs go, it’s the big momma of all graphs. It’s the graph in a very concrete sense of the word. It satisfies all of the formulas in $ \Phi$, and in fact it’s uniquely determined by this, meaning that if any other countably infinite graph satisfies all the formulas in $ \Phi$, then that graph is isomorphic to the Rado graph.

But for our purposes (proving a zero-one law), there’s a better perspective than graph theory on this object. In the logic perspective, the set $ \Phi$ is called a theory, meaning a set of statements that you consider “axioms” in some logical system. And we’re asking whether there any model realizing the theory. That is, is there some logical system with a semantic interpretation (some mathematical object based on numbers, or sets, or whatever) that satisfies all the axioms?

A good analogy comes from the rational numbers, because they satisfy a similar property among all ordered sets. In fact, the rational numbers are the unique countable, ordered set with the property that it has no biggest/smallest element and is dense. That is, in the ordering there is always another element between any two elements you want. So the theorem says if you have two countable sets with these properties, then they are actually isomorphic as ordered sets, and they are isomorphic to the rational numbers.

So, while we won’t prove that the Rado graph is a model for our theory $ \Phi$, we will use that fact to great benefit. One consequence of having a theory with a model is that the theory is consistent, meaning it can’t imply any contradictions. Another fact is that this theory $ \Phi$ is complete. Completeness means that any formula or it’s negation is logically implied by the theory. Note these are syntactical implications (using standard rules of propositional logic), and have nothing to do with the model interpreting the theory.

The proof that $ \Phi$ is complete actually follows from the uniqueness of the Rado graph as the only countable model of $ \Phi$. Suppose the contrary, that $ \Phi$ is not consistent, then there has to be some formula $ \psi$ that is not provable, and it’s negation is also not provable, by starting from $ \Phi$. Now extend $ \Phi$ in two ways: by adding $ \psi$ and by adding $ \neg \psi$. Both of the new theories are still countable, and by a theorem from logic this means they both still have countable models. But both of these new models are also countable models of $ \Phi$, so they have to both be the Rado graph. But this is very embarrassing for them, because we assumed they disagree on the truth of $ \psi$.

So now we can go ahead and prove the zero-one law theorem.

Return to proof.

Given an arbitrary property $ \varphi \not \in \Psi$. Now either $ \varphi$ or it’s negation can be derived from $ \Phi$. Without loss of generality suppose it’s $ \varphi$. Take all the formulas from the theory you need to derive $ \varphi$, and note that since it is a proof in propositional logic you will only finitely many such $ \varphi_{k,l}$. Now look at the probabilities of the $ \varphi_{k,l}$: they are all true with probability tending to 1, so the implied statement of the proof of $ \varphi$ (i.e., $ \varphi$ itself) must also hold with probability tending to 1. And we’re done!

$ \square$

If you don’t like model theory, there is another “purely combinatorial” proof of the zero-one law using something called Ehrenfeucht–Fraïssé games. It is a bit longer, though.

Other zero-one laws

One might naturally ask two questions: what if your probability is not constant, and what other kinds of properties have zero-one laws? Both great questions.

For the first, there are some extra theorems. I’ll just describe one that has always seemed very strange to me. If your probability is of the form $ p = n^{-\alpha}$ but $ \alpha$ is irrational, then the zero-one law still holds! This is a theorem of Baldwin-Shelah-Spencer, and it really makes you wonder why irrational numbers would be so well behaved while rational numbers are not 🙂

For the second question, there is another theorem about monotone properties of graphs. Monotone properties come in two flavors, so called “increasing” and “decreasing.” I’ll describe increasing monotone properties and the decreasing counterpart should be obvious. A property is called monotone increasing if adding edges can never destroy the property. That is, with an empty graph you don’t have the property (or maybe you do), and as you start adding edges eventually you suddenly get the property, but then adding more edges can’t cause you to lose the property again. Good examples of this include connectivity, or the existence of a triangle.

So the theorem is that there is an identical zero-one law for monotone properties. Great!

It’s not so often that you get to see these neat applications of logic and model theory to graph theory and (by extension) computer science. But when you do get to apply them they seem very powerful and mysterious. I think it’s a good thing.

Until next time!

A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$ \displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $ \wedge$ denotes a logical AND and the vee $ \vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $ x_i$ or the negation $ \overline{x_i}$, and connectives $ \wedge$ and $ \vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $ C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $ \vee$’s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $ G$, we’ll construct a set $ S_G$ of labeled truth assignments, and we’ll define the distribution $ D$ to be the uniform distribution over those truth assignments used in $ S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $ S_G$ exactly how we labeled them, and we set the allowed error $ \varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $ G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $ G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $ G$ with $ n$ nodes $ v_1, \dots, v_n$ and a set of $ m$ undirected edges $ E$, we construct a set of examples with positive labels $ S^+$ and one with negative examples $ S^-$. The examples are truth assignments to $ n$ variables, which we label $ x_1, \dots, x_n$, and we identify a truth assignment to the $ \left \{ 0,1 \right \}$-valued vector $ (x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $ S^+$ are simple: for each $ v_i$ add a truth assignment $ x_i = T, x_j = F$ for $ j \neq i$. I.e., the binary vector is $ (1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $ i$-th position.

The negative examples $ S^-$ come from the edges. For each edge $ (v_i, v_j) \in E$, we add the example with a zero in the $ i$-th and $ j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

PAC-reduction

Claim: $ G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $ \varphi$.

Again, consistent just means that $ \varphi$ is satisfied by every truth assignment in $ S^+$ and unsatisfied by every example in $ S^-$. Since we chose our distribution to be uniform over $ S^+ \cup S^-$, we don’t care what $ \varphi$ does elsewhere.

Indeed, if $ G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $ T_R$ be the AND of all the literals $ x_i$ for which vertex $ v_i$ is not red. For each such $ i$, the corresponding example in $ S^+$ will satisfy $ T_R$, because we put a zero in the $ i$-th position and ones everywhere else. Similarly, no example in $ S^-$ will make $ T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $ (v_1,v_2)$. Then the corresponding negative example is $ (0,0,1)$. Unless both $ v_1$ and $ v_2$ are colored red, one of $ x_1, x_2$ will have to be ANDed as part of $ T_R$. But the example has a zero for both $ x_1$ and $ x_2$, so $ T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $ T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $ \varphi$. We need to construct a three coloring of $ G$. It goes in largely the same way: label the clauses $ \varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $ v_i$ the color of the clause that is satisfied by the corresponding example in $ S^+$. There must be some clause that does this because $ \varphi$ is consistent with $ S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $ (v_i, v_j)$, and both $ v_i$ and $ v_j$ are colored, say, blue. Look at the clause $ T_B$: since $ v_i$ and $ v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $ T_B$ true. Since those two positive examples differ in both their $ i$-th and $ j$-th positions, $ T_B$ can’t have any of the literals $ x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $ T_B$ because it has 1’s everywhere except $ i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $ G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $ \delta < 1/2$ and $ \varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $ 1 / |S^+ \cup S^-|$, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $ C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $ C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $ C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $ \mathsf{C}$ over a set $ X$ is efficiently PAC-learnable using the hypothesis class $ \mathsf{H}$ if there exists an algorithm $ A(\varepsilon, \delta)$ with access to a query function for $ \mathsf{C}$ and runtime $ O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $ c \in \mathsf{C}$, all distributions $ D$ over $ X$, and all $ 0 < \delta , \varepsilon < 1/2$, the probability that $ A$ produces a hypothesis $ h \in \mathsf{H}$ with error at most $ \varepsilon$ is at least $ 1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

Methods of Proof — Contrapositive

In this post we’ll cover the second of the “basic four” methods of proof: the contrapositive implication. We will build off our material from last time and start by defining functions on sets.

Functions as Sets

So far we have become comfortable with the definition of a set, but the most common way to use sets is to construct functions between them. As programmers we readily understand the nature of a function, but how can we define one mathematically? It turns out we can do it in terms of sets, but let us recall the desired properties of a function:

  • Every input must have an output.
  • Every input can only correspond to one output (the functions must be deterministic).

One might try at first to define a function in terms of subsets of size two. That is, if $ A, B$ are sets then a function $ f: A \to B$ would be completely specified by

$ \displaystyle \left \{ \left \{ x, y \right \} : x \in A, y \in B \right \}$

where to enforce those two bullets, we must impose the condition that every $ x \in A$ occurs in one and only one of those subsets. Notationally, we would say that $ y = f(x)$ means $ \left \{ x, y \right \}$ is a member of the function. Unfortunately, this definition fails miserably when $ A = B$, because we have no way to distinguish the input from the output.

To compensate for this, we introduce a new type of object called a tuple. A tuple is just an ordered list of elements, which we write using round brackets, e.g. $ (a,b,c,d,e)$.

As a quick aside, one can define ordered tuples in terms of sets. We will leave the reader to puzzle why this works, and generalize the example provided:

$ \displaystyle (a,b) = \left \{ a, \left \{ a, b \right \} \right \}$

And so a function $ f: A \to B$ is defined to be a list of ordered pairs where the first thing in the pair is an input and the second is an output:

$ \displaystyle f = \left \{ (x, y) : x \in A, y \in B \right \}$

Subject to the same conditions, that each $ x$ value from $ A$ must occur in one and only one pair. And again by way of notation we say $ y = f(x)$ if the pair $ (x,y)$ is a member of $ f$ as a set. Note that the concept of a function having “input and output” is just an interpretation. A function can be viewed independent of any computational ideas as just a set of pairs. Often enough we might not even know how to compute a function (or it might be provably uncomputable!), but we can still work with it abstractly.

It is also common to call functions “maps,” and to define “map” to mean a special kind of function (that is, with extra conditions) depending on the mathematical field one is working in. Even in other places on this blog, “map” might stand for a continuous function, or a homomorphism. Don’t worry if you don’t know these terms off hand; they are just special cases of functions as we’ve defined them here. For the purposes of this series on methods of proof, “function” and “map” and “mapping” mean the same thing: regular old functions on sets.

Injections

One of the most important and natural properties of a function is that of injectivity.

Definition: A function $ f: A \to B$ is an injection if whenever $ a \neq a’$ are distinct members of $ A$, then $ f(a) \neq f(a’)$. The adjectival version of the word injection is injective.

As a quick side note, it is often the convention for mathematicians to use a capital letter to denote a set, and a lower-case letter to denote a generic element of that set. Moreover, the apostrophe on the $ a’$ is called a prime (so $ a’$ is spoken, “a prime”), and it’s meant to denote a variation on the non-prime’d variable $ a$ in some way. In this case, the variation is that $ a’ \neq a$.

So even if we had not explicitly mentioned where the $ a, a’$ objects came from, the knowledgeable mathematician (which the reader is obviously becoming) would be reasonably certain that they come from $ A$. Similarly, if I were to lackadaisically present $ b$ out of nowhere, the reader would infer it must come from $ B$.

One simple and commonly used example of an injection is the so-called inclusion function. If $ A \subset B$ are sets, then there is a canonical function representing this subset relationship, namely the function $ i: A \to B$ defined by $ i(a) = a$. It should be clear that non-equal things get mapped to non-equal things, because the function doesn’t actually do anything except change perspective on where the elements are sitting: two nonequal things sitting in $ A$ are still nonequal in $ B$.

Another example is that of multiplication by two as a map on natural numbers. More rigorously, define $ f: \mathbb{N} \to \mathbb{N}$ by $ f(x) = 2x$. It is clear that whenever $ x \neq y$ as natural numbers then $ 2x \neq 2y$. For one, $ x, y$ must have differing prime factorizations, and so must $ 2x, 2y$ because we added the same prime factor of 2 to both numbers. Did you catch the quick proof by direct implication there? It was sneaky, but present.

Now the property of being an injection can be summed up by a very nice picture:

A picture example of an injective function.

A picture example of an injective function.

The arrows above represent the pairs $ (x,f(x))$, and the fact that no two arrows end in the same place makes this function an injection. Indeed, drawing pictures like this can give us clues about the true nature of a proposed fact. If the fact is false, it’s usually easy to draw a picture like this showing so. If it’s true, then the pictures will support it and hopefully make the proof obvious. We will see this in action in a bit (and perhaps we should expand upon it later with a post titled, “Methods of Proof — Proof by Picture”).

There is another, more subtle concept associated with injectivity, and this is where its name comes from. The word “inject” gives one the mental picture that we’re literally placing one set $ A$ inside another set $ B$ without changing the nature of $ A$. We are simply realizing it as being inside of $ B$, perhaps with different names for its elements. This interpretation becomes much clearer when one investigates sets with additional structure, such as groups, rings, or topological spaces. Here the word “injective mapping” much more literally means placing one thing inside another without changing the former’s structure in any way except for relabeling.

In any case, mathematicians have the bad (but time-saving) habit of implicitly identifying a set with its image under an injective mapping. That is, if $ f :A \to B$ is an injective function, then one can view $ A$ as the same thing as $ f(A) \subset B$. That is, they have the same elements except that $ f$ renames the elements of $ A$ as elements of $ B$. The abuse comes in when they start saying $ A \subset B$ even when this is not strictly the case.

Here is an example of this abuse that many programmers commit without perhaps noticing it. Suppose $ X$ is the set of all colors that can be displayed on a computer (as an abstract set; the elements are “this particular green,” “that particular pinkish mauve”). Now let $ Y$ be the set of all finite hexadecimal numbers. Then there is an obvious injective map from $ X \to Y$ sending each color to its 6-digit hex representation. The lazy mathematician would say “Well, then, we might as well say $ X \subset Y$, for this is the obvious way to view $ X$ as a set of hexadecimal numbers.” Of course there are other ways (try to think of one, and then try to find an infinite family of them!), but the point is that this is the only way that anyone really uses, and that the other ways are all just “natural relabelings” of this way.

The precise way to formulate this claim is as follows, and it holds for arbitrary sets and arbitrary injective functions. If $ g, g’: X \to Y$ are two such ways to inject $ X$ inside of $ Y$, then there is a function $ h: Y \to Y$ such that the composition $ hg$ is precisely the map $ g’$. If this is mysterious, we have some methods the reader can use to understand it more fully: give examples for simplified versions (what if there were only three colors?), draw pictures of “generic looking” set maps, and attempt a proof by direct implication.

Proof by Contrapositive

Often times in mathematics we will come across a statement we want to prove that looks like this:

If X does not have property A, then Y does not have property B.

Indeed, we already have: to prove a function $ f: X \to Y$ is injective we must prove:

If x is not equal to y, then f(x) is not equal to f(y).

A proof by direct implication can be quite difficult because the statement gives us very little to work with. If we assume that $ X$ does not have property $ A$, then we have nothing to grasp and jump-start our proof. The main (and in this author’s opinion, the only) benefit of a proof by contrapositive is that one can turn such a statement into a constructive one. That is, we can write “p implies q” as “not q implies not p” to get the equivalent claim:

If Y has property B then X has property A.

This rewriting is called the “contrapositive form” of the original statement. It’s not only easier to parse, but also probably easier to prove because we have something to grasp at from the beginning.

To the beginning mathematician, it may not be obvious that “if p then q” is equivalent to “if not q then not p” as logical statements. To show that they are requires a small detour into the idea of a “truth table.”

In particular, we have to specify what it means for “if p then q” to be true or false as a whole. There are four possibilities: p can be true or false, and q can be true or false. We can write all of these possibilities in a table.

p  q
T  T
T  F
F  T
F  F

If we were to complete this table for “if p then q,” we’d have to specify exactly which of the four cases correspond to the statement being true. Of course, if the p part is true and the q part is true, then “p implies q” should also be true. We have seen this already in proof by direct implication. Next, if p is true and q is false, then it certainly cannot be the case that truth of p implies the truth of q. So this would be a false statement. Our truth table so far looks like

p  q  p->q
T  T   T
T  F   F
F  T   ?
F  F   ?

The next question is what to do if the premise p of “if p then q” is false. Should the statement as a whole be true or false? Rather then enter a belated philosophical discussion, we will zealously define an implication to be true if its hypothesis is false. This is a well-accepted idea in mathematics called vacuous truth. And although it seems to make awkward statements true (like “if 2 is odd then 1 = 0”), it is rarely a confounding issue (and more often forms the punchline of a few good math jokes). So we can complete our truth table as follows

p q  p->q
T T   T
T F   F
F T   T
F F   T

Now here’s where contraposition comes into play. If we’re interested in determining when “not q implies not p” is true, we can add these to the truth table as extra columns:

p  q  p->q  not q   not p   not q -> not p
T  T   T      F       F           T
T  F   F      T       F           F
F  T   T      F       T           T
F  F   T      T       T           T

As we can see, the two columns corresponding to “p implies q” and “not q implies not p” assume precisely the same truth values in all possible scenarios. In other words, the two statements are logically equivalent.

And so our proof technique for contrapositive becomes: rewrite the statement in its contrapositive form, and proceed to prove it by direct implication.

Examples and Exercises

Our first example will be completely straightforward and require nothing but algebra. Let’s show that the function $ f(x) = 7x – 4$ is injective. Contrapositively, we want to prove that if $ f(x) = f(x’)$ then $ x = x’$. Assuming the hypothesis, we start by supposing $ 7x – 4 = 7x’ – 4$. Applying algebra, we get $ 7x = 7x’$, and dividing by 7 shows that $x = x’$ as desired. So $ f$ is injective.

This example is important because if we tried to prove it directly, we might make the mistake of assuming algebra works with $ \neq$ the same way it does with equality. In fact, many of the things we take for granted about equality fail with inequality (for instance, if $ a \neq b$ and $ b \neq c$ it need not be the case that $ a \neq c$). The contrapositive method allows us to use our algebraic skills in a straightforward way.

Next let’s prove that the composition of two injective functions is injective. That is, if $ f: X \to Y$ and $ g: Y \to Z$ are injective functions, then the composition $ gf : X \to Z$  defined by $ gf(x) = g(f(x))$ is injective.

In particular, we want to prove that if $ x \neq x’$ then $ g(f(x)) \neq g(f(x’))$. Contrapositively, this is the same as proving that if $ g(f(x)) = g(f(x’))$ then $ x=x’$. Well by the fact that $ g$ is injective, we know that (again contrapositively) whenever $ g(y) = g(y’)$ then $ y = y’$, so it must be that $ f(x) = f(x’)$. But by the same reasoning $ f$ is injective and hence $ x = x’$. This proves the statement.

This was a nice symbolic proof, but we can see the same fact in a picturesque form as well:

A composition of two injections is an injection.

A composition of two injections is an injection.

If we maintain that any two arrows in the diagram can’t have the same head, then following two paths starting at different points in $ X$ will never land us at the same place in $ Z$. Since $ f$ is injective we have to travel to different places in $ Y$, and since $ g$ is injective we have to travel to different places in $ Z$. Unfortunately, this proof cannot replace the formal one above, but it can help us understand it from a different perspective (which can often make or break a mathematical idea).

Expanding upon this idea we give the reader a challenge: Let $ A, B, C$ be finite sets of the same size. Prove or disprove that if $ f: A \to B$ and $ g: B \to C$ are (arbitrary) functions, and if the composition $ gf$ is injective, then both of $ f, g$ must be injective.

Another exercise which has a nice contrapositive proof: prove that if $ A,B$ are finite sets and $ f:A \to B$ is an injection, then $ A$ has at most as many elements as $ B$. This one is particularly susceptible to a “picture proof” like the one above. Although the formal the formal name for the fact one uses to prove this is the pigeonhole principleit’s really just a simple observation.

Aside from inventing similar exercises with numbers (e.g., if $ ab$ is odd then $ a$ is odd or $ b$ is odd), this is all there is to the contrapositive method. It’s just a direct proof disguised behind a fact about truth tables. Of course, as is usual in more advanced mathematical literature, authors will seldom announce the use of contraposition. The reader just has to be watchful enough to notice it.

Though we haven’t talked about either the real numbers $ \mathbb{R}$ nor proofs of existence or impossibility, we can still pose this interesting question: is there an injective function from $ \mathbb{R} \to \mathbb{N}$? In truth there is not, but as of yet we don’t have the proof technique required to show it. This will be our next topic in the series: the proof by contradiction.

Until then!