# Zero-One Laws for Random Graphs

Last time we saw a number of properties of graphs, such as connectivity, where the probability that an Erdős–Rényi random graph $G(n,p)$ satisfies the property is asymptotically either zero or one. And this zero or one depends on whether the parameter $p$ is above or below a universal threshold (that depends only on $n$ and the property in question).

To remind the reader, the Erdős–Rényi random “graph” $G(n,p)$ is a distribution over graphs that you draw from by including each edge independently with probability $p$. Last time we saw that the existence of an isolated vertex has a sharp threshold at $(\log n) / n$, meaning that if $p$ is asymptotically smaller than the threshold there will be isolated vertices with probability tending to one, and if $p$ is asymptotically larger there will be no isolated vertices with probability tending to one. We also gave a laundry list of other properties with such thresholds.
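As a quick illustration of the sampling process just described, here is a minimal Python sketch. The names `sample_gnp` and `has_isolated_vertex` are made up for this example, and vertices are labeled $0, \dots, n-1$.

```python
import random
from itertools import combinations

def sample_gnp(n, p, rng=None):
    """Draw one graph from G(n, p), returned as a set of edges (u, v) with u < v."""
    rng = rng or random.Random()
    # Each of the n-choose-2 possible edges is included independently with probability p.
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

def has_isolated_vertex(n, edges):
    """True if some vertex in 0..n-1 has no incident edge."""
    touched = {v for edge in edges for v in edge}
    return len(touched) < n
```

Averaging `has_isolated_vertex` over many samples, for $p$ well below and well above $(\log n)/n$, is a cheap way to watch the threshold show up empirically.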

One might want to study this phenomenon in general. Even if we might not be able to find all the thresholds we want for a given property, can we classify which properties have thresholds and which do not?

The answer turns out to be mostly yes! For large classes of properties, there are proofs that say things like, “either this property holds with probability tending to one, or it holds with probability tending to zero.” These are called “zero-one laws,” and they’re sort of meta theorems. We’ll see one such theorem in this post relating to constant edge-probabilities in random graphs, and we’ll remark on another at the end.

## Sentences about graphs in first order logic

A zero-one law generally works by defining a class of properties, and then applying a generic first/second moment-type argument to every property in the class.

So first we define what kinds of properties we’ll discuss. We’ll pick a large class: anything that can be expressed in first-order logic in the language of graphs. That is, any finite logical statement that uses existential and universal quantifiers over variables (which range over vertices), and whose only relation (test) is whether an edge exists between two vertices. We’ll call this test $e(x,y)$. So you write some sentence $P$ in this language, you take a graph $G$, and you can ask whether $P(G) = 1$, that is, whether the graph satisfies the sentence.

This seems like a really large class of properties, and it is, but let’s think carefully about what kinds of properties can be expressed this way. Clearly the existence of a triangle can be written this way; it’s just the sentence

$\exists x,y,z : e(x,y) \wedge e(y,z) \wedge e(x,z)$

I’m using $\wedge$ for AND, and $\vee$ for OR, and $\neg$ for NOT. Similarly, one can express the existence of a clique of size $k$, or the existence of an independent set of size $k$, or a path of a fixed length, or whether there is a vertex of maximal degree $n-1$.
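To make “a graph satisfies a sentence” concrete, here is a small brute-force sketch in Python that evaluates the triangle sentence on a finite graph. The helper names `edge_relation` and `has_triangle` are hypothetical, and the sketch assumes simple undirected graphs (no self-loops).

```python
from itertools import product

def edge_relation(edges):
    """Build the test e(x, y) from a collection of undirected edges."""
    sym = {frozenset(pair) for pair in edges}
    return lambda x, y: x != y and frozenset((x, y)) in sym

def has_triangle(vertices, e):
    """Evaluate  exists x, y, z : e(x,y) and e(y,z) and e(x,z)  by brute force."""
    return any(e(x, y) and e(y, z) and e(x, z)
               for x, y, z in product(vertices, repeat=3))
```

Each existential quantifier becomes one loop over the vertex set, so a sentence with a fixed number of quantifiers can be checked in time polynomial in $n$.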

Here’s a question: can we write a formula which will be true for a graph if and only if it’s connected? Well such a formula seems like it would have to know about how many vertices there are in the graph, so it could say something like “for all $x,y$ there is a path from $x$ to $y$.” It seems like you’d need a family of such formulas that grows with $n$ to make anything work. But this isn’t a proof; the question remains whether there is some other tricky way to encode connectivity.

But as it turns out, connectivity is not a property you can express in first-order logic. We won’t prove it here, but we will note at the end of the article that connectivity is in a different class of properties that you can prove has a similar zero-one law.

## The zero-one law for first order logic

So the theorem about first-order expressible sentences is as follows.

Theorem: Let $P$ be a property of graphs that can be expressed in the first order language of graphs (with the $e(x,y)$ relation). Then for any constant $p$, the probability that $P$ holds in $G(n,p)$ has a limit of zero or one as $n \to \infty$.

Proof. We’ll prove the simpler case of $p=1/2$, but the general case is analogous. Given a graph $G$ drawn from $G(n,p)$, what we’ll do is define a countably infinite family of first-order sentences $\varphi_{k,l}$, and argue that they form a sort of “basis” for all first-order sentences about graphs.

First let’s describe the $\varphi_{k,l}$. For any $k,l \in \mathbb{N}$, the sentence will assert that for every set of $k$ vertices and every disjoint set of $l$ vertices, there is some other vertex connected to the first $k$ but not the last $l$.

$\displaystyle \varphi_{k,l} : \forall x_1, \dots, x_k, y_1, \dots, y_l \exists z : \\ e(z,x_1) \wedge \dots \wedge e(z,x_k) \wedge \neg e(z,y_1) \wedge \dots \wedge \neg e(z,y_l)$.

In other words, these formulas encapsulate every possible incidence pattern for a single vertex. It is a strange set of formulas, but they have a very nice property we’re about to get to. So for a fixed $\varphi_{k,l}$, what is the probability that it’s false on $n$ vertices? We want to give an upper bound and hence show that the formula is true with probability approaching 1. That is, we want to show that all the $\varphi_{k,l}$ are true with probability tending to 1.

Computing the probability: we have $\binom{n}{k} \binom{n-k}{l}$ ways to choose these two sets. The probability that some other fixed vertex $z$ has the good connections is $2^{-(k+l)}$, so the probability that $z$ is not good is $1 - 2^{-(k+l)}$. The remaining $n - (k + l)$ choices of $z$ are independent, so taking a product over all of them gives the probability that every $z$ is bad, which is where the exponent of $n - (k + l)$ comes from. Combining all this with a union bound over the choices of the two sets gives an upper bound on the probability of $\varphi_{k,l}$ being false of:

$\displaystyle \binom{n}{k}\binom{n-k}{l} (1-2^{-(k+l)})^{n-k-l}$

And $k, l$ are constant, so the left two terms are polynomials while the rightmost term is an exponentially small function, and this implies that the whole expression tends to zero, as desired.
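To get a feel for how the exponential factor eventually beats the polynomial ones, here is a short numerical sketch of the bound for $p = 1/2$, namely $\binom{n}{k}\binom{n-k}{l}(1 - 2^{-(k+l)})^{n-k-l}$. The function name `failure_bound` is an assumption of this sketch.

```python
from math import comb

def failure_bound(n, k, l):
    """Union bound on the probability that phi_{k,l} fails in G(n, 1/2)."""
    choices = comb(n, k) * comb(n - k, l)                  # polynomial in n
    all_bad = (1 - 2.0 ** -(k + l)) ** (n - k - l)         # exponentially small in n
    return choices * all_bad

for n in (200, 500, 1000, 2000, 4000):
    print(n, failure_bound(n, 2, 2))
```

For small $n$ the bound can exceed 1 and says nothing, but once $n$ is large enough it collapses toward zero very quickly.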

Break from proof.

## A bit of model theory

So what we’ve proved so far is that the probability of every formula of the form $\varphi_{k,l}$ being satisfied in $G(n,1/2)$ tends to 1.

Now look at the set of all such formulas

$\displaystyle \Phi = \{ \varphi_{k,l} : k,l \in \mathbb{N} \}$

We ask: is there any graph which satisfies all of these formulas? Certainly it cannot be finite, because a finite graph on $n$ vertices would not be able to satisfy formulas with sufficiently large values of $l, k > n$. But indeed, there is a countably infinite graph that works. It’s called the Rado graph.

The Rado graph has some really interesting properties, such as that it contains every finite and countably infinite graph as induced subgraphs. Basically this means, as far as countably infinite graphs go, it’s the big momma of all graphs. It’s the graph in a very concrete sense of the word. It satisfies all of the formulas in $\Phi$, and in fact it’s uniquely determined by this, meaning that if any other countably infinite graph satisfies all the formulas in $\Phi$, then that graph is isomorphic to the Rado graph.
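For the curious, one standard explicit description of the Rado graph (stated here as an assumption, not something this post proves or relies on) takes the natural numbers as vertices and joins $x < y$ exactly when bit $x$ of the binary expansion of $y$ is 1. Under that description, the sketch below builds the witness $z$ that a formula $\varphi_{k,l}$ asks for. The names `rado_adjacent` and `witness` are made up for this example.

```python
def rado_adjacent(x, y):
    """Adjacency in the bit-predicate model: for x < y, x ~ y iff bit x of y is 1."""
    if x == y:
        return False
    lo, hi = min(x, y), max(x, y)
    return ((hi >> lo) & 1) == 1

def witness(xs, ys):
    """Return a vertex z outside xs and ys that is adjacent to everything in xs
    and to nothing in ys. Assumes xs and ys are disjoint finite sets of naturals."""
    xs, ys = set(xs), set(ys)
    m = max(xs | ys, default=0) + 1
    # Set bit x for each x in xs, plus one high bit m so that z exceeds every
    # vertex in xs and ys (and so cannot be one of them).
    return sum(1 << x for x in xs) + (1 << m)
```

Because $z$ is larger than every vertex in both sets, adjacency is decided purely by the bits of $z$, which we chose by hand.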

But for our purposes (proving a zero-one law), there’s a better perspective than graph theory on this object. In the logic perspective, the set $\Phi$ is called a theory, meaning a set of statements that you consider “axioms” in some logical system. And we’re asking whether there is any model realizing the theory. That is, is there some logical system with a semantic interpretation (some mathematical object based on numbers, or sets, or whatever) that satisfies all the axioms?

A good analogy comes from the rational numbers, because they satisfy a similar property among all ordered sets. In fact, the rational numbers are the unique countable, ordered set with the property that it has no biggest/smallest element and is dense. That is, in the ordering there is always another element between any two elements you want. So the theorem says if you have two countable sets with these properties, then they are actually isomorphic as ordered sets, and they are isomorphic to the rational numbers.

So, while we won’t prove that the Rado graph is a model for our theory $\Phi$, we will use that fact to great benefit. One consequence of having a theory with a model is that the theory is consistent, meaning it can’t imply any contradictions. Another fact is that this theory $\Phi$ is complete. Completeness means that any formula or its negation is logically implied by the theory. Note these are syntactical implications (using standard rules of first-order logic), and have nothing to do with the model interpreting the theory.

The proof that $\Phi$ is complete actually follows from the uniqueness of the Rado graph as the only countable model of $\Phi$. Suppose to the contrary that $\Phi$ is not complete. Then there has to be some formula $\psi$ that is not provable, and whose negation is also not provable, starting from $\Phi$. Now extend $\Phi$ in two ways: by adding $\psi$ and by adding $\neg \psi$. Both of the new theories are still consistent (because neither $\psi$ nor $\neg \psi$ was provable from $\Phi$) and still countable, and by a theorem from logic this means they both still have countable models. But both of these new models are also countable models of $\Phi$, so they have to both be the Rado graph. But this is very embarrassing for them, because we assumed they disagree on the truth of $\psi$.

So now we can go ahead and prove the zero-one law theorem.

Take an arbitrary first-order property $\varphi \not \in \Phi$. Either $\varphi$ or its negation can be derived from $\Phi$. Without loss of generality suppose it’s $\varphi$. Take all the formulas from the theory you need to derive $\varphi$, and note that since a proof in first-order logic is finite, you will only need finitely many such $\varphi_{k,l}$. Now look at the probabilities of the $\varphi_{k,l}$: each is true with probability tending to 1, and since there are only finitely many of them, they are all simultaneously true with probability tending to 1 (by a union bound on the failure probabilities). So the implied statement of the proof of $\varphi$ (i.e., $\varphi$ itself) must also hold with probability tending to 1. And we’re done!

$\square$

If you don’t like model theory, there is another “purely combinatorial” proof of the zero-one law using something called Ehrenfeucht–Fraïssé games. It is a bit longer, though.

## Other zero-one laws

One might naturally ask two questions: what if your probability is not constant, and what other kinds of properties have zero-one laws? Both great questions.

For the first, there are some extra theorems. I’ll just describe one that has always seemed very strange to me. If your probability is of the form $p = n^{-\alpha}$ but $\alpha$ is irrational, then the zero-one law still holds! This is a theorem of Baldwin-Shelah-Spencer, and it really makes you wonder why irrational numbers would be so well behaved while rational numbers are not :)

For the second question, there is another theorem about monotone properties of graphs. Monotone properties come in two flavors, so called “increasing” and “decreasing.” I’ll describe increasing monotone properties and the decreasing counterpart should be obvious. A property is called monotone increasing if adding edges can never destroy the property. That is, with an empty graph you don’t have the property (or maybe you do), and as you start adding edges eventually you suddenly get the property, but then adding more edges can’t cause you to lose the property again. Good examples of this include connectivity, or the existence of a triangle.

So the theorem is that there is an identical zero-one law for monotone properties. Great!

It’s not so often that you get to see these neat applications of logic and model theory to graph theory and (by extension) computer science. But when you do get to apply them they seem very powerful and mysterious. I think it’s a good thing.

Until next time!

# A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

## 3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $\wedge$ denotes a logical AND and the vee $\vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $x_i$ or the negation $\overline{x_i}$, and connectives $\wedge$ and $\vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.
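To make the definition concrete, here is one possible way to represent and evaluate a DNF formula in Python. The encoding is an assumption of this sketch: a term is a list of pairs `(i, want)`, where `(i, 1)` stands for the literal $x_i$ and `(i, 0)` for $\overline{x_i}$, and variables are indexed from 0.

```python
def eval_dnf(terms, assignment):
    """Evaluate an OR of terms, where each term is an AND of literals.

    terms: list of terms; each term is a list of (index, want) pairs.
    assignment: tuple of 0/1 values, one per variable.
    """
    return any(all(assignment[i] == want for i, want in term) for term in terms)

# (x_0 AND NOT x_1) OR (NOT x_0 AND x_1) OR (x_2): a 3-term DNF on three variables.
example_formula = [[(0, 1), (1, 0)], [(0, 0), (1, 1)], [(2, 1)]]
```

A term may contain any number of literals; only the number of terms is fixed at three.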

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $\vee$’s over the $\wedge$’s in a 3-term DNF formula using the distributive law, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $G$, we’ll construct a set $S_G$ of labeled truth assignments, and we’ll define the distribution $D$ to be the uniform distribution over those truth assignments used in $S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $S_G$ exactly how we labeled them, and we set the allowed error $\varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

## The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $G$ with $n$ nodes $v_1, \dots, v_n$ and a set of $m$ undirected edges $E$, we construct a set of examples with positive labels $S^+$ and one with negative labels $S^-$. The examples are truth assignments to $n$ variables, which we label $x_1, \dots, x_n$, and we identify a truth assignment with the $\left \{ 0,1 \right \}$-valued vector $(x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $S^+$ are simple: for each $v_i$ add a truth assignment $x_i = F, x_j = T$ for $j \neq i$. I.e., the binary vector is $(1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $i$-th position.

The negative examples $S^-$ come from the edges. For each edge $(v_i, v_j) \in E$, we add the example with a zero in the $i$-th and $j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:
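A minimal Python sketch of this construction follows. It indexes vertices from 0 rather than 1, and the function name `build_examples` is made up for illustration.

```python
def build_examples(n, edges):
    """Return (positives, negatives) for a graph on vertices 0..n-1.

    Positive example for vertex i: all ones except a zero in position i.
    Negative example for edge (i, j): all ones except zeros in positions i and j.
    """
    positives = [tuple(0 if t == i else 1 for t in range(n)) for i in range(n)]
    negatives = [tuple(0 if t in (i, j) else 1 for t in range(n)) for i, j in edges]
    return positives, negatives

# A path on three vertices: 0 - 1 - 2.
pos, neg = build_examples(3, [(0, 1), (1, 2)])
```

The construction produces $n + m$ examples, each of length $n$, so its size is polynomial in the size of the graph.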

Claim: $G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $\varphi$.

Again, consistent just means that $\varphi$ is satisfied by every truth assignment in $S^+$ and unsatisfied by every example in $S^-$. Since we chose our distribution to be uniform over $S^+ \cup S^-$, we don’t care what $\varphi$ does elsewhere.

Indeed, if $G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $T_R$ be the AND of all the literals $x_i$ for which vertex $v_i$ is not red. For each red vertex $v_i$, the corresponding example in $S^+$ will satisfy $T_R$, because we put a zero in the $i$-th position and ones everywhere else, and $x_i$ does not appear in $T_R$. Similarly, no example in $S^-$ will make $T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $(v_1,v_2)$. Then the corresponding negative example is $(0,0,1)$. Unless both $v_1$ and $v_2$ are colored red, one of $x_1, x_2$ will have to be ANDed as part of $T_R$. But the example has a zero for both $x_1$ and $x_2$, so $T_R$ would not be satisfied.

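Do the same thing for blue and yellow, and OR the three terms together to get $T_R \vee T_B \vee T_Y$. Every vertex receives some color, so every positive example satisfies one of the terms, and since the argument for negative examples is symmetric in the colors, we have a consistent 3-term DNF.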
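This direction of the Claim can be checked mechanically on small graphs. In the sketch below (vertices indexed from 0, all function names hypothetical), a term is stored as the set of indices $i$ whose literal $x_i$ is ANDed into it, which is enough because the terms built from a coloring contain no negated literals.

```python
def dnf_from_coloring(coloring, colors=("R", "B", "Y")):
    """coloring[i] is the color of vertex i. The term for color c is the AND of
    x_i over all vertices i that are NOT colored c."""
    n = len(coloring)
    return [{i for i in range(n) if coloring[i] != c} for c in colors]

def satisfies(terms, example):
    """True if the example makes at least one term true."""
    return any(all(example[i] == 1 for i in term) for term in terms)

def is_consistent(terms, n, edges):
    """Check the formula against the positive and negative examples of the graph."""
    positives = [tuple(0 if t == i else 1 for t in range(n)) for i in range(n)]
    negatives = [tuple(0 if t in (i, j) else 1 for t in range(n)) for i, j in edges]
    return (all(satisfies(terms, ex) for ex in positives)
            and not any(satisfies(terms, ex) for ex in negatives))

triangle = [(0, 1), (1, 2), (0, 2)]
print(is_consistent(dnf_from_coloring(["R", "B", "Y"]), 3, triangle))  # proper coloring
print(is_consistent(dnf_from_coloring(["R", "R", "Y"]), 3, triangle))  # edge (0, 1) is monochromatic
```

The second call uses an improper coloring, and the negative example for the monochromatic edge ends up satisfying $T_R$, which is exactly the failure the argument above rules out for valid colorings.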

On the other hand, say there is a consistent 3-term DNF $\varphi$. We need to construct a three coloring of $G$. It goes in largely the same way: label the clauses $\varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $v_i$ the color of the clause that is satisfied by the corresponding example in $S^+$. There must be some clause that does this because $\varphi$ is consistent with $S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $(v_i, v_j)$, and both $v_i$ and $v_j$ are colored, say, blue. Look at the clause $T_B$: since $v_i$ and $v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $T_B$ true. Since those two positive examples differ in both their $i$-th and $j$-th positions, $T_B$ can’t have any of the literals $x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $T_B$ because it has 1’s everywhere except $i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.
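The converse direction is just as mechanical: read a color off of whichever term each positive example satisfies. The sketch below assumes terms are lists of `(index, want)` pairs (so negated literals are allowed), indexes vertices from 0, and uses the made-up name `coloring_from_dnf`.

```python
def term_true(term, example):
    """A term is an AND of literals; (i, 1) means x_i and (i, 0) means NOT x_i."""
    return all(example[i] == want for i, want in term)

def coloring_from_dnf(terms, n):
    """Color vertex i with the index of the first term satisfied by its positive example."""
    coloring = []
    for i in range(n):
        example = tuple(0 if t == i else 1 for t in range(n))
        color = next((c for c, term in enumerate(terms) if term_true(term, example)), None)
        if color is None:
            raise ValueError("formula is not consistent with the positive examples")
        coloring.append(color)
    return coloring

# T_R = x_1 AND x_2, T_B = x_0 AND x_2, T_Y = x_0 AND x_1 (0-indexed variables).
terms = [[(1, 1), (2, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 1)]]
coloring = coloring_from_dnf(terms, 3)
```

Picking the first satisfied term is one arbitrary way to break ties when several terms are satisfied; the argument above shows any choice works.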

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $\delta < 1/2$ and $\varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $1 / |S^+ \cup S^-|$, which is too big.

## Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can handle 3-CNFs by introducing one new variable for each possible clause of three literals, which blows up the number of variables only polynomially. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.
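One common way to realize the “smallest consistent hypothesis” idea for conjunctions is an elimination procedure: start with the AND of all $2n$ literals and delete every literal that some positive example falsifies. The sketch below is an illustration under that assumption, not the formal proof left as an exercise; the names `learn_conjunction` and `predict` are hypothetical.

```python
def learn_conjunction(n, positive_examples):
    """Return the set of literals kept after seeing the positive examples.

    A literal is a pair (i, b): (i, 1) means x_i and (i, 0) means NOT x_i.
    """
    literals = {(i, b) for i in range(n) for b in (0, 1)}
    for ex in positive_examples:
        # A positive example must satisfy the target, so any literal it
        # falsifies cannot be part of the target conjunction.
        literals -= {(i, 1 - ex[i]) for i in range(n)}
    return literals

def predict(literals, example):
    """The hypothesis is the AND of all remaining literals."""
    return all(example[i] == b for i, b in literals)

# Target concept (assumed for the demo): x_0 AND NOT x_2 on three variables.
hypothesis = learn_conjunction(3, [(1, 0, 0), (1, 1, 0)])
```

The hypothesis only ever errs by saying “no” too often, since it always contains every literal of the target; negative examples are never needed.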

The important thing to note is that a concept class $C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $\mathsf{C}$ over a set $X$ is efficiently PAC-learnable using the hypothesis class $\mathsf{H}$ if there exists an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all $0 < \delta , \varepsilon < 1/2$, the probability that $A$ produces a hypothesis $h \in \mathsf{H}$ with error at most $\varepsilon$ is at least $1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

# Methods of Proof — Contrapositive

In this post we’ll cover the second of the “basic four” methods of proof: the contrapositive implication. We will build off our material from last time and start by defining functions on sets.

## Functions as Sets

So far we have become comfortable with the definition of a set, but the most common way to use sets is to construct functions between them. As programmers we readily understand the nature of a function, but how can we define one mathematically? It turns out we can do it in terms of sets, but let us recall the desired properties of a function:

• Every input must have an output.
• Every input can only correspond to one output (the functions must be deterministic).

One might try at first to define a function in terms of subsets of size two. That is, if $A, B$ are sets then a function $f: A \to B$ would be completely specified by

$\displaystyle \left \{ \left \{ x, y \right \} : x \in A, y \in B \right \}$

where to enforce those two bullets, we must impose the condition that every $x \in A$ occurs in one and only one of those subsets. Notationally, we would say that $y = f(x)$ means $\left \{ x, y \right \}$ is a member of the function. Unfortunately, this definition fails miserably when $A = B$, because we have no way to distinguish the input from the output.

To compensate for this, we introduce a new type of object called a tuple. A tuple is just an ordered list of elements, which we write using round brackets, e.g. $(a,b,c,d,e)$.

As a quick aside, one can define ordered tuples in terms of sets. We will leave the reader to puzzle why this works, and generalize the example provided:

$\displaystyle (a,b) = \left \{ a, \left \{ a, b \right \} \right \}$

And so a function $f: A \to B$ is defined to be a set of ordered pairs where the first thing in the pair is an input and the second is an output:

$\displaystyle f = \left \{ (x, y) : x \in A, y \in B \right \}$

Subject to the same conditions, that each $x$ value from $A$ must occur in one and only one pair. And again by way of notation we say $y = f(x)$ if the pair $(x,y)$ is a member of $f$ as a set. Note that the concept of a function having “input and output” is just an interpretation. A function can be viewed independent of any computational ideas as just a set of pairs. Often enough we might not even know how to compute a function (or it might be provably uncomputable!), but we can still work with it abstractly.
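For finite sets, the two conditions can be checked directly. Here is a small Python sketch, with the function name `is_function` chosen just for this example, that tests whether a set of pairs is a function from $A$ to $B$.

```python
from collections import Counter

def is_function(pairs, A, B):
    """pairs: a set of (x, y) tuples; A, B: finite sets."""
    # Every pair must go from A to B.
    well_typed = all(x in A and y in B for x, y in pairs)
    # Every x in A must appear as the first coordinate of exactly one pair.
    counts = Counter(x for x, _ in pairs)
    return well_typed and all(counts[x] == 1 for x in A)

A, B = {1, 2, 3}, {"a", "b"}
f = {(1, "a"), (2, "a"), (3, "b")}
```

Note that nothing here computes anything: `f` is simply a set, and “applying” it means looking up the pair whose first coordinate is the input.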

It is also common to call functions “maps,” and to define “map” to mean a special kind of function (that is, with extra conditions) depending on the mathematical field one is working in. Even in other places on this blog, “map” might stand for a continuous function, or a homomorphism. Don’t worry if you don’t know these terms off hand; they are just special cases of functions as we’ve defined them here. For the purposes of this series on methods of proof, “function” and “map” and “mapping” mean the same thing: regular old functions on sets.

## Injections

One of the most important and natural properties of a function is that of injectivity.

Definition: A function $f: A \to B$ is an injection if whenever $a \neq a'$ are distinct members of $A$, then $f(a) \neq f(a')$. The adjectival version of the word injection is injective.

As a quick side note, it is often the convention for mathematicians to use a capital letter to denote a set, and a lower-case letter to denote a generic element of that set. Moreover, the apostrophe on the $a'$ is called a prime (so $a'$ is spoken, “a prime”), and it’s meant to denote a variation on the non-prime’d variable $a$ in some way. In this case, the variation is that $a' \neq a$.

So even if we had not explicitly mentioned where the $a, a'$ objects came from, the knowledgeable mathematician (which the reader is obviously becoming) would be reasonably certain that they come from $A$. Similarly, if I were to lackadaisically present $b$ out of nowhere, the reader would infer it must come from $B$.

One simple and commonly used example of an injection is the so-called inclusion function. If $A \subset B$ are sets, then there is a canonical function representing this subset relationship, namely the function $i: A \to B$ defined by $i(a) = a$. It should be clear that non-equal things get mapped to non-equal things, because the function doesn’t actually do anything except change perspective on where the elements are sitting: two nonequal things sitting in $A$ are still nonequal in $B$.

Another example is that of multiplication by two as a map on natural numbers. More rigorously, define $f: \mathbb{N} \to \mathbb{N}$ by $f(x) = 2x$. It is clear that whenever $x \neq y$ as natural numbers then $2x \neq 2y$. For one, $x, y$ must have differing prime factorizations, and so must $2x, 2y$ because we added the same prime factor of 2 to both numbers. Did you catch the quick proof by direct implication there? It was sneaky, but present.
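On a finite domain, injectivity can be tested by brute force. The sketch below uses the hypothetical name `is_injective`; note that it only checks the finitely many inputs it is given, so it can refute injectivity of a map on $\mathbb{N}$ but never prove it.

```python
def is_injective(f, domain):
    """f: a Python callable; domain: a finite iterable of inputs."""
    seen = {}  # maps each output to the input that produced it
    for a in domain:
        b = f(a)
        if b in seen and seen[b] != a:
            return False  # two distinct inputs share an output
        seen[b] = a
    return True

print(is_injective(lambda x: 2 * x, range(100)))   # doubling on 0..99
print(is_injective(lambda x: x * x, range(-3, 4)))  # squaring on -3..3
```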

Now the property of being an injection can be summed up by a very nice picture:

A picture example of an injective function.

The arrows above represent the pairs $(x,f(x))$, and the fact that no two arrows end in the same place makes this function an injection. Indeed, drawing pictures like this can give us clues about the true nature of a proposed fact. If the fact is false, it’s usually easy to draw a picture like this showing so. If it’s true, then the pictures will support it and hopefully make the proof obvious. We will see this in action in a bit (and perhaps we should expand upon it later with a post titled, “Methods of Proof — Proof by Picture”).

There is another, more subtle concept associated with injectivity, and this is where its name comes from. The word “inject” gives one the mental picture that we’re literally placing one set $A$ inside another set $B$ without changing the nature of $A$. We are simply realizing it as being inside of $B$, perhaps with different names for its elements. This interpretation becomes much clearer when one investigates sets with additional structure, such as groups, rings, or topological spaces. Here the word “injective mapping” much more literally means placing one thing inside another without changing the former’s structure in any way except for relabeling.

In any case, mathematicians have the bad (but time-saving) habit of implicitly identifying a set with its image under an injective mapping. That is, if $f :A \to B$ is an injective function, then one can view $A$ as the same thing as $f(A) \subset B$. That is, they have the same elements except that $f$ renames the elements of $A$ as elements of $B$. The abuse comes in when they start saying $A \subset B$ even when this is not strictly the case.

Here is an example of this abuse that many programmers commit without perhaps noticing it. Suppose $X$ is the set of all colors that can be displayed on a computer (as an abstract set; the elements are “this particular green,” “that particular pinkish mauve”). Now let $Y$ be the set of all finite hexadecimal numbers. Then there is an obvious injective map from $X \to Y$ sending each color to its 6-digit hex representation. The lazy mathematician would say “Well, then, we might as well say $X \subset Y$, for this is the obvious way to view $X$ as a set of hexadecimal numbers.” Of course there are other ways (try to think of one, and then try to find an infinite family of them!), but the point is that this is the only way that anyone really uses, and that the other ways are all just “natural relabelings” of this way.

The precise way to formulate this claim is as follows, and it holds for arbitrary sets and arbitrary injective functions. If $g, g': X \to Y$ are two such ways to inject $X$ inside of $Y$, then there is a function $h: Y \to Y$ such that the composition $hg$ is precisely the map $g'$. If this is mysterious, we have some methods the reader can use to understand it more fully: give examples for simplified versions (what if there were only three colors?), draw pictures of “generic looking” set maps, and attempt a proof by direct implication.
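Following the suggestion to try a simplified version, here is a minimal Python sketch with only three colors. The function name `relabel`, the dictionaries, and the particular hex strings are all made up for illustration; the sketch only shows one way such an $h$ can be built, and it is no substitute for the proof.

```python
def relabel(g, g_prime, X, Y):
    """Build h: Y -> Y with h(g(x)) == g_prime(x) for every x in X.

    g and g_prime are dicts representing injective maps X -> Y.
    """
    # Because g is injective, each y in the image of g has exactly one preimage.
    inverse_of_g = {g[x]: x for x in X}
    h = {}
    for y in Y:
        if y in inverse_of_g:
            h[y] = g_prime[inverse_of_g[y]]
        else:
            h[y] = y  # outside the image of g, any value would do
    return h


# Three hypothetical colors, injected into a small set of hex strings in two ways.
X = ["red", "green", "blue"]
Y = ["ff0000", "00ff00", "0000ff", "000000"]
g = {"red": "ff0000", "green": "00ff00", "blue": "0000ff"}
g_prime = {"red": "0000ff", "green": "000000", "blue": "ff0000"}

h = relabel(g, g_prime, X, Y)
composed = {x: h[g[x]] for x in X}
print(composed == g_prime)
```

Notice where injectivity of $g$ is used: without it, the dictionary `inverse_of_g` would have to send one element of $Y$ to two different places.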

## Proof by Contrapositive

Oftentimes in mathematics we will come across a statement we want to prove that looks like this:

If X does not have property A, then Y does not have property B.

Indeed, we already have: to prove a function $f: X \to Y$ is injective we must prove:

If x is not equal to y, then f(x) is not equal to f(y).

A proof by direct implication can be quite difficult because the statement gives us very little to work with. If we assume that $X$ does not have property $A$, then we have nothing to grasp and jump-start our proof. The main (and in this author’s opinion, the only) benefit of a proof by contrapositive is that one can turn such a statement into a constructive one. That is, we can write “p implies q” as “not q implies not p” to get the equivalent claim:

If Y has property B then X has property A.

This rewriting is called the “contrapositive form” of the original statement. It’s not only easier to parse, but also probably easier to prove because we have something to grasp at from the beginning.

To the beginning mathematician, it may not be obvious that “if p then q” is equivalent to “if not q then not p” as logical statements. To show that they are requires a small detour into the idea of a “truth table.”

In particular, we have to specify what it means for “if p then q” to be true or false as a whole. There are four possibilities: p can be true or false, and q can be true or false. We can write all of these possibilities in a table.

| p | q |
|---|---|
| T | T |
| T | F |
| F | T |
| F | F |

If we were to complete this table for “if p then q,” we’d have to specify exactly which of the four cases correspond to the statement being true. Of course, if the p part is true and the q part is true, then “p implies q” should also be true. We have seen this already in proof by direct implication. Next, if p is true and q is false, then it certainly cannot be the case that truth of p implies the truth of q. So this would be a false statement. Our truth table so far looks like

| p | q | p -> q |
|---|---|--------|
| T | T | T |
| T | F | F |
| F | T | ? |
| F | F | ? |

The next question is what to do if the premise p of “if p then q” is false. Should the statement as a whole be true or false? Rather than enter a belated philosophical discussion, we will zealously define an implication to be true if its hypothesis is false. This is a well-accepted idea in mathematics called vacuous truth. And although it seems to make awkward statements true (like “if 2 is odd then 1 = 0”), it is rarely a confounding issue (and more often forms the punchline of a few good math jokes). So we can complete our truth table as follows

| p | q | p -> q |
|---|---|--------|
| T | T | T |
| T | F | F |
| F | T | T |
| F | F | T |

Now here’s where contraposition comes into play. If we’re interested in determining when “not q implies not p” is true, we can add these to the truth table as extra columns:

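| p | q | p -> q | not q | not p | not q -> not p |
|---|---|--------|-------|-------|----------------|
| T | T | T | F | F | T |
| T | F | F | T | F | F |
| F | T | T | F | T | T |
| F | F | T | T | T | T |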

As we can see, the two columns corresponding to “p implies q” and “not q implies not p” assume precisely the same truth values in all possible scenarios. In other words, the two statements are logically equivalent.
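For readers who would rather let a machine fill in the table, here is a small Python sketch that enumerates the four cases. The helper name `implies` is our own; it encodes the convention adopted above that an implication with a false hypothesis is true.

```python
from itertools import product


def implies(p, q):
    # An implication is false only when the hypothesis is true
    # and the conclusion is false.
    return (not p) or q


rows = []
for p, q in product([True, False], repeat=2):
    direct = implies(p, q)
    contrapositive = implies(not q, not p)
    rows.append((p, q, direct, contrapositive))
    print(p, q, direct, contrapositive)

equivalent = all(direct == contra for _, _, direct, contra in rows)
print("equivalent:", equivalent)
```

The last two values on every printed row agree, which is the same observation the table makes.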

And so our proof technique for contrapositive becomes: rewrite the statement in its contrapositive form, and proceed to prove it by direct implication.

## Examples and Exercises

Our first example will be completely straightforward and require nothing but algebra. Let’s show that the function $f(x) = 7x - 4$ is injective. Contrapositively, we want to prove that if $f(x) = f(x')$ then $x = x'$. Assuming the hypothesis, we have $7x - 4 = 7x' - 4$. Adding 4 to both sides gives $7x = 7x'$, and dividing by 7 shows that $x = x'$ as desired. So $f$ is injective.

This example is important because if we tried to prove it directly, we might make the mistake of assuming algebra works with $\neq$ the same way it does with equality. In fact, many of the things we take for granted about equality fail with inequality (for instance, if $a \neq b$ and $b \neq c$ it need not be the case that $a \neq c$). The contrapositive method allows us to use our algebraic skills in a straightforward way.

Next let’s prove that the composition of two injective functions is injective. That is, if $f: X \to Y$ and $g: Y \to Z$ are injective functions, then the composition $gf : X \to Z$  defined by $gf(x) = g(f(x))$ is injective.

In particular, we want to prove that if $x \neq x'$ then $g(f(x)) \neq g(f(x'))$. Contrapositively, this is the same as proving that if $g(f(x)) = g(f(x'))$ then $x=x'$. Well by the fact that $g$ is injective, we know that (again contrapositively) whenever $g(y) = g(y')$ then $y = y'$, so it must be that $f(x) = f(x')$. But by the same reasoning $f$ is injective and hence $x = x'$. This proves the statement.

This was a nice symbolic proof, but we can see the same fact in a picturesque form as well:

A composition of two injections is an injection.

If we maintain that any two arrows in the diagram can’t have the same head, then following two paths starting at different points in $X$ will never land us at the same place in $Z$. Since $f$ is injective we have to travel to different places in $Y$, and since $g$ is injective we have to travel to different places in $Z$. Unfortunately, this proof cannot replace the formal one above, but it can help us understand it from a different perspective (which can often make or break a mathematical idea).
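The picture can also be mimicked on small finite sets. In the following Python sketch, functions are encoded as dictionaries and the helper names `is_injective` and `compose` are our own. Checking one example is of course not a proof, but it is a cheap way to test a claim before trying to prove it.

```python
def is_injective(f):
    """A dict-encoded function is injective when no two keys share a value."""
    return len(set(f.values())) == len(f)


def compose(g, f):
    """Return the composition gf, defined by gf(x) = g(f(x))."""
    return {x: g[f[x]] for x in f}


f = {1: "a", 2: "b", 3: "c"}              # an injection X -> Y
g = {"a": 10, "b": 20, "c": 30, "d": 40}  # an injection Y -> Z

gf = compose(g, f)
print(is_injective(f), is_injective(g), is_injective(gf))
```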

Expanding upon this idea we give the reader a challenge: Let $A, B, C$ be finite sets of the same size. Prove or disprove that if $f: A \to B$ and $g: B \to C$ are (arbitrary) functions, and if the composition $gf$ is injective, then both of $f, g$ must be injective.

Another exercise which has a nice contrapositive proof: prove that if $A,B$ are finite sets and $f:A \to B$ is an injection, then $A$ has at most as many elements as $B$. This one is particularly susceptible to a “picture proof” like the one above. Although the formal name for the fact one uses to prove this is the pigeonhole principle, it’s really just a simple observation.
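Before attempting the proof, one can gather evidence by brute force. The Python sketch below (the generator name `injections` is hypothetical) lists every injective function between two small sets and counts them. It does not replace the exercise; it only shows what the claim predicts for one small case.

```python
from itertools import product


def injections(A, B):
    """Yield every injective function A -> B, encoded as a dict."""
    for values in product(B, repeat=len(A)):
        if len(set(values)) == len(values):  # no two arrows share a head
            yield dict(zip(A, values))


# Three elements mapped into four, and three elements mapped into two.
count_into_larger = sum(1 for _ in injections("abc", range(4)))
count_into_smaller = sum(1 for _ in injections("abc", range(2)))
print(count_into_larger, count_into_smaller)  # 24 0
```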

Aside from inventing similar exercises with numbers (e.g., if $ab$ is odd then $a$ is odd or $b$ is odd), this is all there is to the contrapositive method. It’s just a direct proof disguised behind a fact about truth tables. Of course, as is usual in more advanced mathematical literature, authors will seldom announce the use of contraposition. The reader just has to be watchful enough to notice it.

Though we haven’t talked about either the real numbers $\mathbb{R}$ or proofs of existence or impossibility, we can still pose this interesting question: is there an injective function from $\mathbb{R} \to \mathbb{N}$? In truth there is not, but as of yet we don’t have the proof technique required to show it. This will be our next topic in the series: the proof by contradiction.

Until then!

# Why there is no Hitchhiker’s Guide to Mathematics for Programmers

For those who aren’t regular readers: as a followup to this post, there are four posts detailing the basic four methods of proof, with intentions to detail some more advanced proof techniques in the future. You can find them on this blog’s primers page.

## Do you really want to get better at mathematics?

Remember when you first learned how to program? I do. I spent two years experimenting with Java programs on my own in high school. Those two years collectively contain the worst and most embarrassing code I have ever written. My programs absolutely reeked of programming no-nos. Hundred-line functions and even thousand-line classes, magic numbers, unreachable blocks of code, ridiculous code comments, a complete disregard for sensible object orientation, negligence of nearly all logic, and type-coercion that would make your skin crawl. I committed every naive mistake in the book, and for all my obvious shortcomings I considered myself a hot-shot programmer! At least I was learning a lot, and I was a hot-shot programmer in a crowd of high-school students interested in game programming.

Even after my first exposure and my commitment to get a programming degree in college, it was another year before I knew what a stack frame or a register was, two more before I was anywhere near competent with a terminal, three more before I fully appreciated functional programming, and to this day I still have an irrational fear of networking and systems programming (the first time I manually edited the call stack I couldn’t stop shivering with apprehension and disgust at what I was doing).

I just made this function call return to a *different* place than where it was called from.

In a class on C++ programming I was programming a Checkers game, and my task at the moment was to generate a list of all possible jump-moves that could be made on a given board. This naturally involved a depth-first search and a couple of recursive function calls, and once I had something I was pleased with, I compiled it and ran it on my first non-trivial example. Lo and behold (even having followed test-driven development!), I was hit hard in the face by a segmentation fault. It took hundreds of test cases and more than twenty hours of confusion before I found the error: I was passing a reference when I should have been passing a pointer. This was not a bug in syntax or semantics (I understood pointers and references well enough) but a design error. And the aggravating part, as most programmers know, was that the fix required the change of about 4 characters. Twenty hours of work for four characters! Once I begrudgingly verified it worked (of course it worked, it was so obvious in hindsight), I promptly took the rest of the day off to play Starcraft.

Of course, as every code-savvy reader will agree, all of this drama is part of the process of becoming a strong programmer. One must study the topics incrementally, make plentiful mistakes and learn from them, and spend uncountably many hours in a state of stuporous befuddlement before one can be considered an experienced coder. This gives rise to all sorts of programmer culture, Unix jokes, and reverence for the masters of C that make the programming community so lovely to be a part of. It’s like a secret club where you know all the handshakes. And should you forget one, a crafty use of awk and sed will suffice.

“Semicolons of Fury” was the name of my programming team in the ACM collegiate programming contest. We placed Cal Poly third in the Southern California Regionals, and in my opinion our success was due in large part to the dynamics of our team. I (center, in blue) have since gotten a more stylish haircut.

Now imagine someone comes along and says,

“I’m really interested in learning to code, but I don’t plan to write any programs and I absolutely abhor tracing program execution. I just want to use applications that others have written, like Chrome and iTunes.”

You would laugh at them! And the first thing that would pass through your mind is either, “This person would give up programming after the first twenty minutes,” or “I would be doing the world a favor by preventing this person from ever writing a program. This person belongs in some other profession.” This lies in stark opposition to the common chorus that everyone should learn programming. After all, it’s a constructive way to think about problem solving and a highly employable skill. In today’s increasingly technological world, it literally pays to know your computer better than a web browser. (Ironically, I’m writing this on my Chromebook, but in my defense it has a terminal with ssh. Perhaps more ironically, all of my real work is done with paper and pencil.)

Unfortunately this sentiment is mirrored among most programmers who claim to be interested in mathematics. Mathematics is fascinating and useful and doing it makes you smarter and better at problem solving. But a lot of programmers think they want to do mathematics, and they either don’t know what “doing mathematics” means, or they don’t really mean they want to do mathematics. The appropriate translation of the above quote for mathematics is:

“Mathematics is useful and I want to be better at it, but I won’t write any original proofs and I absolutely abhor reading other people’s proofs. I just want to use the theorems others have proved, like Fermat’s Last Theorem and the undecidability of the Halting Problem.”

Of course no non-mathematician is really going to understand the current proof of Fermat’s Last Theorem, just as no fledgling programmer is going to attempt to write a (quality) web browser. The point is that the sentiment is in the wrong place. Mathematics is cousin to programming in terms of the learning curve, obscure culture, and the amount of time one spends confused. And mathematics is as much about writing proofs as software development is about writing programs (it’s not everything, but without it you can’t do anything). Honestly, it sounds ridiculously obvious to say it directly like this, but the fact remains that people feel like they can understand the content of mathematics without being able to write or read proofs.

I want to devote the rest of this post to exploring some of the reasons why this misconception exists. My main argument is that the reasons have to do more with the culture of mathematics than the actual difficulty of the subject. Unfortunately as of the time of this writing I don’t have a proposed “solution.” All I can claim is that there is a problem: programmers can have mistaken views of what mathematics involves. I don’t propose a way to make mathematics easier for programmers, although I do try to make the content on my blog as clear as possible (within reason). I honestly do believe that the struggle and confusion builds mathematical character, just as the arduous bug-hunt builds programming character. If you want to be good at mathematics, there is no other way.

All I want to do with this article is to detail why mathematics can be so hard for beginners, to explain a few of the secret handshakes, and hopefully to bring an outsider a step closer to becoming an insider. And I want to stress that this is not a call for all programmers to learn mathematics. Far from it! I just happen to notice that, for good reason, the proportion of programmers who are interested in mathematics is larger than in most professions. And as a member of both communities, I want to shed light on why mathematics can be difficult for an otherwise smart and motivated software engineer.

So read on, and welcome to the community.

## Travelling far and wide

Perhaps one of the most prominent objections to devoting a lot of time to mathematics is that it can be years before you ever apply mathematics to writing programs. On one hand, this is an extremely valid concern. If you love writing programs and designing software, then mathematics is nothing more than a tool to help you write better programs.

But on the other hand, the very nature of mathematics is what makes it so applicable, and the only way to experience nature is to ditch the city entirely. Indeed, I provide an extended example of this in my journalesque post on introducing graph theory to high school students: the point of the whole exercise is to filter out the worldly details and distill the problem into a pristine mathematical form. Only then can we see its beauty and wide applicability.

Here is a more concrete example. Suppose you were trying to encrypt the contents of a message so that nobody could read it even if they intercepted the message in transit. Your first ideas would doubtless be the same as those of our civilization’s past: substitution ciphers, Vigenère ciphers, the Enigma machine, etc. Regardless of what method you come up with, your first thought would most certainly not be, “prime numbers so big they’ll make your pants fall down.” Of course, the majority of encryption methods today rely on very deep facts (or rather, conjectures) about prime numbers, elliptic curves, and other mathematical objects (“group presentations so complicated they’ll orient your Möbius band,” anyone?). But it took hundreds of years of number theory to get there, and countless deviations into other fields and dead-ends. It’s not that the methods themselves are particularly complicated, but the way they’re often presented (and this is unavoidable if you’re interested in new mathematical breakthroughs) is in the form of classical mathematical literature.

Of course there are other examples much closer to contemporary fashionable programming techniques. One such example is boosting. While we have yet to investigate boosting on this blog, the basic idea is that one can combine a bunch of algorithms which perform just barely better than 50% accuracy, and collectively they will be arbitrarily close to perfect. In a field dominated by practical applications, this result is purely the product of mathematical analysis.

And of course boosting in turn relies on the mathematics of probability theory, which in turn relies on set theory and measure theory, which in turn relies on real analysis, and so on. One could get lost for a lifetime in this mathematical landscape! And indeed, the best way to get a good view of it all is to start at the bottom. To learn mathematics from scratch. The working programmer simply doesn’t have time for that.

## What is it really, that people have such a hard time learning?

Most of the complaints about mathematics come understandably from notation and abstraction. And while I’ll have more to say on that below, I’m fairly certain that the main obstacle is a lack of familiarity with the basic methods of proof.

While methods of proof are semantical by nature, in practice they form a scaffolding for all of mathematics, and as such one could better characterize them as syntactical. I’m talking, of course, about the four basics: direct implication, proof by contradiction, contrapositive, and induction. These are the loops, if statements, pointers, and structs of rigorous argument, and there is simply no way to understand the mathematics without a native fluency in this language.

The “Math Major Sloth” is fluent. Why aren’t you?

So much of mathematics is built up by chaining together a multitude of absolutely trivial statements which are amenable to proof by the basic four. I’m not kidding when I say they are absolutely trivial. A professor of mine once said,

If it’s not completely trivial, then it’s probably not true.

I can’t agree more with this statement. Of course, there are many sophisticated proofs in mathematics, but an overwhelming majority of (very important) facts fall in the trivial category. That being said, trivial can be sometimes relative to one’s familiarity with a subject, but that doesn’t make the sentiment any less right. Drawing up a shopping list is trivial once you’re comfortable with a pencil and paper and you know how to write (and you know what the words mean). There are certainly works of writing that require a lot more than what it takes to write a shopping list. Likewise, when we say something is trivial in mathematics, it’s because there’s no content to the proof outside of using definitions and a typical application of the basic four methods of proof. This is the “holding a pencil” part of writing a shopping list.

And as you probably know, there are many many more methods of proof than just the basic four. Proof by construction, by exhaustion, case analysis, and even picture proofs have a place in all fields of mathematics. More relevantly for programmers, there are algorithm termination proofs, probabilistic proofs, loop invariants to design and monitor, and the ubiquitous NP-hardness proofs (I’m talking about you, Travelling Salesman Problem!). There are many books dedicated to showcasing such techniques, and rightly so. Clever proofs are what mathematicians strive for above all else, and once a clever proof is discovered, the immediate first step is to try to turn it into a general method for proving other facts. Fully fleshing out such a process (over many years, showcasing many applications and extensions) is what makes one a world-class mathematician.

An entire book dedicated to the probabilistic method of proof, invented by Paul Erdős and sown into the soil of mathematics over the course of his lifetime.

Another difficulty faced by programmers new to mathematics is the inability to check your proof absolutely. With a program, you can always write test cases and run them to ensure they all pass. If your tests are solid and plentiful, the computer will catch your mistakes and you can go fix them.

There is no corresponding “proof checker” for mathematics. There is no compiler to tell you that it’s nonsensical to construct the set of all sets, or that it’s a type error to quotient a set by something that’s not an equivalence relation. The only way to get feedback is to seek out other people who do mathematics and ask their opinion. In solo, mathematics involves a lot of backtracking, revising mistaken assumptions, and stretching an idea to its breaking point to see that it didn’t even make sense to begin with. This is “bug hunting” in mathematics, and it can often completely destroy a proof and make one start over from scratch. It feels like writing a few hundred lines of code only to have the final program run “rm -rf *” on the directory containing it. It can be really. really. depressing.

It is an interesting pedagogical question in my mind whether there is a way to introduce proofs and the language of mature mathematics in a way that stays within a stone’s throw of computer programs. It seems like a worthwhile effort, but I can’t think of anyone who has sought to replace a classical mathematics education entirely with one based on computation.

## Mathematical syntax

Another major reason programmers are unwilling to give mathematics an honest effort is the culture of mathematical syntax: it’s ambiguous, and there’s usually nobody around to explain it to you. Let me start with an example of why this is not a problem in programming. Let’s say we’re reading a Python program and we see an expression like this:

foo[2]

The nature of (most) programming languages dictates that there are a small number of ways to interpret what’s going on in here:

1. foo could be a list/tuple, and we’re accessing the third element in it.
2. foo could be a dictionary, and we’re looking up the value associated with the key 2.
3. foo could be a string, and we’re extracting the third character.
4. foo could be a custom-defined object, whose `__getitem__` method is defined somewhere else and we can look there to see exactly what it does.

There are probably other times this notation can occur (although I’d be surprised if number 4 didn’t by default capture all possible uses), but the point is that any programmer reading this program knows enough to intuit that square brackets mean “accessing an item inside foo with identifier 2.” Part of the reason that programs can be very easy to read is precisely that someone had to write a parser for the programming language, and so they had to literally enumerate all possible uses of any expression form.
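To see the four readings side by side, here is a short Python sketch. The `Polynomial` class is a hypothetical example of a custom object invented for this illustration; the other three cases use built-in types.

```python
class Polynomial:
    """A hypothetical class: coefficients stored lowest degree first."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)

    def __getitem__(self, degree):
        # foo[2] on a Polynomial means "the coefficient of x^2";
        # degrees beyond the stored ones have coefficient 0.
        if 0 <= degree < len(self.coefficients):
            return self.coefficients[degree]
        return 0


examples = [
    [10, 20, 30],           # list: third element
    {2: "two"},             # dict: value for the key 2
    "abc",                  # string: third character
    Polynomial([1, 0, 7]),  # custom object: whatever __getitem__ says
]
results = [foo[2] for foo in examples]
print(results)  # [30, 'two', 'c', 7]
```

In every case the reader can find out what the brackets mean by looking in one well-defined place, which is exactly what mathematical notation does not offer.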

The other extreme is the syntax of mathematics. The daunting fact is that there is no bound to what mathematical notation can represent, and much of mathematical notation is inherently ad hoc. For instance, if you’re reading a math paper and you come across an expression that looks like this

$\delta_i^j$

The possibilities of what this could represent are literally endless. Just to give the unmathematical reader a taste: $\delta_i$ could be an entry of a sequence of numbers of which we’re taking arithmetic $j^\textup{th}$ powers. The use of the letter delta could signify a slightly nonstandard way to write the Kronecker delta function, for which $\delta_i^j$ is one precisely when $i=j$ and zero otherwise. The superscript $j$ could represent dimension. Indeed, I’m currently writing an article in which I use $\delta^k_n$ to represent $k$-dimensional simplex numbers, specifically because I’m relating the numbers to geometric objects called simplices, and the letter for those is  a capital $\Delta$. The fact is that using notation in a slightly non-standard way does not invalidate a proof in the way that it can easily invalidate a program’s correctness.
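To make two of these readings concrete, the Kronecker delta interpretation is the definition

$\displaystyle \delta_i^j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$

so that $\delta_3^2 = 0$. Under the sequence-and-powers interpretation, on the other hand, a hypothetical sequence with $\delta_3 = 2$ would give $\delta_3^2 = 4$. The symbols on the page are identical, and only the surrounding context tells you which one is meant.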

What’s worse is that once mathematicians get comfortable with a particular notation, they will often “naturally extend” or even silently drop things like subscripts and assume their reader understands and agrees with the convenience! For example, here is a common difficulty that beginners face in reading math that involves use of the summation operator. Say that I have a finite set of numbers whose sum I’m interested in. The most rigorous way to express this is not far off from programming:

Let $S = \left \{ x_1, \dots, x_n \right \}$ be a finite set of things. Then their sum is finite:

$\displaystyle \sum_{i=1}^n x_i$

The programmer would say “great!” Assuming I know what “+” means for these things, I can start by adding $x_1 + x_2$, add the result to $x_3$, and keep going until I have the whole sum. This is really just a left fold of the plus operator over the list $S$.
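In Python, for instance, this left fold might be sketched with `functools.reduce`. The numbers in `S` are placeholder values standing in for $x_1, \dots, x_n$.

```python
from functools import reduce
from operator import add

S = [3, 1, 4, 1, 5]  # hypothetical values for x_1, ..., x_n

# reduce applies add from the left: ((((3 + 1) + 4) + 1) + 5)
total = reduce(add, S)
print(total)  # 14
```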

But for mathematicians, the notation is far more flexible. For instance, I could say

Let $S$ be finite. Then $\sum_{x \in S} x$ is finite.

Things are now more vague. We need to remember that the $\in$ symbol means “in.” We have to realize that the strict syntax of having an iteration variable $i$ is no longer in effect. Moreover, the order in which the things are summed (which for a left fold is strictly prescribed) is arbitrary. If you asked any mathematician, they’d say “well of course it’s arbitrary, in an abelian group addition is commutative so the order doesn’t matter.” But realize, this is yet another fact that the reader must be aware of to be comfortable with the expression.
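A programmer can at least check the order-independence claim on a small example. This Python sketch, with an arbitrarily chosen set of integers, sums the elements in every possible order and collects the distinct totals.

```python
from itertools import permutations

S = {2, 3, 5, 7}  # a set has no first or last element

# Sum the elements in every possible order and keep the distinct results.
totals = {sum(ordering) for ordering in permutations(S)}
print(totals)  # {17}
```

For integers every ordering agrees. This is not automatic for everything a programmer might call addition: with floating-point numbers, rounding can make the result depend on the order, which is one more piece of context the notation leaves unstated.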

But it still gets worse.

In the case of the capital Sigma, there is nothing syntactically stopping a mathematician from writing

$\displaystyle \sum_{\sigma \in \Sigma} f_{\Sigma}(\sigma)$

Though experienced readers may chuckle, they will have no trouble understanding what is meant here. That is, syntactically this expression is unambiguous enough to avoid an outcry: $\Sigma$ just happens to also be a set, and saying $f_{\Sigma}$ means that the function $f$ is constructed in a way that depends on the choice of the set $\Sigma$. This often shows up in computer science literature, as $\Sigma$ is a standard letter to denote an alphabet (such as the binary alphabet $\left \{ 0,1 \right \}$).

One can even take it a step further and leave out the set we’re iterating over, as in

$\displaystyle \sum_{\sigma} f_{\Sigma}(\sigma)$

since it’s understood that the lowercase letter ($\sigma$) is usually an element of the set denoted by the corresponding uppercase letter ($\Sigma$). If you don’t know Greek and haven’t seen that coincidence enough times to recognize it, you would quickly get lost. But programmers must realize: this is just the mathematician’s secret handshake. A mathematician would be just as bewildered and confused upon seeing some of the pointer arithmetic hacks C programmers invent, or the always awkward infinite for loop, if they had not had enough experience dealing with the syntax of standard for loops.

```c
for (;;) {
    ;
}
```

In fact, a mathematician would look at this in disgust! The fact that the C programmer has need for something as pointless as an “empty statement” should be viewed as a clumsy inelegance in the syntax of the programming language (says the mathematician). Since mathematicians have the power to change their syntax at will, they would argue there’s no good reason not to change it, if it were a mathematical expression, to something simpler.

And once the paper you’re reading is over, and you start reading a new paper, chances are their conventions and notation will be ever-so-slightly different, and you have to keep straight what means what. It’s as if the syntax of a programming language changed depending on who was writing the program!

Perhaps understandably, the frustration that most mathematicians feel when dealing with varying syntax across different papers and books is collectively called “technicalities.” And as the mathematics becomes more advanced, the ability to fluidly transition between high-level intuition and technical details is all but assumed.

The upshot of this whole conversation is that the reader of a mathematical proof must hold in mind a vastly larger body of absorbed (and often frivolous) knowledge than the reader of a computer program.

At this point you might see all of this as my complaining, but in truth I’m saying this notational flexibility and ambiguity is a benefit. Once you get used to doing mathematics, you realize that technical syntax can make something which is essentially simple seem much more difficult than it is. In other words, we absolutely must have a way to make things completely rigorous, but in developing and presenting proofs the most important part is to make the audience understand the big picture, see the intuition behind the symbols, and believe the proofs. For better or worse, mathematical syntax is just a means to that end, and the more abstract the mathematics becomes, the more flexibility mathematicians need to keep themselves afloat in a tumultuous sea of notation.

## You’re on your own, unless you’re around mathematicians

That brings me to my last point: reading mathematics is much more difficult than conversing about mathematics in person. The reason for this is once again cultural.

Imagine you’re reading someone else’s program, and they’ve defined a number of functions like this (pardon the single-letter variable names; as long as one is willing to be vague I prefer single-letter variable names to “foo/bar/baz”).

```python
def splice(L):
    ...

def join(*args):
    ...

def flip(x, y):
    ...
```

There are two parts to understanding how these functions work. The first part is that someone (or a code comment) explains to you at a high level what they do to an input. The second part is to weed out the finer details. These “finer details” are usually completely spelled out by the documentation, but it’s still a good practice to experiment with it yourself (there is always the possibility for bugs or unexpected features, of course).

In mathematics there is no unified documentation, just a collective understanding, scattered references, and spoken folklore. You’re lucky if a textbook has a table of notation in the appendix. You are expected to derive the finer details and catch the errors yourself. Even if you are told the end result of a proposition, it is often followed by, “The proof is trivial.” This is the mathematician’s version of piping output to /dev/null, and literally translates to, “You’re expected to be able to write the proof yourself, and if you can’t then maybe you’re not ready to continue.”

Indeed, the opposite problems are familiar to a beginning programmer when they aren’t in a group of active programmers. Why is it that people give up or don’t enjoy programming? Is it because they have a hard time getting honest help from rudely abrupt moderators on help websites like Stack Overflow? Is it because often when one wants to learn the basics, they are overloaded with the entirety of the documentation and the overwhelming resources of the internet and all its inhabitants? Is it because compiler errors are nonsensically exact, but very rarely helpful? Is it because when you learn it alone, you are bombarded with contradicting messages about what you should be doing and why (and often for the wrong reasons)?

All of these issues definitely occur, and I see them contribute to my students’ confusion in my introductory Python class all the time. They try to look on the web for information about how to solve a very basic problem, and they come back to me saying they were told it’s more secure to do it this way, or more efficient to do it that way, or that they need to import something called the “heapq module.” But the real goal is not to solve the problem in the best way possible or in the shortest amount of code; it is to show them how to use the tools they already know about to construct a program that works. Without a guiding mentor it’s extremely easy to get lost in the jungle of people who think they know what’s best.

As far as I know there is no solution to this problem faced by the solo programming student (or the solo anything student). And so it stands for mathematics: without others doing mathematics with you, it’s very hard to identify your issues and see how to fix them.

## Proofs, Syntax, and Community

For the programmer who is truly interested in improving their mathematical skills, the first line of attack should now be obvious. Become an expert at applying the basic methods of proof. Second, spend as much time as it takes to clear up what mathematical syntax means before you attempt to interpret the semantics. And finally, find others who are interested in seriously learning some mathematics, and work on exercises (perhaps a weekly set) with them. Start with something basic like set theory, write your own proofs, and discuss each other’s proofs. Treat the sessions like code review sessions, and be the compiler to your partner’s program. Test their arguments to the extreme, and question anything that isn’t obvious or trivial. It’s not uncommon for easy questions with simple answers and trivial proofs to create long and drawn-out discussions before everyone agrees it’s obvious. Embrace this and use it to improve.

Short of returning to your childhood and spending more time doing recreational mathematics, that is the best advice I can give.

Until next time!