Silent Duels—Parsing the Construction

Last time we discussed the setup for the silent duel problem: two players taking actions in [0,1], player 1 gets n chances to act, player 2 gets m, and each knows their probability of success when they act.

The solution is in a paper of Rodrigo Restrepo from the 1950s. In this post I’ll start detailing how I study this paper, and talk through my thought process for approaching a bag of theorems and proofs. If you want to follow along, I re-typeset the paper on Github.

Game Theory Basics

The Introduction starts with a summary of the setting of game theory. I remember most of this so I will just summarize the basics of the field. Skip ahead if you already know what the minimax theorem is, and what I mean when I say the “value” of a game.

A two-player game consists of a set of actions for each player—which may be finite or infinite, and need not be the same for both players—and a payoff function for each possible choice of actions. The payoff function is interpreted as the “utility” that player 1 gains and player 2 loses. If the payoff is negative, you interpret it as player 1 losing utility to player 2. Utility is just a fancy way of picking a common set of units for what each player treasures in their heart of hearts. Often it’s stated as money and we assume both players value cash the same way. Games in which the utility is always “one player gains exactly the utility lost by the other player” are called zero-sum.

With a finite set of actions, the payoff function is a table. For rock-paper-scissors the table is:

Rock, paper: -1
Rock, scissors: 1
Rock, rock: 0
Paper, paper: 0
Paper, scissors: -1
Paper, rock: 1
Scissors, paper: 1
Scissors, scissors: 0
Scissors, rock: -1

You could arrange this in a matrix and analyze the structure of the matrix, but we won’t. It doesn’t apply to our forthcoming setting where the players have infinitely many strategies.

A strategy is a possibly-randomized algorithm (whose inputs are just the data of the game, not including any past history of play) that outputs an action. In some games, the optimal strategy is to choose a single action no matter what your opponent does. This is sometimes called a pure, dominating strategy, not because it dominates your opponent, but because it’s better than all of your other options no matter what your opponent does. The output action is deterministic.

However, as with rock-paper-scissors, the optimal strategy for most interesting games requires each player to act randomly according to a fixed distribution. Such strategies are called mixed or randomized. For rock-paper-scissors, the optimal strategy is to choose rock, paper, and scissors with equal probability.  Computers are only better than humans at rock-paper-scissors because humans are bad at behaving consistently and uniformly random.

The famous minimax theorem says that every two-player zero-sum game has an optimal strategy for each player, which is possibly randomized. This strategy is optimal in the sense that it maximizes your expected winnings no matter what your opponent does. However, if your opponent is playing a particularly suboptimal strategy, the minimax solution might not be as good as a solution that takes advantage of the opponent’s dumb choices. A uniform random rock-paper-scissors strategy is not optimal if your opponent always plays “rock.”  However, the optimal strategy doesn’t need special knowledge or space to store information about past play. If you played against God, you would blindly use the minimax strategy and God would have no upper hand. I wonder if the pope would have excommunicated me for saying that in the 1600’s.

The expected winnings for player 1 when both players play a minimax-optimal strategy is called the value of the game, and this number is unique (even if there are possibly multiple optimal strategies). If a game is symmetric—meaning both players have the same actions and the payoff function is symmetric—then the value is guaranteed to be zero. The game is fair.

The version of the minimax theorem that most people use (in particular, the version that often comes up in theoretical computer science) shows that finding an optimal strategy is equivalent to solving a linear program. This is great because it means that any such (finite) game is easy to solve. You don’t need insight; just compile and run. The minimax theorem is also true for sufficiently well-behaved continuous action spaces. The silent duel is well-behaved, so our goal is to compute an explicit, easy-to-implement strategy that the minimax theorem guarantees exists. As a side note, here is an example of a poorly-behaved game with no minimax optimum.

While the minimax theorem guarantees optimal strategies and a value, the concept of the “value” of the game has an independent definition:

Let X, Y be finite sets of actions for players 1, 2 respectively, and p(x), q(y) be strategies, i.e., probability distributions over X and Y so that p(x) is the probability that x is chosen. Let \Psi(x, y) be the payoff function for the game. The value of the game is a real number v such that there exist two strategies p, q with the two following properties. First, for every fixed y \in Y,

\displaystyle \sum_{x \in X} p(x) \Psi(x, y) \geq v

(no matter what player 2 does, player 1’s strategy guarantees at least v payoff), and for every fixed x \in X,

\displaystyle \sum_{y \in Y} q(y) \Psi(x, y) \leq v

(no matter what player 1 does, player 2’s strategy prevents a loss of more than v).

Since silent duels are continuous, Restrepo opens the paper with the corresponding definition for continuous games. Here a probability distribution is the same thing as a “positive measure with total measure 1.” Restrepo uses F and G for the strategies, and the corresponding statement of expected payoff for player 1 is that, for all fixed actions y \in Y,

\displaystyle \int \Psi(x, y) dF(x) \geq v

And likewise, for all x \in X,

\displaystyle \int \Psi(x, y) dG(y) \leq v

All of this background gets us through the very first paragraph of the Restrepo paper. As I elaborate in my book, this is par for the course for math papers, because written math is optimized for experts already steeped in the context. Restrepo assumes the reader knows basic game theory so we can get on to the details of his construction, at which point he slows down considerably to focus on the details.

Description of the Optimal Strategies

Starting in section 2, Restrepo describes the construction of the optimal strategy, but first he explains the formal details of the setting of the game. We already know the two players are taking n and m actions between 0 \leq t \leq 1, but we also fix the probability of success. Player 1 knows a distribution P(t) on [0,1] for which P(t) is the probability of success when acting at time t. Likewise, player 2 has a possibly different distribution Q(t), and (crucially) P(t), Q(t) both increase continuously on [0,1]. (In section 3 he clarifies further that P satisfies P(0) = 0, P(1) = 1, and P'(t) > 0, likewise for Q(t).) Moreover, both players know both P, Q. One could say that each player has an estimate of their opponent’s firing accuracy, and wants to be optimal compared to that estimate.

The payoff function \Psi(x, y) is defined informally as: 1 if Player one succeeds before Player 2, -1 if Player 2 succeeds first, and 0 if both players exhaust their actions before the end and none succeed. Though Restrepo does not state it, if the players act and succeed at the same time—say both players fire at time t=1—the payoff should also be zero. We’ll see how this is converted to a more formal (and cumbersome!) mathematical definition in a future post.

Next we’ll describe the statement of the fully general optimal strategy (which will be essentially meaningless, but have some notable features we can infer information from), and get a sneak peek at how to build this strategy algorithmically. Then we’ll see a simplified example of the optimal strategy.

The optimal strategy presented depends only on the values n, m (the number of actions each player gets) and their success probability distributions P, Q. For player 1, the strategy splits up [0,1] into subintervals

\displaystyle [a_i, a_{i+1}] \qquad 0 < a_1 < a_2, < \cdots < a_n < a_{n+1} = 1

Crucially, this strategy ignores the initial interval [0, a_1]. In each other subinterval Player 1 attempts an action at a time chosen by a probability distribution specific to that interval, independently of previous attempts. But no matter what, there is some initial wait time during which no action will ever be taken. This makes sense: if player 1 fired at time 0, it is a guaranteed wasted shot. Likewise, firing at time 0.000001 is basically wasted (due to continuity, unless P(t) is obnoxiously steep early on).

Likewise for player 2, the optimal strategy is determined by numbers b_1, \dots, b_m resulting in m intervals [b_j, b_{j+1}] with b_{m+1} = 1.

The difficult part of the construction is describing the distributions dictating when a player should act during an interval. It’s difficult because an interval for player 1 and player 2 can overlap partially. Maybe a_2 = 0.5, a_3 = 0.75 and b_1 = 0.25, b_2 = 0.6. Player 1 knows that Player 2 (using their corresponding minimax strategy) must act before time t = 0.6, and gets another chance after that time. This suggests that the distribution determining when Player 1 should act within [a_2, a_3] may have a discontinuous jump at t = 0.6.

Call F_i the distribution for Player 1 to act in the interval [a_i, a_{i+1}]. Since it is a continuous distribution, Restrepo uses F_i for the cumulative distribution function and dF_i for the probability density function. Then these functions are defined by (note this should be mostly meaningless for the moment)

\displaystyle dF_i(x_i) = \begin{cases} h_i f^*(x_i) dx_i & \textup{ if } a_i < x_i < a_{i+1} \\ 0 & \textup{ if } x_i \not \in [a_i, a_{i+1}] \\ \end{cases}

where f^* is defined as

\displaystyle f^*(t) = \prod_{b_j > t} \left [ 1 - Q(b_j) \right ] \frac{Q'(t)}{Q^2(t) P(t)}.

The constants h_i and h_{i+1} are related by the equation

\displaystyle h_i = [1 - D_i] h_{i+1},

where

\displaystyle D_i = \int_{a_i}^{a_{i+1}} P(t) dF_i(t)

What can we glean from this mashup of symbols? The first is that (obviously) the distribution is zero outside the interval [a_i, a_{i+1}]. Within it, there is this mysterious h_i that is related to the h_{i+1} used to define the next interval’s probability. This suggests we will likely build up the strategy in reverse starting with F_n as the “base case” (if n=1, then it is the only one).

Next, we notice the curious definition of f^*. It unsurprisingly requires knowledge of both P and Q, but the coefficient is strangely chosen: it’s a product over all failure probabilities (1 - Q(b_j)) of all interval-starts happening later for the opponent.

[Side note: it’s very important that this is a constant; when I first read this, I thought that it was \prod_{b_j > t}[1 - Q(t)], which makes the eventual task of integrating f^* much harder.]

Finally, the last interval (the one ending at t=1) may include the option to simply “wait for a guaranteed hit,” which Restrepo calls a “discrete mass of \alpha at t=1.” That is, F_n may have a different representation than the rest. Indeed, at the end of the paper we will find that Restrepo gives a base-case definition for h_n that allows us to bootstrap the construction.

Player 2’s strategy is the same as Player 1’s, but replacing the roles of P, Q, n, m, a_i, b_j in the obvious way.

The symmetric example

As with most math research, the best way to parse a complicated definition or construction is to simplify the different aspects of the problem until they become tractable. One way to do this is to have only a single action for both players, with P = Q. Restrepo provides a more general example to demonstrate, which results in the five most helpful lines in the paper. I’ll reproduce them here verbatim:

EXAMPLE. Symmetric Game: P(t) = Q(t), and n = m. In this case the two
players have the same optimal strategies; \alpha = 0, and a_k = b_k, k=1, \dots, n. Furthermore

\displaystyle \begin{aligned} P(a_{n-k}) &= \frac{1}{2k+3} & k = 0, 1, \dots, n-1, \\ dF_{n-k}(t) &= \frac{1}{4(k+1)} \frac{P'(t)}{P^3(t)} dt & a_{n-k} < t < a_{n-k+1}. \end{aligned}

Saying \alpha = 0 means there is no “wait until t=1 to guarantee a hit”, which makes intuitive sense. You’d only want to do that if your opponent has exhausted all their actions before the end, which is only likely to happen if they have fewer actions than you do.

When Restrepo writes P(a_{n-k}) = \frac{1}{2k+3}, there are a few things happening. First, we confirm that we’re working backwards from a_n. Second, he’s implicitly saying “choose a_{n-k} such that P(a_{n-k}) has the desired cumulative density.” After a bit of reflection, there’s no other way to specify the a_i except implicitly: we don’t have a formula for P to lean on.

Finally, the definition of the density function dF_{n-k}(t) helps us understand under what conditions the probability function would be increasing or decreasing from the start of the interval to the end. Looking at the expression P'(t) / P^3(t), we can see that polynomials will result in an expression dominated by 1/t^k for some k, which is decreasing. By taking the derivative, an increasing density would have to be built from a P satisfying P''(t) P(t) - 3(P'(t))^2 > 0. However, I wasn’t able to find any examples that satisfy this. Polynomials, square roots, logs and exponentials, all seem to result in decreasing density functions.

Finally, we’ll plot two examples. The first is the most reductive: P(t) = Q(t) = t, and n = m = 1. In this case n=1, and there is only one term k=0, for which a_n = 1/3. Then dF_1(t) = 1/4t^3. (For verification, note the integral of dF_1 on [1/3, 1] is indeed 1).

restrepo-1-over-4tcubed.png

With just one action and P(t) = Q(t) = t, the region before t=1/3 has zero probability, and the probability decreases from 6.75 to 1/4.

Note that the reason a_n = 1/3 is so nice is that P(t) is so simple. If P(t) were, say, t^2, then a_n should shift to being \sqrt{1/3}. If P(t) were more complicated, we’d have to invert it (or use an approximate search) to find the location a_n for which P(a_n) = 1/3.

Next, we loosen the example to let n=m=4, still with P(t) = Q(t) = t. In this case, we have the same final interval [1/3,1]. The new actions all occur in the time before t=1/3, in the intervals [1/5, 1/3], [1/7, 1/5], [1/9,1/7]. If there were more actions, we’d get smaller inverse-of-odd-spaced intervals approaching zero. The probability densities are now steeper versions of the same 1/4t^3, with the constant getting smaller to compensate for the fact that 1/t^3 gets larger and maintain the normalized distribution. For example, the earliest interval results in \int_{1/9}^{1/7} \frac{1}{16t^3} dt = 1. Closer to zero the densities are somewhat shallower compared to the size of the interval; for example in [1/9, 1/7], the density toward the beginning of the interval is only about twice as large as the density toward the end.

restrepo-four-actions.png

The combination of the four F_i’s for the four intervals in which actions are taken. This is a complete description of the optimal strategy for our simple symmetric version of the silent duel.

Since the early intervals are getting smaller and smaller as we add more actions, the optimal strategy will resemble a burst of action at the beginning, gradually tapering off as the accuracy increases and we work through our budget. This is an explicit tradeoff between the value of winning (lots of early, low probability attempts) and keeping some actions around for the end where you’re likely to succeed.

Next step: get to the example from the general theorem

At this point, we’ve parsed the general statement of the theorem, and while much of it is still mysterious, we extracted some useful qualitative information from the statement, and tinkered with some simple examples.

At this point, I have confidence that the simple symmetric example Restrepo provided is correct; it passed some basic unit tests, like that each dF_i is normalized. My next task in fully understanding the paper is to be able to derive the symmetric example from the general construction. We’ll do this next time, and include a program that constructs the optimal solution for any input.

Until then!

 

Reservoir Sampling

Problem: Given a data stream of unknown size n, pick an entry uniformly at random. That is, each entry has a 1/n chance of being chosen.

Solution: (in Python)

import random

def reservoirSample(stream):
   for k,x in enumerate(stream, start=1):
      if random.random() < 1.0 / k:
         chosen = x

   return chosen

Discussion: This is one of many techniques used to solve a problem called reservoir sampling. We often encounter data sets that we’d like to sample elements from at random. But with the advent of big data, the lists involved are so large that we either can’t fit it all at once into memory or we don’t even know how big it is because the data is in the form of a stream (e.g., the number of atomic price movements in the stock market in a year). Reservoir sampling is the problem of sampling from such streams, and the technique above is one way to achieve it.

In words, the above algorithm holds one element from the stream at a time, and when it inspects the k-th element (indexing k from 1), it flips a coin of bias 1/k to decide whether to keep its currently held element or to drop it in favor of the new one.

We can prove quite easily by induction that this works. Indeed, let n be the (unknown) size of the list, and suppose n=1. In this case there is only one element to choose from, and so the probability of picking it is 1. The case of n=2 is similar, and more illustrative. Now suppose the algorithm works for n and suppose we increase the size of the list by 1 adding some new element y to the end of the list. For any given x among the first n elements, the probability we’re holding x when we  inspect y is 1/n by induction. Now we flip a coin which lands heads with probability 1/(n+1), and if it lands heads we take y and otherwise we keep x. The probability we get y is exactly 1/(n+1), as desired, and the probability we get x is \frac{1}{n}\frac{n}{n+1} = \frac{1}{n+1}. Since x was arbitrary, this means that after the last step of the algorithm each entry is held with probability 1/(n+1).

\square

It’s easy to see how one could increase the number of coins being flipped to provide a sampling algorithm to pick any finite number of elements (with replacement, although a variant without replacement is not so hard to construct using this method). Other variants, exist, such as distributed and weighted sampling.

Python’s generators make this algorithm for reservoir sampling particularly nice. One can define a generator which abstractly represents a data stream (perhaps querying the entries from files distributed across many different disks), and this logic is hidden from the reservoir sampling algorithm. Indeed, this algorithm works for any iterable, although if we knew the size of the list we could sample much faster (by uniformly generating a random number and indexing the list appropriately). The start parameter given to the enumerate function makes the k variable start at 1.

Methods of Proof — Induction

In this final post on the basic four methods of proof (but perhaps not our last post on proof methods), we consider the proof by induction.

Proving Statements About All Natural Numbers

Induction comes in many flavors, but the goal never changes. We use induction when we want to prove something is true about all natural numbers. These statements will look something like this:

For all natural numbers n, 1 + 2 + \dots + n = n(n+1)/2.

Of course, there are many ways to prove this fact, but induction relies on one key idea: if we know the statement is true for some specific number n, then that gives us information about whether the statement is true for n+1. If this isn’t true about the problem, then proof by induction is hopeless.

Let’s see how we can apply it to the italicized statement above (though we haven’t yet said what induction is, this example will pave the way for a formal description of the technique). The first thing we notice is that indeed, if we know something about the first n numbers, then we can just add n+1 to it to get the sum of the first n+1 numbers. That is,

\displaystyle 1 + \dots + n + (n+1) = (1 + \dots + n) + (n+1)

Reiterating our key point, this is how we know induction is a valid strategy: the statement written for a fixed n translates naturally into the statement about n+1. Now suppose we know the theorem is true for n. Then we can rewrite the above sum as follows:

\displaystyle 1 + \dots + n + (n+1) = \frac{n(n+1)}{2} + (n+1)

With some algebra, we can write the left-hand side as a single fraction:

\displaystyle 1 + \dots + (n+1) = \frac{n(n+1) + 2(n+1)}{2}

and factoring the numerator gives

\displaystyle 1 + \dots + (n+1) = \frac{(n+1)(n+2)}{2}

Indeed, this is precisely what we’re looking for! It’s what happens when you replace n by n+1 in the original statement of the problem.

At this point we’re very close to being finished. We proved that if the statement is true for n, then it’s true for n+1. And by the same reasoning, it will be true for n+2, n+3, and all numbers after n. But this raises the obvious question: what’s the smallest number that it’s true for?

For this problem, it’s easy to see the answer is n=1. A mathematician would say: the statement is trivially true for n=1 (here trivial means there is no thinking required to show it: you just plug in n=1 and verify). And so by our reasoning, the statement is true for n=2, n=3, and so on forever. This is the spirit of mathematical induction.

Formal Nonsense

Now that we’ve got a taste of how to use induction in practice, let’s formally write down the rules for induction. Let’s have a statement which depends on a number n, and call it p(n). This is written as a function because it actually is one (naively). It’s a function from the set of natural numbers to the set of all mathematical statements. In our example above, p(n) was the statement that the equality 1 + \dots + n = n(n+1)/2 holds.

We can plug in various numbers into this function, such as p(1) being the statement “1 = 1(1+1)/2 holds,” or p(n+1) being “1 + \dots + (n+1) = (n+1)(n+1+1)/2 holds.”

The basic recipe for induction is then very simple. Say you want to prove that p(n) is true for all n \geq 1. First you prove that p(1) is true (this is called the base case), and then you prove the implication p(n) \to p(n+1) for an arbitrary n. Each of these pieces can be proved with any method one wishes (direct, contrapositive, contradiction, etc.). Once they are proven, we get that p(n) is true for all n automatically.

Indeed, we can even prove it. A rigorous proof requires a bit of extra knowledge, but we can easily understand the argument: it’s just a proof by contradiction. Here’s a quick sketch. Let X be the set of all natural numbers n for which p(n) is false. Suppose to the contrary that X is not empty. Every nonempty set of natural numbers has a smallest element, so let’s call m the smallest number for which p(m) is false. Now m-1 < m, so p(m-1) must be true. But we proved that whenever p(n) is true then so is p(n+1), so p(m-1 + 1) = p(m) is true, a contradiction.

Besides practicing proof by induction, that’s all there is to it. One more caveat is that the base case can be some number other than 1. For instance, it is true that n! > 2^n, but only for n \geq 4. In this case, we prove p(4) is true, and p(n) \to p(n+1) with the extra assumption that n \geq 4. The same inductive result applies.

Here are some exercises the reader can practice with, and afterward we will explore some variants of induction.

  1. Prove that n! > 2^n for all n \geq 4.
  2. Prove that for all n \geq 1 the following equality holds: 1/(1 \cdot 2) + 1/(2 \cdot 3) + \dots + 1/(n \cdot (n+1)) = n/(n+1).
  3. Prove that for every natural number n, a set of n elements has 2^n subsets (including the empty subset).

This last exercise gives a hint that induction can prove more than arithmetic formulas. Indeed, if we have any way to associate a mathematical object with a number, we can potentially use induction to prove things about those objects. Unfortunately, we don’t have any mathematical objects to work with (except for sets and functions), and so we will stick primarily to proving facts about numbers.

One interesting observation about proof by induction is very relevant to programmers: it’s just recursion. That is, if we want to prove a statement p(n), it suffices to prove it for p(n-1) and do some “extra computation” to arrive at the statement for p(n). And of course, we want to make sure the recursion terminates, so we add in the known result for p(1).

Variations on Induction, and Connections to Dynamic Programming

The first variation of induction is simultaneous induction on multiple quantities. That is, we can formulate a statement p(n,m) which depends on two natural numbers independently of one another. The base case is a bit trickier, but paralleling the above remark about recursion, double-induction follows the same pattern as a two-dimensional dynamic programming algorithm. The base cases would consist of all p(1,m) and all p(n,1), and the inductive step to get p(n,m) requires p(n-1,m) and p(n,m-1) (and potentially p(n-1, m-1) or others; it depends on the problem).

Unfortunately, natural instances where double induction is useful (or anywhere close to necessary) are rare. Here is an example of a (tricky) but elementary example. Call

\displaystyle C(m,n) = \frac{(2m)!(2n)!}{m!n!(m+n)!},

where the exclamation point denotes the factorial function. We will outline a proof that C(m,n) is always an integer for all m, n \geq 0. If we look at the base cases, C(0,n), C(m,0) (recalling that 0! = 1), we get (2n!)/(n! n!), and this happens to be in the form of a binomial coefficient (here, the number of ways to choose n! objects from a collection of (2n)! objects), and binomial coefficients are known to be integers. Now the inductive step relies on the fact that C(m,n-1) and C(m+1, n-1) are both integers. If this is true then

\displaystyle C(m,n) = 4C(m,n-1) - C(m+1, n-1),

which is obviously again an integer.

If we write these values in a table, we can see how the induction progresses in a “dynamic programming” fashion:

dynamic-programming-table

In order to fill the values in the next n column (prove the statement for those values of n), we need to fill the entire n-1 column (for indeed, we rely on the inductive hypothesis for both the m+1 and m row). But since our base case was the entire n=0 column, we can fill the entire table. In fact, we have just described a dynamic programming algorithm for computing the value of C(m,n) in mn steps. The correctness of the algorithm is indeed an inductive proof of the theorem.

Perhaps uninterestingly, this is asymptotically slower than the naive algorithm of computing C(m,n) directly by computing (2n)!, (2m)!, (n+m)! directly; this would take a linear number of steps in n, assuming n > m. In passing, this author wonders if, when the numbers are really large, the lack of division and multiplication in the dynamic program (multiplying by 4 using bit shifting instead) would overtake the naive algorithm. But C(m,n) is certainly not interesting enough in its own right for anyone to care 🙂

At this point, we have noticed that we sometimes use induction and assume that many smaller instances of the statement are true. Indeed, why not inductively assume that the statement holds for all smaller n. This would certainly give the prover more tools to work with. Indeed, this technique is sometimes called strong induction, in the sense that we assume a stronger inductive hypothesis when we’re trying to prove p(n+1). It may not be entirely obvious (especially to one well versed in the minutiae of formal logic) that this is actually equivalent to normal induction, but it is. In fact, the concept of “strong” induction is entirely pedagogical in nature. Most working mathematicians will not mention the difference in their proofs.

The last variant we’ll mention about induction is that of transfinite induction. The concept is that if you have any set X which is well-ordered (essentially this means: allowing one to prove induction applies as we did earlier in the post), then we can perform induction its elements. In this way, we can “parameterize” statements by elements of an arbitrary well-ordered set, so that instead of p(n) being a function from natural numbers to mathematical statements, it’s a function from X to mathematical statements. One somewhat common example of when X is something besides natural numbers is when we use the so-called cardinal numbers. We have already seen two distinct infinite cardinal numbers in this series: the cardinality of the integers and the cardinality of the real numbers (indeed, “cardinal number” just means a number which is the cardinality of a set). It turns out that there are many more kinds of cardinal numbers, and you can do induction on them, but this rarely shows up outside of mathematical logic.

And of course, we should close this post on an example of when induction goes wrong. For this we point the reader to our proof gallery, and the false proof that all horses are the same color. It’s quite an amusing joke, and hopefully it will stimulate the reader’s mind to recognize the pitfalls that can occur in a proof by induction.

So those are the basic four proof techniques! Fortunately for the reader, pretty much all proofs presented on this blog follow one of these four techniques. I imagine many of my readers skip over the proofs entirely (if only I could put proofs in animated gifs, with claims illustrated by grumpy cats!). So hopefully, if you have been intimidated or confused by the structure of the proofs on this blog, this will aid you in your future mathematical endeavors.  Butchering an old phrase for the sake of a bad pun, the eating of the pudding is in the proof. 🙂

Until next time!

False Proof – All Horses are the Same Color

Problem: Show that all horses are of the same color.

“Solution”: We will show, by induction, that for any set of n horses, every horse in that set has the same color. Suppose n=1, this is obviously true. Now suppose for all sets of n horses, every horse in the set has the same color. Consider any set H of n+1 horses. We may pick a horse at random, h_1 \in H, and remove it from the set, getting a set of n horses. By the inductive hypothesis, all of the n remaining horses are the same color.

On the other hand, if we remove a different horse h_2 \in H, we again get a set of n horses which are all the same color. Let us call this color “brown,” just to give it a name. In particular, h_1 is brown. But when we removed h_1, we got that all the remaining horses had the same color as h_2. So h_2 must also be brown. Hence, all horses in H are brown.

Explanation: The argument here is valid for almost all n. For large n, it is a very natural argument that follows from the inductive hypothesis. However, it fails for n=2 (and this is the only time it fails). By having only two horses, we cannot conclude that the removed horses have the same color, because there aren’t any horses remaining in H to apply the transitivity of “having the same color as.”

This false proof highlights the danger of neglecting the base case of an inductive argument. Here the true base case was not n=1, but rather n=2. Since the base case is false, we should have prudently stopped our argument there before embarrassing ourselves.