Reservoir Sampling

Problem: Given a data stream of unknown size n, pick an entry uniformly at random. That is, each entry has a 1/n chance of being chosen.

Solution: (in Python)

import random

def reservoirSample(stream):
    # Hold one candidate at a time: when inspecting the k-th element
    # (k starting at 1), replace the candidate with probability 1/k.
    # Assumes the stream is nonempty.
    for k, x in enumerate(stream, start=1):
        if random.random() < 1.0 / k:
            chosen = x

    return chosen

Discussion: This is one of many techniques used to solve a problem called reservoir sampling. We often encounter data sets that we’d like to sample elements from at random. But with the advent of big data, the lists involved are so large that we either can’t fit them into memory all at once, or we don’t even know how big they are because the data arrives as a stream (e.g., we can’t know in advance how many atomic price movements will occur in the stock market in a year). Reservoir sampling is the problem of sampling from such streams, and the technique above is one way to achieve it.

In words, the above algorithm holds one element from the stream at a time, and when it inspects the k-th element (indexing k from 1), it flips a coin of bias 1/k to decide whether to keep its currently held element or to drop it in favor of the new one.

We can prove quite easily by induction that this works. Indeed, let n be the (unknown) size of the list, and suppose n=1. In this case there is only one element to choose from, and so the probability of picking it is 1. The case of n=2 is similar, and more illustrative. Now suppose the algorithm works for lists of size n, and suppose we increase the size of the list by 1 by adding some new element y to the end. For any given x among the first n elements, the probability we’re holding x when we inspect y is 1/n by induction. Now we flip a coin which lands heads with probability 1/(n+1), and if it lands heads we take y and otherwise we keep x. The probability we get y is exactly 1/(n+1), as desired, and the probability we get x is \frac{1}{n}\frac{n}{n+1} = \frac{1}{n+1}. Since x was arbitrary, this means that after the last step of the algorithm each entry is held with probability 1/(n+1).

\square
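For the empirically minded, a quick simulation (a sanity check, not part of the proof; the stream and trial count below are arbitrary choices) shows each element being chosen with roughly equal frequency:

from collections import Counter

# Run the sampler many times on a stream of 10 elements; each element
# should be chosen close to 10% of the time.
counts = Counter(reservoirSample(range(10)) for _ in range(100000))
for element, count in sorted(counts.items()):
    print(element, count / 100000)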

It’s easy to see how one could increase the number of coins being flipped to provide a sampling algorithm that picks any finite number of elements (with replacement, although a variant without replacement is not so hard to construct using this method). Other variants exist, such as distributed and weighted sampling.
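As a sketch of that “more coins” idea (an illustration of mine, assuming we want k samples with replacement), one can run k independent copies of the single-element algorithm over one pass of the stream:

import random

def reservoirSampleWithReplacement(stream, k):
    # One candidate slot per independent copy of the single-element algorithm.
    chosen = [None] * k
    for i, x in enumerate(stream, start=1):
        for j in range(k):
            # Each slot flips its own coin of bias 1/i.
            if random.random() < 1.0 / i:
                chosen[j] = x
    return chosen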

Python’s generators make this algorithm for reservoir sampling particularly nice. One can define a generator which abstractly represents a data stream (perhaps querying the entries from files distributed across many different disks), and this logic is hidden from the reservoir sampling algorithm. Indeed, this algorithm works for any iterable, although if we knew the size of the list we could sample much faster (by uniformly generating a random number and indexing the list appropriately). The start parameter given to the enumerate function makes the k variable start at 1.
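For example, a generator can stand in for a stream whose size is never known in advance; the file name below is hypothetical, and reading lines from a file is just one way such a stream might be produced:

def lines_of(filename):
    # A generator abstracting a data stream: lines are produced lazily,
    # so the whole file never needs to fit in memory at once.
    with open(filename) as f:
        for line in f:
            yield line.rstrip("\n")

random_line = reservoirSample(lines_of("some-huge-file.txt"))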

Methods of Proof — Induction

In this final post on the basic four methods of proof (but perhaps not our last post on proof methods), we consider the proof by induction.

Proving Statements About All Natural Numbers

Induction comes in many flavors, but the goal never changes. We use induction when we want to prove something is true about all natural numbers. These statements will look something like this:

For all natural numbers n, 1 + 2 + \dots + n = n(n+1)/2.

Of course, there are many ways to prove this fact, but induction relies on one key idea: if we know the statement is true for some specific number n, then that gives us information about whether the statement is true for n+1. If this isn’t true about the problem, then proof by induction is hopeless.

Let’s see how we can apply it to the italicized statement above (though we haven’t yet said what induction is, this example will pave the way for a formal description of the technique). The first thing we notice is that indeed, if we know something about the first n numbers, then we can just add n+1 to it to get the sum of the first n+1 numbers. That is,

\displaystyle 1 + \dots + n + (n+1) = (1 + \dots + n) + (n+1)

Reiterating our key point, this is how we know induction is a valid strategy: the statement written for a fixed n translates naturally into the statement about n+1. Now suppose we know the theorem is true for n. Then we can rewrite the above sum as follows:

\displaystyle 1 + \dots + n + (n+1) = \frac{n(n+1)}{2} + (n+1)

With some algebra, we can write the right-hand side as a single fraction:

\displaystyle 1 + \dots + (n+1) = \frac{n(n+1) + 2(n+1)}{2}

and factoring the numerator gives

\displaystyle 1 + \dots + (n+1) = \frac{(n+1)(n+2)}{2}

Indeed, this is precisely what we’re looking for! It’s what happens when you replace n by n+1 in the original statement of the problem.

At this point we’re very close to being finished. We proved that if the statement is true for n, then it’s true for n+1. And by the same reasoning, it will be true for n+2, n+3, and all numbers after n. But this raises the obvious question: what’s the smallest number that it’s true for?

For this problem, it’s easy to see the answer is n=1. A mathematician would say: the statement is trivially true for n=1 (here trivial means there is no thinking required to show it: you just plug in n=1 and verify). And so by our reasoning, the statement is true for n=2, n=3, and so on forever. This is the spirit of mathematical induction.
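Though it proves nothing on its own, a one-line check (the bound of 1000 is arbitrary) can reassure us that the formula and the algebra above are not off by one:

# Sanity check (not a proof): compare the sum 1 + ... + n against n(n+1)/2.
assert all(sum(range(1, n + 1)) == n * (n + 1) // 2 for n in range(1, 1001))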

Formal Nonsense

Now that we’ve got a taste of how to use induction in practice, let’s formally write down the rules for induction. Let’s have a statement which depends on a number n, and call it p(n). This is written as a function because it actually is one (naively). It’s a function from the set of natural numbers to the set of all mathematical statements. In our example above, p(n) was the statement that the equality 1 + \dots + n = n(n+1)/2 holds.

We can plug in various numbers into this function, such as p(1) being the statement “1 = 1(1+1)/2 holds,” or p(n+1) being “1 + \dots + (n+1) = (n+1)(n+1+1)/2 holds.”

The basic recipe for induction is then very simple. Say you want to prove that p(n) is true for all n \geq 1. First you prove that p(1) is true (this is called the base case), and then you prove the implication p(n) \to p(n+1) for an arbitrary n. Each of these pieces can be proved with any method one wishes (direct, contrapositive, contradiction, etc.). Once they are proven, we get that p(n) is true for all n automatically.

Indeed, we can even prove it. A rigorous proof requires a bit of extra knowledge, but we can easily understand the argument: it’s just a proof by contradiction. Here’s a quick sketch. Let X be the set of all natural numbers n for which p(n) is false. Suppose to the contrary that X is not empty. Every nonempty set of natural numbers has a smallest element, so let’s call m the smallest number for which p(m) is false. Since we proved p(1) is true, we know m > 1, and so m-1 is a natural number smaller than m; by the minimality of m, p(m-1) must be true. But we proved that whenever p(n) is true then so is p(n+1), so p(m-1 + 1) = p(m) is true, a contradiction.

Besides practicing proof by induction, that’s all there is to it. One more caveat is that the base case can be some number other than 1. For instance, it is true that n! > 2^n, but only for n \geq 4. In this case, we prove p(4) is true (the new base case), and we prove p(n) \to p(n+1) with the extra assumption that n \geq 4. The same inductive reasoning then gives the statement for all n \geq 4.
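A quick computation (purely illustrative) shows why the base case has to move to n=4: the inequality genuinely fails for smaller n and first holds at 4.

from math import factorial

# n! > 2^n is false for n = 1, 2, 3 and first becomes true at n = 4.
for n in range(1, 8):
    print(n, factorial(n), 2 ** n, factorial(n) > 2 ** n)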

Here are some exercises the reader can practice with, and afterward we will explore some variants of induction.

  1. Prove that n! > 2^n for all n \geq 4.
  2. Prove that for all n \geq 1 the following equality holds: 1/(1 \cdot 2) + 1/(2 \cdot 3) + \dots + 1/(n \cdot (n+1)) = n/(n+1).
  3. Prove that for every natural number n, a set of n elements has 2^n subsets (including the empty subset).

This last exercise gives a hint that induction can prove more than arithmetic formulas. Indeed, if we have any way to associate a mathematical object with a number, we can potentially use induction to prove things about those objects. Unfortunately, we don’t have any mathematical objects to work with (except for sets and functions), and so we will stick primarily to proving facts about numbers.

One interesting observation about proof by induction is very relevant to programmers: it’s just recursion. That is, if we want to prove a statement p(n), it suffices to prove it for p(n-1) and do some “extra computation” to arrive at the statement for p(n). And of course, we want to make sure the recursion terminates, so we add in the known result for p(1).
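To make the analogy concrete, here is the summation proof from earlier transcribed as a recursive function (a sketch of my own): the recursive call plays the role of the inductive hypothesis, the added term is the “extra computation,” and the explicit n == 1 case is the base case that makes the recursion terminate.

def sum_up_to(n):
    # Base case: corresponds to proving p(1) directly.
    if n == 1:
        return 1
    # Inductive step: assume the result for n - 1 (the recursive call),
    # then do the extra computation of adding n.
    return sum_up_to(n - 1) + n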

Variations on Induction, and Connections to Dynamic Programming

The first variation of induction is simultaneous induction on multiple quantities. That is, we can formulate a statement p(n,m) which depends on two natural numbers independently of one another. The base case is a bit trickier, but paralleling the above remark about recursion, double-induction follows the same pattern as a two-dimensional dynamic programming algorithm. The base cases would consist of all p(1,m) and all p(n,1), and the inductive step to get p(n,m) requires p(n-1,m) and p(n,m-1) (and potentially p(n-1, m-1) or others; it depends on the problem).

Unfortunately, natural instances where double induction is useful (or anywhere close to necessary) are rare. Here is a tricky but elementary example. Call

\displaystyle C(m,n) = \frac{(2m)!(2n)!}{m!n!(m+n)!},

where the exclamation point denotes the factorial function. We will outline a proof that C(m,n) is always an integer for all m, n \geq 0. If we look at the base cases C(0,n) and C(m,0) (recalling that 0! = 1), we get (2n)!/(n!\,n!) and (2m)!/(m!\,m!), and these happen to be binomial coefficients (here, \binom{2n}{n} is the number of ways to choose n objects from a collection of 2n objects), and binomial coefficients are known to be integers. Now the inductive step relies on the fact that C(m,n-1) and C(m+1, n-1) are both integers. If this is true then

\displaystyle C(m,n) = 4C(m,n-1) - C(m+1, n-1),

which is obviously again an integer.

If we write these values in a table, we can see how the induction progresses in a “dynamic programming” fashion:

[Figure: dynamic-programming-table — the values C(m,n) arranged in a table and filled in column by column]

In order to fill the values in the next column (that is, to prove the statement for the next value of n), we need to fill the entire n-1 column, for indeed each entry relies on the inductive hypothesis for both the m and m+1 rows. But since our base case was the entire n=0 column, we can fill the entire table. In fact, we have just described a dynamic programming algorithm for computing the value of C(m,n) in roughly mn steps. The correctness of the algorithm is itself an inductive proof of the theorem.
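Here is one way to transcribe that table-filling argument into code (a sketch, with names of my own choosing). Each pass overwrites the column for n-1 with the column for n; since row m of the new column needs row m+1 of the old one, the base column is extended by N extra rows.

from math import comb

def C(M, N):
    # Base column n = 0: C(m, 0) = binomial(2m, m), an integer.
    # Rows 0 .. M + N are needed, since each new column consumes one extra row.
    column = [comb(2 * m, m) for m in range(M + N + 1)]
    for n in range(1, N + 1):
        # Column n from column n-1 via C(m, n) = 4 C(m, n-1) - C(m+1, n-1).
        column = [4 * column[m] - column[m + 1] for m in range(len(column) - 1)]
    return column[M]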

Perhaps uninterestingly, this is asymptotically slower than the naive algorithm of computing C(m,n) directly from the factorials (2n)!, (2m)!, and (n+m)!; that would take a linear number of steps in n, assuming n > m. In passing, this author wonders if, when the numbers are really large, the lack of division and multiplication in the dynamic program (multiplying by 4 using bit shifting instead) would overtake the naive algorithm. But C(m,n) is certainly not interesting enough in its own right for anyone to care 🙂

At this point, we have noticed that we sometimes use induction and assume that many smaller instances of the statement are true. Indeed, why not inductively assume that the statement holds for all smaller n? This certainly gives the prover more tools to work with. The technique is sometimes called strong induction, in the sense that we assume a stronger inductive hypothesis when we’re trying to prove p(n+1). It may not be entirely obvious (especially to one well versed in the minutiae of formal logic) that this is actually equivalent to normal induction, but it is. In fact, the distinction between “strong” and ordinary induction is entirely pedagogical in nature. Most working mathematicians will not mention the difference in their proofs.
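As a programming analogue (an illustration of mine, not from the original posts), strong induction corresponds to recursing on arbitrary smaller instances rather than just n-1. The classic strong-induction fact that every integer n \geq 2 is a product of primes becomes a recursive function whose calls land on any smaller divisors:

def prime_factorization(n):
    # Strong induction in code: for n >= 2, recurse on *any* proper divisors,
    # not just n - 1; both factors are strictly smaller than n, so the
    # recursion (the inductive hypothesis) still applies.
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return prime_factorization(d) + prime_factorization(n // d)
    return [n]  # no proper divisor found: n is prime (the base case)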

The last variant we’ll mention is transfinite induction. The idea is that if you have any set X which is well-ordered (essentially this means that every nonempty subset has a smallest element, which is the property we used earlier in this post to prove that induction works), then we can perform induction on its elements. In this way, we can “parameterize” statements by elements of an arbitrary well-ordered set, so that instead of p(n) being a function from natural numbers to mathematical statements, it’s a function from X to mathematical statements. One somewhat common example of an X other than the natural numbers is the set of so-called cardinal numbers. We have already seen two distinct infinite cardinal numbers in this series: the cardinality of the integers and the cardinality of the real numbers (indeed, “cardinal number” just means a number which is the cardinality of a set). It turns out that there are many more kinds of cardinal numbers, and you can do induction on them, but this rarely shows up outside of mathematical logic.

And of course, we should close this post on an example of when induction goes wrong. For this we point the reader to our proof gallery, and the false proof that all horses are the same color. It’s quite an amusing joke, and hopefully it will stimulate the reader’s mind to recognize the pitfalls that can occur in a proof by induction.

So those are the basic four proof techniques! Fortunately for the reader, pretty much all proofs presented on this blog follow one of these four techniques. I imagine many of my readers skip over the proofs entirely (if only I could put proofs in animated gifs, with claims illustrated by grumpy cats!). So hopefully, if you have been intimidated or confused by the structure of the proofs on this blog, this will aid you in your future mathematical endeavors.  Butchering an old phrase for the sake of a bad pun, the eating of the pudding is in the proof. 🙂

Until next time!

False Proof – All Horses are the Same Color

Problem: Show that all horses are of the same color.

“Solution”: We will show, by induction, that for any set of n horses, every horse in that set has the same color. For n=1 this is obviously true. Now suppose that in any set of n horses, every horse has the same color. Consider any set H of n+1 horses. We may pick a horse at random, h_1 \in H, and remove it from the set, getting a set of n horses. By the inductive hypothesis, all of the n remaining horses are the same color.

On the other hand, if we remove a different horse h_2 \in H, we again get a set of n horses which are all the same color. Let us call this color “brown,” just to give it a name. In particular, h_1 is brown. But when we removed h_1, we got that all the remaining horses had the same color as h_2. So h_2 must also be brown. Hence, all horses in H are brown.

Explanation: The inductive step here is valid for almost all n; as long as the set has at least three horses, the two smaller sets overlap and the argument goes through naturally. However, it fails when H has exactly two horses (and this is the only place it fails). With only two horses, removing h_1 and removing h_2 leave sets with no horse in common, so there is no remaining horse in H through which to apply the transitivity of “having the same color as,” and we cannot conclude that h_1 and h_2 share a color.

This false proof highlights the danger of neglecting the base case of an inductive argument. Since the inductive step only works when passing between sets of at least two horses, the true base case here was not n=1 but rather n=2. And since that base case is false (two horses certainly need not share a color), we should have prudently stopped our argument there before embarrassing ourselves.