# The Fourier Series – A Primer

## Overview

In this primer we’ll get a first taste of the mathematics that goes into the analysis of sound and images. In the next few primers, we’ll be building the foundation for a number of projects in this domain: extracting features of music for classification, constructing so-called hybrid images, and other image manipulations for machine vision problems (for instance, for use in neural networks or support vector machines; we’re planning on covering these topics in due time as well).

But first we must cover the basics, so let’s begin with the basic ideas about periodic functions. Next time, we’ll move forward to talk about Fourier transforms, and then to the discrete variants of the Fourier transform. But a thorough grounding in this field’s continuous origins will benefit everyone, and the ubiquity of the concepts in engineering applications ensures that future projects will need it too. And so it begins.

## The Bird’s Eye View

The secret of the universe that we will soon make rigorous is that the sine and cosine can be considered the basic building blocks of all functions we care about in the real world. That’s not to say they are the only basic building blocks; with a different perspective we can call other kinds of functions basic building blocks as well. But the sine and cosine are so well understood in modern mathematics that we can milk them for all they’re worth with minimum extra work. And as we’ll see, there’s plenty of milk to go around.

The most rigorous way to state that vague “building block” notion is the following theorem, which we will derive in the sequel. Readers without a strong mathematical background may cringe, but rest assured, the next section and the remainder of this primer require nothing more than calculus and familiarity with complex numbers. We simply state the main idea in full rigor here for the mathematicians. One who understands the content of this theorem may skip this primer entirely, for everything one needs to know about Fourier series is stated there. This may also display to the uninitiated reader the power of more abstract mathematics.

Theorem: The following set of functions forms a complete orthonormal basis for the space $L^2$ of square-integrable functions on $[0,1]$:

$\displaystyle \left \{ e^{2 \pi i k x} \right \}_{k = - \infty}^{\infty}$

And the projection of a function $f$ onto this basis gives the Fourier series representation for $f$.

Of course, those readers with a background in measure theory and linear algebra will immediately recognize many of the words in this theorem. We don’t intend to cover the basics of measure theory or linear algebra; we won’t define a measure or Lebesgue-integrability, nor will we reiterate the facts about orthogonality we covered in our primer on inner-product spaces. We will say now that the inner products here should be viewed as generalizations of the usual dot product to function spaces, and we will only use the corresponding versions of the usual tasks the dot product can perform. This author prefers to think about these things in algebraic terms, and so most of the important distinctions about series convergence will either fall out immediately from algebraic facts or be neglected.

On the other hand, we will spend some time deriving the formulas from a naive point of view. In this light, many of the computations we perform in this primer will not assume knowledge beyond calculus.

## Periodic Signals (They’re Functions! Just Functions!)

Fourier analysis is generally concerned with the analysis and synthesis of functions. By analysis we mean “the decomposition into easy-to-analyze components,” and by synthesis we mean “the reconstruction from such components.” In the wealth of literature that muddles the subject, everyone seems to have different notation and terminology just for the purpose of confusing innocent and unsuspecting mathematicians. We will take what this author thinks is the simplest route (and it is certainly the mathematically-oriented one), but one always gains an advantage by being fluent in multiple languages, so we clarify our terminology with a few definitions.

Definition: A signal is a function.

For now we will work primarily with continuous functions of one variable, either $\mathbb{R} \to \mathbb{R}$ or $\mathbb{R} \to \mathbb{C}$. This makes sense for functions of time, but many functions have domains that should be interpreted as spatial (after all, sine and cosine are classically defined as spatial concepts). We give an example of such spatial symmetry at the end of this primer, but the knowledgeable mathematician will imagine how this theory could be developed in more general contexts. When we get to image analysis, we will extend these methods to two (or more) dimensions. Moreover, as all of our non-theoretical work will be discrete, we will eventually drop the continuity assumption as well.

Also just for the moment, we want to work in the context of periodic functions. The reader should henceforth associate the name “Fourier series” with periodic functions. Since sine and cosine are periodic, it is clear that finite sums of sines and cosines are also periodic.

Definition: A function $f: \mathbb{R} \to \mathbb{C}$ is $a$-periodic if $f(x+a) = f(x)$ for all $x \in \mathbb{R}$.

This is a very strong requirement of a function, since just by knowing what happens to a function on any interval of length $a$, we can determine what the function does everywhere. At first, we will stick to 1-periodic functions and define the Fourier series just for those. As such, our basic building blocks won’t be $\sin(t), \cos(t)$, but rather $\sin(2\pi t), \cos(2 \pi t)$, since these are 1-periodic. Later, we will (roughly) generalize the Fourier series by letting the period tend to infinity, and arrive at the Fourier transform.

For functions which are only nonzero on some finite interval, we can periodize them to be 1-periodic. Specifically, we scale them horizontally so that they lie on the interval $[0,1]$, and we repeat the same values on every interval. In this way we can construct a large number of toy periodic functions.

## Naive Symbol-Pushing

Now once we get it in our minds that we want to use sines and cosines to build up a signal, we can imagine what such a representation might look like. We might have a sum that looks like this:

$\displaystyle A_0 + \sum_{k = 1}^{n} A_k\sin(2 \pi k t + \varphi_k)$

Indeed, this is the most general possible sum of sines that maintains 1-periodicity: we allow arbitrary phase shifts $\varphi_k$ and arbitrary amplitudes $A_k$, along with an arbitrary vertical shift $A_0$. We also multiply the frequency by $k$ at each step; each summand is still 1-periodic, although it may additionally be, e.g., 1/2-periodic. In general, since we increase the frequencies by integer factors, the period of the sum is determined by the longest of the constituent periods, which is 1. Test for yourself, for example, the period of $\sin(2\pi t) + \sin(4 \pi t)$.
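To check that last example numerically, here is a throwaway sketch in plain Python (the function name and tolerances are our own): the sum is 1-periodic, but not 1/2-periodic, because the $\sin(2 \pi t)$ term flips sign under a half-period shift.

```python
import math

def f(t):
    return math.sin(2 * math.pi * t) + math.sin(4 * math.pi * t)

samples = [k / 100 for k in range(100)]
# f(t + 1) == f(t) everywhere, but f(t + 1/2) != f(t) in general
one_periodic = all(abs(f(t + 1) - f(t)) < 1e-9 for t in samples)
half_periodic = all(abs(f(t + 0.5) - f(t)) < 1e-9 for t in samples)
print(one_periodic, half_periodic)  # True False
```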

Before we continue, the reader should note that we don’t yet know if such representations exist! We are just supposing initially that they do, and then deriving what the coefficients would look like.

To get rid of the phase shifts $\varphi_k$, and introduce cosines, we use the angle sum formula on the above quantity to get

$\displaystyle A_0 + \sum_{k = 1}^n A_k \sin(2 \pi k t) \cos(\varphi_k) + A_k \cos(2 \pi k t) \sin(\varphi_k)$

Now the cosines and sines of $\varphi_k$ can be absorbed into the constant $A_k$, which we rename $a_k$ and $b_k$ as follows:

$\displaystyle A_0 + \sum_{k =1}^{n} a_k \sin(2 \pi k t) + b_k \cos(2 \pi k t)$

Note that we will have another method to determine the necessary coefficients later, so we can effectively ignore how these coefficients change. Next, we note the following elementary identities from complex analysis:

$\displaystyle \cos(2 \pi k t) = \frac{e^{2 \pi i k t} + e^{-2 \pi i k t}}{2}$
$\displaystyle \sin(2 \pi k t) = \frac{e^{2 \pi i k t} - e^{-2 \pi i k t}}{2i}$

Now we can rewrite the whole sum in terms of complex exponentials as follows. Note that we change the indices to allow negative $k$, we absorb $A_0$ into the $k = 0$ term (for which $e^{0} = 1$), and we again shuffle the constants around.

$\displaystyle \sum_{k = -n}^n c_k e^{2 \pi i k t}$

At this point, we must allow the $c_k$ to be arbitrary complex numbers, since in the identity for sine above we divide by $2i$. Also, though we leave out the details, the $c_k$ coming from a real-valued sum satisfy an important property that guarantees the sum is real-valued for any $t$: namely, $c_{-k} = \overline{c_k}$, where the bar denotes the complex conjugate.

We’ve made a lot of progress, but whereas at the beginning we didn’t know what the $A_k$ should be in terms of $f$ alone, here we still don’t know how to compute $c_k$. To find out, let’s use a more or less random (inspired) trick: integrate.

Suppose that $f$ is actually equal to this sum:

$\displaystyle f(t) = \sum_{k = -n}^n c_k e^{2 \pi i k t}$

Let us isolate the variable $c_m$ by first multiplying through by $e^{-2 \pi i m t}$ (where $m$ may also be negative), and then move the remaining terms to the other side of the equation.

$\displaystyle c_m = f(t)e^{-2 \pi i m t} - \sum_{k \neq m} c_k e^{2 \pi i (k-m) t}$

Integrating both sides of the equation on $[0,1]$ with respect to $t$, something magical happens. First, note that if $k \neq m$, the integral

$\displaystyle \int_0^1 c_ke^{2 \pi i (k-m) t} dt$

is actually zero! This is because the antiderivative $\frac{e^{2 \pi i (k-m)t}}{2 \pi i (k-m)}$ takes the same value at $t = 0$ and $t = 1$, the integrand being 1-periodic. Moreover, the integral $\int_0^1 c_m dt = c_m(1 - 0) = c_m$. Hence, this big mess simplifies to

$\displaystyle c_m = \int_0^1 f(t)e^{-2 \pi i m t} dt$

And now the task of finding the coefficients is simply reduced to integration. Very tidy.
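To see the formula in action, we can approximate the integral with a crude Riemann sum; here is a sketch in Python (the helper names are our own). For the 1-periodic square wave that is 1 on $[0, 1/2)$ and 0 on $[1/2, 1)$, the integral can also be done by hand: $c_0 = 1/2$, $c_k = 1/(\pi i k)$ for odd $k$, and $c_k = 0$ for even $k \neq 0$.

```python
import cmath

def fourier_coefficient(f, m, samples=20000):
    # Riemann-sum approximation of c_m = ∫_0^1 f(t) e^{-2πi m t} dt
    return sum(f(j / samples) * cmath.exp(-2j * cmath.pi * m * j / samples)
               for j in range(samples)) / samples

def square(t):
    return 1.0 if (t % 1.0) < 0.5 else 0.0

c0 = fourier_coefficient(square, 0)
c1 = fourier_coefficient(square, 1)
c2 = fourier_coefficient(square, 2)
print(abs(c0), abs(c1), abs(c2))  # ≈ 0.5, ≈ 1/π ≈ 0.318, ≈ 0
# since the square wave is real-valued, c_{-1} is the conjugate of c_1
print(abs(fourier_coefficient(square, -1) - c1.conjugate()))  # ≈ 0
```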

Those readers with some analysis background will recognize this immediately as the $L^2$ inner product. Specifically, the inner product is:

$\displaystyle \left \langle f,g \right \rangle = \int_0^1 f(t)\overline{g(t)}dt$

In fact, the process we went through is how one derives the appropriate inner product for $L^2$. With respect to this inner product, the exponentials $e^{2 \pi i k t}$ do form an orthonormal set, and this is straightforward to verify once we have the inner product in hand. As we will see shortly, the coefficients of the Fourier series are precisely the projections of $f$ onto the complex exponentials, as is usual with orthonormal bases of inner product spaces.

We will provide an example of finding such coefficients in due time, but first we have bigger concerns.

## The Fourier Series, and Convergence Hijinx

Now recall that our original formula was a finite sum of sines. In general, not all functions can be represented by a finite sum of complex exponentials. For instance, take this square wave:

A square wave, 1-periodic, but discontinuous at k/2 for all integers k.

This function is the characteristic function of the set $\cup_{k \in \mathbb{Z}} [k, k+1/2]$. No matter how many terms we use, if the sum is finite then the result will be continuous, and this function is discontinuous. So we can’t represent it.

“Ah, what about continuous functions?” you say, “Surely if everything is continuous our problems will be solved.” Alas, if only mathematics were so simple. Here is an example of a continuous function which still cannot be represented: a triangle wave.

A triangle wave. It’s continuous, but not smooth.

This function can be described as the periodization of the piecewise function defined by $2x$ and $2-2x$ on the intervals $[0, 1/2], [1/2,1]$, respectively.
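In code, the periodization is one line of modular arithmetic; a small sketch (ours, not part of the derivation):

```python
def triangle(t):
    # 1-periodic triangle wave: 2t on [0, 1/2], 2 - 2t on [1/2, 1]
    t = t % 1.0  # periodize: only the fractional part of t matters
    return 2 * t if t <= 0.5 else 2 - 2 * t

print(triangle(0.25), triangle(0.5), triangle(1.75))  # 0.5 1.0 0.5
```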

Unfortunately, these “sharp corners” prevent any finite sum from giving an exact representation. Indeed, this function is not differentiable at those points, while a finite sum of differentiable exponentials is.

More generally, this is a problem whenever the function we are trying to analyze is not smooth, that is, not infinitely differentiable at every point of its domain. Since a complex exponential is smooth, and a finite sum of smooth functions is smooth, no finite sum can exactly represent a non-smooth function.

Indeed, the only way to avoid this issue is to let $n$ go to infinity. So let us make the natural definitions, and arrive at the true Fourier series:

Definition: The $k$-th Fourier coefficient of a function $f$, denoted $\widehat{f}(k)$, is the integral

$\displaystyle \widehat{f}(k) = \left \langle f, e^{2 \pi i k t} \right \rangle = \int_{0}^1 f(t) e^{-2 \pi i k t} dt$

The Fourier series of $f$ is

$\displaystyle \sum_{k = -\infty}^{\infty} \widehat{f}(k) e^{2 \pi i k t}$

and is equal to $f(t)$ for almost every $t \in \mathbb{R}$ (equality at every single point turns out to be too much to ask for, as we discuss below).

At first, we have to wonder what class of functions we can use this on. Indeed, this integral is not finite for some wild functions. This leads us to restrict Fourier series to functions in $L^2$. That is, the only functions we can apply this to must have finite $L^2$ norm. In the simplest terms, $f$ must be square-integrable, so that the integral

$\displaystyle \int_0^1 \left | f(t) \right |^2 dt$

is finite. As it turns out, $L^2$ is a huge class of functions, and every function we might want to analyze in the real world is (or can be made to be) in $L^2$. Then the convergence of a function to its Fourier series, in the $L^2$ sense we describe next, is guaranteed by the fact that the complex exponentials form a complete orthonormal basis for $L^2$ (pointwise equality almost everywhere is a far deeper matter, a celebrated theorem of Carleson).

But we have other convergence concerns. Specifically, we want it to be the case that when $k$ is very far from 0, then $\widehat{f}(k)$ contributes little to the representation of the function. The real way to say that this happens is by using the $L^2$ norm again, in the sense that two functions are “close” if the following integral is small in absolute value:

$\displaystyle \int_0^1 \left | f(t) - g(t) \right |^2 dt$

Indeed, if we let $g_n(t)$ be the truncation of the Fourier series at $n$ (i.e., the sum over $-n \leq k \leq n$), then as $n \to \infty$ the above integral tends to zero. This is called convergence in $L^2$, and it’s extremely important for applications. We will never be able to compute the true Fourier series in practice, so we have to stop at some sufficiently large $n$. We want the theoretical guarantee that our approximation only gets better by picking larger $n$. The proof of this fact is again given by our basis: the complex exponentials form a complete basis with respect to the $L^2$ norm.
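We can watch this convergence happen numerically. The sketch below (our own helper names, crude Riemann sums standing in for the integrals) truncates the Fourier series of the square wave and measures the $L^2$ error of each truncation:

```python
import cmath

def coeff(f, m, samples=2000):
    # Riemann-sum approximation of the m-th Fourier coefficient of f
    return sum(f(j / samples) * cmath.exp(-2j * cmath.pi * m * j / samples)
               for j in range(samples)) / samples

def truncated_series(f, n):
    cs = {k: coeff(f, k) for k in range(-n, n + 1)}
    return lambda t: sum(c * cmath.exp(2j * cmath.pi * k * t)
                         for k, c in cs.items()).real

def l2_error(f, g, samples=2000):
    # approximates the integral of |f(t) - g(t)|^2 over [0,1]
    return sum(abs(f(j / samples) - g(j / samples)) ** 2
               for j in range(samples)) / samples

square = lambda t: 1.0 if (t % 1.0) < 0.5 else 0.0
errors = [l2_error(square, truncated_series(square, n)) for n in (1, 4, 16)]
print(errors)  # strictly decreasing as n grows
```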

Moreover, if the original function $f$ is continuously differentiable (continuity alone is famously not enough), then we get uniform convergence. That is, the quality of the approximation will not depend on $t$, but only on our choice of a terminating $n$.

There is a wealth of other results on the convergence of Fourier series, and rightly so, given how widely they are used. One particularly interesting result is a counterexample: there are integrable (but not square-integrable) functions whose Fourier series diverge almost everywhere, a famous construction due to Kolmogorov.

## Other Bases, and an Application to Heat

As we said before, there is nothing special about using the complex exponentials as our orthonormal basis of $L^2$. More generally, we call $\left \{ e_n \right \}$ a Hilbert basis for $L^2$ if it forms an orthonormal set and is complete with respect to the $L^2$ norm.

One can define an analogous series expansion for a function with respect to any Hilbert basis. While we leave out many of the construction details for a later date, one can use, for example, Chebyshev polynomials or Hermite polynomials. This idea is hence generalized to the field of “wavelet analysis” and “wavelet transforms,” of which the Fourier variety is a special case (now we’re speaking quite roughly; there are details this author isn’t familiar with at the time of this writing).

Before we finish, we present an example where the Fourier series is used to solve the heat equation on a circular ring. We choose this problem because historically it was the motivating problem behind the development of these ideas.

In general, the heat equation applies to some region $R$ in space, and you start with an initial distribution of heat $f(x)$, where $x$ is a vector with the same dimension as $R$. The heat equation dictates how heat dissipates over time.

Periodicity enters into the discussion because of our region: $R$ is a one-dimensional circle, and so $x$ is just the value $x \in [0, 1]$ parameterizing the angle $2 \pi x$. Let $u(x,t)$ be the temperature at position $x$ at time $t$. As we just said, $u$ is 1-periodic in $x$, as is $f$. The two are related by $u(x,0) = f(x)$.

The important consideration is that the symmetry of the circle has consequences in how the heat dissipates.

Now let us write the Fourier series for $u$.

$\displaystyle u(x,t) = \sum_{k = -\infty}^{\infty} c_k(t) e^{2 \pi i k x}$

Where the dependence on time $t$ comes into play in $c_k$:

$\displaystyle c_k(t) = \int_0^1 u(x,t)e^{-2 \pi i k x} dx$

Now the mystery here is evaluating these $c_k$. We want to find the coefficients in terms of the initial temperature distribution $f$. The magical step here is that $u$ satisfies the (one-dimensional) heat equation: the partial derivative with respect to time of $u$ is directly proportional to the second partial spatial derivative. In symbols,

$u_t = au_{xx}$

Where the constant $a$ depends on the physical properties of the region. This equation is also called the diffusion equation, because it governs not only the dissipation of heat but also the diffusion of many other quantities over time (e.g., charge in a wire).

Without loss of generality, let’s let $a = 1/2$. Let’s plug the Fourier series for $u(x,t)$ into the heat equation and see what happens. First,

$\displaystyle u_t = \sum_{k = -\infty}^{\infty} c_k'(t) e^{2 \pi i k x}$
$\displaystyle u_{xx} = \sum_{k = -\infty}^{\infty} c_k(t)(-4 \pi ^2 k^2) e^{2 \pi i k x}$

And plugging in, we have

$\displaystyle \sum_{k = -\infty}^{\infty} c_k'(t) e^{2 \pi i k x} = \sum_{k = -\infty}^{\infty} c_k(t)(-2 \pi^2 k^2)e^{2 \pi i k x}$

This simply says that we equate the coefficients of each term $e^{2 \pi i k x}$, giving

$\displaystyle c_k'(t) = -2 \pi^2 k^2 c_k(t)$

But this is a wonderful thing, because this is an ordinary differential equation, and we can solve it by elementary means. Using the usual logarithmic trick, we have

$\displaystyle c_k(t) = c_k(0) e^{- 2 \pi^2k^2 t}$

So now $c_k(0)$ is just $\int_0^1 u(x,0)e^{-2 \pi i k x} dx$ and $u(x,0) = f(x)$. From a different perspective, this is just the $k$-th Fourier coefficient of $f$. Summing it all up, we now have a very explicit description of $u(x,t)$ in which the dependence on time and space is obvious:

$\displaystyle u(x,t) = \sum_{k = -\infty}^{\infty} \widehat{f}(k)e^{-2 \pi^2 k^2 t}e^{2 \pi i k x}$

And this solves the problem.
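To see the formula in motion, here is a small sketch in Python (the names and truncation point are our own; $a = 1/2$ as above). The initial distribution is the square wave, whose Fourier coefficients a quick integral shows to be $\widehat{f}(0) = 1/2$, $\widehat{f}(k) = 1/(\pi i k)$ for odd $k$, and zero otherwise.

```python
import cmath

def u(x, t, n=50):
    # truncated series u(x,t) = Σ_{|k| ≤ n} f̂(k) e^{-2π²k²t} e^{2πikx}
    def f_hat(k):
        if k == 0:
            return 0.5
        return 1 / (cmath.pi * 1j * k) if k % 2 else 0.0
    return sum(f_hat(k) * cmath.exp(-2 * cmath.pi**2 * k**2 * t)
               * cmath.exp(2j * cmath.pi * k * x)
               for k in range(-n, n + 1)).real

print(u(0.25, 0.0))  # ≈ 1: the "hot" half of the ring at time zero
print(u(0.25, 1.0))  # ≈ 0.5: the heat has nearly evened out to the average
```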

To convince you of some of the virtues of this representation, look at the limit as $t$ tends to infinity: in every term with $k \neq 0$, the factor $e^{-2 \pi^2 k^2 t}$ goes to zero (since $k^2 > 0$ for such $k$), leaving only the constant term $\widehat{f}(0)$, the average value of the initial distribution. That is, no matter what the initial heat distribution is, after a long enough time the heat spreads itself evenly around the ring.

So there you have it! Our main content posts in the future will use Fourier series (and more heavily, Fourier transforms) in analyzing things like music files, images, and other digital quantities.

Until then!

# Kolmogorov Complexity – A Primer

## The Complexity of Things

Previously on this blog (quite a while ago), we’ve investigated some simple ideas of using randomness in artistic design (psychedelic art, and earlier randomized css designs), and measuring the complexity of such constructions. Here we intend to give a more thorough and rigorous introduction to the study of the complexity of strings. This naturally falls into the realm of computability theory and complexity theory, and so we refer the novice reader to our other primers on the subject (Determinism and Finite Automata, Turing Machines, and Complexity Classes; but Turing machines will be the most critical to this discussion).

## The Problem with Randomness

What we would really love to do is be able to look at a string of binary digits and decide how “random” it is. On one hand, this is easy to do at a high level. A child would have no difficulty deciding which of the following two strings is more random:

10101010101010101010101010101010101010101010101010
00011101001000101101001000101111010100000100111101

And yet, by the immutable laws of probability, each string has an equal chance ($2^{-50}$) of being chosen at random from all sequences of 50 binary digits. So in a sense, the huge body of mathematics underlying probability has already failed us at this basic juncture: it cannot speak of how random one particular outcome of an experiment is. We need a new notion that overcomes this difficulty.

Definition: The Kolmogorov complexity of a string $w$, denoted $K(w)$, is the length of the shortest program which outputs $w$ given no input.

While this definition is not rigorous enough to be of any use (we will reformulate it later), we can easily see why the first of the two strings above is less random. We can write a very short Python program that outputs it:

print("01" * 25)

On the other hand, it is not hard to see that the shortest program that produces the second string is

print("00011101001000101101001000101111010100000100111101")

The dutiful reader will cry out in protest. How do you know that’s the shortest program? Why restrict to Python? This whole discussion is so arbitrary!

Indeed, this is probably not the strictly shortest Python program that prints out the string. In the following we’ll work entirely in binary, so the picky reader should interpret this as a print command referring to a block of binary memory. We will reify these ideas in full rigor shortly, but first let us continue with this naivete to make the coming definitions a bit easier to parse.

If we think of the string length $n$ as a parameter, the length of the first program is $O(1) + \log(n)$, since we need $\log(n)$ bits to represent the number $n/2$ in the repetition count. The second string, on the other hand, requires a program of length $O(1) + n$, as the entire output string must appear in the program text.

This intuition would lead us to define a sequence of length $n$ to be random if it has Kolmogorov complexity at least $n$.

We can extend this idea to talk about relative complexity. Specifically, we can speak of Python programs which accept input, and compute their output based on that input. For instance, the first of the two strings above has the program:

n = int(input())
print("01" * (n // 2))

With respect to the input "50", we see that the first string has constant conditional complexity (and the same is true with respect to many other inputs, such as 25). In other words, the string "50" contains a lot of information about the string we wish to generate (namely, its length).

On the other hand, the same technique still won’t work for the second string. Even though it has length 50, that is not enough information to determine the contents of the string, which vary wildly. So the shortest program is still (probably) the one which prints out the string verbatim.

In the future, we plan to revisit the idea of relative complexity in the context of machine learning and classification; briefly, two items are similar if one has low complexity relative to the other. But for now, let us turn to the more precise definitions.
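Though the true conditional complexity is out of reach (as we will see shortly), an off-the-shelf compressor gives a rough, computable stand-in for it, and this is the spirit of those applications. A sketch of the idea using Python's zlib (the helper name and sample strings are our own inventions): compressing $w$ after the compressor has already seen $x$ measures roughly how much extra information $w$ contains beyond $x$.

```python
import zlib

def conditional_bound(w: bytes, x: bytes) -> int:
    # extra compressed bytes that w costs once the compressor has seen x:
    # a crude, computable stand-in for the conditional complexity K(w | x)
    return len(zlib.compress(x + w, 9)) - len(zlib.compress(x, 9))

base = b"the quick brown fox jumps over the lazy dog. " * 20
near = b"the quick brown fox jumps over the lazy hog. " * 20
unrelated = bytes(range(256)) * 4

print(conditional_bound(near, base))       # small: near is cheap given base
print(conditional_bound(unrelated, base))  # larger: no shared structure
```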

## Actual Kolmogorov Complexity

We keep saying of the second string above that the shortest program is probably the one which prints the string verbatim. In fact, short of testing every single Python program of shorter length, we will never know if this is true! Even if we did, our following definitions will make that discovery irrelevant. More generally, it’s an important fact that Kolmogorov complexity is an uncomputable function. That is, there is no Turing machine $M$ which accepts as input a word $w$ and produces $K(w)$ as output. In order to prove such miraculous things, we need to formalize the discussion in the context of Turing machines.

Let us fix a universal programming language $L$ and speak of the Kolmogorov complexity with respect to $L$:

Definition: The Kolmogorov complexity of a string $w$ with respect to $L$, denoted $K_L(w)$, is the length of the shortest program written in the language $L$ which produces $w$ as output. The conditional Kolmogorov complexity with respect to a string $x$, denoted $K_L(w | x)$ (spoken “$w$ given $x$,” as in probability theory), is the length of the shortest program which, when given $x$ as input, outputs $w$.

Before we can prove that this definition is independent of $L$ (for all intents and purposes), we need a small lemma, which we have essentially proved already:

Lemma: For any strings $w,x$ and any language $L$, we have $K_L(w | x) \leq |w| + c$ for some constant $c$ independent of $w, x$, and $K_L(w) \leq |w| + c'$ for some constant $c'$ independent of $w$.

Proof. The program which trivially outputs the desired string has length $|w| + c$, for whatever constant number of letters $c$ is required to dictate that a string be given as output. This is clearly independent of the string and any input. $\square$

It is not hard to see that this definition is invariant under the choice of language $L$, up to an additive constant. In particular, let $w$ be a string and fix two languages $L, L'$. As long as both languages are universal, in the sense that they can simulate a universal Turing machine, we can relate the Kolmogorov complexity of $w$ with respect to both languages. Specifically, one can write an interpreter for $L$ in the language of $L'$ and vice versa. We saw a short foray into designing interpreters (before this author realized he had no interesting direction to take the project) with the Chai programming language. Although we didn’t get far enough to make Chai universal, the reader may rest assured that it is possible, and that the resulting Racket program has finite length.

If we let $p$ be the shortest program written in $L$ that outputs $w$ given $x$, and $i$ be the interpreter for $L$ written in $L'$, then we can compute $w$ from the input $x$ by running the interpreter $i$ on the hard-coded program $p$. In other words,

$K_{L'}(w | x) \leq |p| + c + |i| = K_L(w | x) + c + |i| = K_L(w | x) + O(1)$

Another easy way to convince oneself of this is to imagine one knows the language $L'$. Then the program would be something akin to:

input x
run the interpreter i on the program p with input x

Our inequalities above just describe the length of this program: we require the entire interpreter $i$, and the entire program $p$, to be part of the program text. After that, it’s just whatever fixed constant amount of code is required to initialize the interpreter on that input.

We call this result the invariance property of Kolmogorov complexity. And with this under our belt, we may speak of the Kolmogorov complexity of a string. We willingly ignore the additive constant difference which depends on the language chosen, and we may safely work exclusively in the context of some fixed universal Turing machine (or, say, Python, C, Racket, or Whitespace; pick your favorite). We will do this henceforth, denoting the Kolmogorov complexity of a string $K(w)$.

## Some Basic Facts

The two basic facts we will work with are the following:

• There are strings of arbitrarily large Kolmogorov complexity.
• Most strings have high complexity.

And we will use these facts to prove that $K$ is uncomputable.

The first point is that for any $n$, there is a string $w$ such that $K(w) \geq n$. To see this, we need only count the number of strings of smaller length. For those of us familiar with typical combinatorics, it’s just the Pigeonhole Principle applied to all strings of smaller length.

Specifically, there is at most one string of length zero, so there can only be one string with Kolmogorov complexity zero; i.e., there is only one program of length zero, and it can only have one output. Similarly, there are two strings of length one, so there are only two strings with Kolmogorov complexity equal to one. More generally, there are $2^i$ strings of length $i$, so there are at most $\sum_{i = 0}^{n-1} 2^i$ strings of Kolmogorov complexity less than $n$. However, as we have seen before, this sum is $2^n - 1$. So there are $2^n$ strings of length $n$ but only $2^n - 1$ candidate programs of smaller length, implying that at least one string of length $n$ has Kolmogorov complexity at least $n$.
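The count is easy to verify by brute force for small $n$ (a throwaway sketch of our own):

```python
from itertools import product

n = 8
# all binary strings of length strictly less than n: the candidate
# "programs" in the counting argument
shorter = ["".join(bits) for i in range(n) for bits in product("01", repeat=i)]
print(len(shorter), 2**n - 1)  # 255 255: fewer than the 256 strings of length 8
```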

The second point is simply viewing the above proof from a different angle: at least one of the $2^n$ strings of length $n$ has complexity at least $n$, more than half of them have complexity at least $n-1$, more than three quarters have complexity at least $n-2$, and so on.

Rigorously, we will prove that the probability of picking a string with $K(x) \leq n - c - 1$ is smaller than $1/2^c$. Letting $S$ be the set of all such strings, we have an injection $S \hookrightarrow$ the set of all strings of length at most $n - c - 1$ (each string maps to a shortest program producing it), and there are $2^{n-c} - 1$ such strings, so $|S| \leq 2^{n-c} - 1$, giving the inequality:

$\displaystyle \frac{|S|}{2^n} \leq \frac{2^{n-c} - 1}{2^n} = \frac{1}{2^c} - \frac{1}{2^n} < \frac{1}{2^c}$

In other words, the probability that a randomly chosen string $w$ of length $n$ has $K(w) \geq n-c$ is at least $1 - 1/2^c$.

The strings of high Kolmogorov complexity have special names.

Definition: We call a string $w$ such that $K(w) \geq |w|$ a Kolmogorov random string, or an incompressible string.

This name makes sense for two reasons. First, as we’ve already mentioned, randomly chosen strings are almost always Kolmogorov random strings, so the name “random” is appropriate. Second, Kolmogorov complexity is essentially the ideal compression technique. In a sense, strings with high Kolmogorov complexity can’t be described in any shorter language; such a language would necessarily correspond to a program that decodes the language, and if the compression is small, so is the program that decompresses it (these claims are informal, and asymptotic).

## Uncomputability

We will now prove the main theorem of this primer, that Kolmogorov complexity is uncomputable.

Theorem: The Kolmogorov complexity function $w \mapsto K(w)$ is uncomputable.

Proof. Suppose to the contrary that $K$ is computable, and that $M$ is a Turing machine which computes it. We will construct a new Turing machine $M'$ which computes strings of high complexity, but $M'$ will have a short description, giving a contradiction.

Specifically, on input $n$, the machine $M'$ iterates over the set of all binary strings in lexicographic order. For each such string $w$ it computes $K(w)$, halting and outputting $w$ once it finds one with $K(w) \geq n$ (such a $w$ exists by our first basic fact above). Then we have the following inequality:

$n \leq K(w) \leq \left | \left \langle M', n \right \rangle \right | + c$

Here, the angle bracket notation represents a description of the tuple $(M', n)$, and $c$ is a constant depending only on $M'$, which is fixed. The reason the inequality holds is just the invariance theorem: $\left \langle M', n \right \rangle$ is a description of $w$ in the language of Turing machines. In other words, the universal Turing machine which simulates $M'$ on $n$ will output $w$, so the Kolmogorov complexity of $w$ is bounded by the length of this description (plus a constant).

Then the length of $\left \langle M', n \right \rangle$ is at most $\log(n) + \left | \left \langle M' \right \rangle \right | + c'$ for some constant $c'$, and this is in turn $\log(n) + c''$ for some constant $c''$ (as the description of $M'$ is constant). This gives the inequality

$n \leq \log(n) + c''$

But since $\log(n) = o(n)$, we may pick sufficiently large $n$ to achieve a contradiction. $\square$

We can interpret this philosophically: it is impossible to tell exactly how random something is. But perhaps more importantly, this is a genuinely different proof of uncomputability from our proof of the undecidability of the Halting problem. Up until now, the only way we could prove something is uncomputable is to reduce it to the Halting problem. Indeed, this lends itself nicely to many new kinds of undecidability-type theorems, as we will see below.

The reader may ask at this point, “Why bother talking about something we can’t even compute!?” Indeed, one might think that, due to its uncomputability, Kolmogorov complexity can only provide insight into theoretical matters of no practical concern. In fact, there are many practical applications of Kolmogorov complexity; in practice, one simply substitutes a gross upper bound on $K$ obtained from one of the many industrial-strength compression algorithms out there, such as Huffman coding. Our goal after this primer will be to convince you of its remarkable applicability, despite its uncomputability.

## Consequences of Uncomputability and Incompressibility

One immediate consequence of the existence of incompressible strings is the following: it is impossible to write a lossless compression algorithm that compresses every string. No matter how crafty one might be, there will always be strings whose compressed forms are at least as long as the strings themselves.

But there are a number of other, more unexpected places that Kolmogorov complexity fills a niche. In computational complexity, for instance, one can give a lower bound on the amount of time it takes a single-tape Turing machine to simulate a two-tape Turing machine. Also in the realm of lower bounds, one can prove that an incompressible string $w$ of length $2^n$, which can be interpreted as a boolean function on $n$ variables, requires $\Omega(2^n/n)$ gates to express as a circuit.

It also appears outside of theoretical computer science: for instance, in the study of entropy in dynamical systems (specifically, thermodynamics), where the compressibility of a system’s state over time can stand in for its entropy.

In fact, Scott Aaronson (whom we’ve seen before in our exploration of the Complexity Zoo) even proposes an empirical test of this idea: simulate a discretized coffee cup and try to compress the data at each step, graphing the resulting lengths to see the trends. This even sounds like a good project for this blog!

The applications don’t end there, though. Researchers have used the theory of Kolmogorov complexity to tackle problems in machine learning, clustering, and classification. Potentially, any subject which is concerned with relating pieces of information could benefit from a Kolmogorov-theoretic analysis.

Finally, we give a proof that Kolmogorov complexity yields an infinite family of true but unprovable mathematical statements.

Theorem: Fix a formalization of mathematics in which the following three conditions hold:

• If a statement is provable then it is true.
• Given a proof and a statement, it is decidable whether the proof proves the statement.
• For every binary string $x$ and integer $k$, we can construct a statement $S(x,k)$ which is logically equivalent to “the Kolmogorov complexity of $x$ is at least $k$.”

Then there is some constant $t$ for which all statements $S(x,k)$ with $k > t$ are unprovable.

Proof. We construct an algorithm to find such proofs as follows:

On an input k:
   set m = 1
   loop:
      for all strings x of length at most m:
         for all strings P of length at most m:
            if P is a proof of S(x,k), output x and halt
      m = m + 1

Suppose to the contrary that for all $k$ there is an $x$ for which the statement $S(x,k)$ is provable. It is easy to see that the algorithm above will find such an $x$. On the other hand, for all such proofs $P$, we have the following inequality:

$k \leq K(x) \leq \left | \left \langle M, k \right \rangle \right | + c = \log(k) + c'$

Indeed, this algorithm is a description of $x$, so its length gives a bound on the complexity of $x$. Just as in the proof of the uncomputability of $K$, this inequality can only hold for finitely many $k$, a contradiction. $\square$

Note that this is a very different unprovability-type proof from Gödel’s Incompleteness Theorem. Moreover, since almost every string is incompressible, we can now produce, with arbitrarily high probability, arbitrarily many true but unprovable statements: a randomly chosen long string $x$ is incompressible with high probability, and for such an $x$ the statements $S(x,k)$ with $t < k \leq K(x)$ are true but unprovable.

Look forward to our future posts on the applications of Kolmogorov complexity! We intend to explore some compression algorithms, and to use such algorithms to explore clustering problems, specifically for music recommendations.

Until then!

# A Poll for the Next Topic

I’m always looking for new ideas to write about on this blog. In fact, almost every day I stumble upon a Wikipedia page for something I never knew about before, and want to start researching it right away! So respond to the poll with whatever sounds the most interesting, or better yet, leave a comment about what you want to read that isn’t on the list.

I’d also like to know what parts of the blog you readers enjoy the most. So far my only gauge of that is from page hits and comments. It seems that my Python and Racket primers garner the most attention, and while I haven’t found any good false proofs recently, those are a big hit. So leave a comment with some feedback, and this blog will become all the better for it!

# Optimally Stacking the Deck – Texas Hold ‘Em

Main Theorem: There exist optimal stackings for standard two-player Texas Hold ‘Em.

## A Puzzle is Solved (and then some!)

It’s been quite a while since we first formulated the idea of an optimal stacking. In the meantime, we’ve gotten distracted with graduate school, preliminary exams, and the host of other interesting projects going on here at Math ∩ Programming. And so, months later, after traversing the homotopic hills of topology and projective plains of algebra, we’ve finally found time to solve the problem. And we found quite a bit more than we expected! But first, let’s recap the ideas from our original post.

For a reminder of the rules of Texas Hold ‘Em, here’s a silly tutorial video from PokerStars.com. For the sake of this post, the betting rules will not matter.

## Optimal Stackings

In card games, especially gambling games, trust is rarely given freely. Instead, we incorporate rituals into standard play that assure fairness. One of the most common such rituals is “cutting the deck.” The player who shuffles the deck passes the deck to a second player, who cuts the deck in half, placing the bottom half above the top half. Assuming the two players are not colluding and there is no sleight of hand, this ritual absolves the dealer of any guilt. Even if the dealer had stacked the deck to deal himself favorable cards, the randomness of the cut will necessarily ruin his efforts. Moreover, the cutter is often chosen to be the player to the dealer’s right (the last one dealt, in most games) to make it all the more difficult for two players to collude.

An optimal stacking circumvents the tacit fairness assumed after a deck is cut. In particular, we search for an ordering of the cards in the deck which is so mischievous that no matter where it is cut, a specific player always wins.

This author doesn’t condone cheating in card games, nor does he plan to use the discoveries in this article for anything more than a parlor trick. Instead, the existence or nonexistence of an optimal stacking is an interesting combinatorial property of card games, and it appears to be a measure of complexity. As we saw last time, simple games like Blind Man’s Bluff (sometimes called Indian Poker) and Kuhn Poker do not have optimal stackings. Adding complexity to these games, we invented Kicsi Poker and discovered it has 11 total distinct optimal stackings (up to cyclic permutations of the deck). Moreover, depending on whether a card was “burned” during the deal, one player had significantly more optimal stackings than the other player (nine for player 1, and two for player 2).

But nobody plays Kicsi Poker, so the true goal is to determine whether Texas Hold ‘Em has optimal stackings. Because we found optimal stackings for Kicsi Poker after adding complexity, we reckoned that adding more complexity would not rule out the possibility of optimal stackings. Hence we supplied the following conjecture, which is now a theorem:

Theorem: There exist optimal stackings for standard two-player Texas Hold ‘Em.

In fact, we found much more than this, but we should formalize the notion a bit further and describe our method before constructing such a stacking. The remainder of this section will be abstract and may seem unnecessary to the casual reader, but we contend that a proper mathematical analysis should be independent of the real world, and with the right definitions we can logically extend optimal stackings to other situations.

Definition: A cutting permutation $\sigma$ of a list $(1, 2, \dots, n)$ is a permutation in $S_n$ with the cycle form:

$(1\ 2\ \dots\ n)^k$

and for a fixed $0 \leq k < n$, we call $\sigma$ the cut at the $k$-th position.

If the reader is not familiar with the symmetric groups, this is simply a formalization of the intuitive idea of a cut. For example, to “cut” the list $(1, 2, 3, 4, 5)$ at the third position is simply the permutation $(1\ 2\ 3\ 4\ 5)^3$, which has cycle form $(1\ 4\ 2\ 5\ 3)$. Applying this results in the list $(4,5,1,2,3)$, as expected. Moreover, every cut can be achieved by iterating the process of putting the top card on the bottom of the deck. Hence, the set of all cutting permutations of an $n$-card deck is the cyclic group $\mathbb{Z}/n\mathbb{Z}$; for a standard deck, $\mathbb{Z}/52\mathbb{Z}$.

Since any list imposes an ordering on its contents, we can apply cutting permutations to any list.

Definition: A cut of a list $(c_1, \dots , c_n)$ is the list $(c_{\sigma(1)}, \dots, c_{\sigma(n)})$ for some cutting permutation $\sigma$. If the list is denoted $l$, we denote the cut $\sigma(l)$.

Specifically to card games, we speak of cutting a deck as the process of replacing a list of cards with its cut. We can now define an optimal stacking.

Definition: Fix a game $G$ played with a set of $k$ cards and $n$ players. We say that $G$ has an optimal stacking for player $i$ if there exists an ordering of the cards $D = (c_1, \dots, c_k)$ such that for every cutting permutation $\sigma$, player $i$ can force a win when $\sigma(D)$ is dealt.

Moreover, a game is said to simply have an optimal stacking if it has one for any player. We do not yet know of games which have optimal stackings for some players but not others, and we conjecture that no natural games do (that would make the game unfair for some players, although this author has always found asymmetric games interesting). It may also benefit one to assume in general that all players are perfectly rational, a standard assumption in game theory.

Note that we do not want to assume the cheating player knows the order of the deck ahead of time. Instead, an optimal stacking should allow the player to force a win simply by logical play and the knowledge that the stacking is optimal. Specifically, for a game where players have individual hands, the cheating player should not need to know the contents of the other players’ hands in order to win. In Texas Hold ‘Em this becomes a null issue, since if a player knows he is being dealt an optimal stacking, his hand will always end up winning; there is no reason for him to fold, even if his chances seem slim to none by usual probabilistic analysis.

Before we get back to Texas Hold ‘Em, we take a moment to prove a nice fact.

Proposition: The (slightly simplified) game of Hearts does not have an optimal stacking.

Proof. First, we must clarify that a “win” in Hearts is to either collect 0 points in a round or shoot the moon, and such a score must be forced in order to have an optimal stacking. We allow no passing of cards. Moreover, we restrict our game of Hearts to a single round, so that the winner of the game is the one who takes the smallest number of points. Hence, if one cannot shoot the moon, one must try to minimize the number of points taken. In particular, there is no reason for one player to “take one for the team” by trying to stop another player from shooting the moon unless the first player can guarantee having the smallest number of points at the end.

The crux of the proof is that all cards in the deck are dealt before play begins. Suppose first that the goal is simply to collect 0 points, and that some stacking $D$ wins for player 1. A cut at position 1 shifts each hand to the next player, so in the cut deal player 4 holds the hand with which player 1 could guarantee collecting 0 points, and can do the same. Likewise, in the deal cut at position 2, players 3 and 4 hold the winning hands from the deals cut at positions 0 and 1, respectively, while player 1 can still guarantee 0 points because $D$ is optimal. But if players 1, 3, and 4 can each guarantee collecting 0 points, then all 26 points must go to player 2; that is, player 2 shoots the moon, giving each of the other players 26 points and contradicting their guarantees.

If, on the other hand, player 1 can guarantee shooting the moon, the same argument shows that a cut at position 1 gives that guarantee to player 4. This is a contradiction since two players cannot shoot the moon simultaneously, nor can one shoot the moon while the other takes zero points. $\square$

Even extending the definition of a win to “collecting the minimum number of points” does not change the conclusion. The same proof applies, along with the observation that 26 (the total number of points) is not divisible by 4 (the number of players); more coarsely, the Queen of Spades alone represents 13 points, and can only be taken by one player.

One aspect of this proof contradicts the claim that the existence of an optimal stacking is a measure of complexity. In particular, Hearts is considered a rather complex game (certainly more complex than Kicsi Poker), and yet it does not have an optimal stacking. Moreover, with a suitable definition of a “win,” we can likely extend this proof to any game which deals out the entire deck to all players before starting (and where all players can see all their cards at all times).

While we do not have a formal rebuttal, it may instead be the case that the existence of an optimal stacking measures the ability for a game to diverge. In other words, in a game like Texas Hold ‘Em, one can start with a poor hand and later improve it. On the other hand, it could simply be that Texas Hold ‘Em has a complex deal: the five community cards, along with the dead cards burned in between each round, give the game enough “space” to allow for an optimal stacking.

But enough big-picture speculation. It’s time for the real prize.

## Optimal Stackings in Texas Hold ‘Em

Before we get into the program, we should give an example of an optimal stacking, and hence a proof of the main theorem. Here it is:

Th5d7cAd3cQsJc4h6c8c9dTc6hQc8d9sJh5c7dAhAc9h2cKh5sJs8s7h2d7s9c4dKd8hQd6d3sKs5h2hAs4s2sTs6sQhJd3h4cTd3dKc

p1     p2      community       p1 hand     p2 hand    winner
Th7c | 5dAd | Qs Jc 4h 8c Tc | One Pair  | High Card | 1
5dAd | 7c3c | Jc 4h 6c 9d 6h | One Pair  | One Pair  | 1
7c3c | AdQs | 4h 6c 8c Tc Qc | Flush     | One Pair  | 1
AdQs | 3cJc | 6c 8c 9d 6h 8d | Two Pair  | Two Pair  | 1
3cJc | Qs4h | 8c 9d Tc Qc 9s | Flush     | Two Pair  | 1
Qs4h | Jc6c | 9d Tc 6h 8d Jh | Straight  | Two Pair  | 1
Jc6c | 4h8c | Tc 6h Qc 9s 5c | Flush     | High Card | 1
4h8c | 6c9d | 6h Qc 8d Jh 7d | One Pair  | One Pair  | 1
6c9d | 8cTc | Qc 8d 9s 5c Ah | One Pair  | One Pair  | 1
8cTc | 9d6h | 8d 9s Jh 7d Ac | Straight  | One Pair  | 1
9d6h | TcQc | 9s Jh 5c Ah 9h | Trips     | One Pair  | 1
TcQc | 6h8d | Jh 5c 7d Ac 2c | Flush     | High Card | 1
6h8d | Qc9s | 5c 7d Ah 9h Kh | Straight  | One Pair  | 1
Qc9s | 8dJh | 7d Ah Ac 2c 5s | One Pair  | One Pair  | 1
8dJh | 9s5c | Ah Ac 9h Kh Js | Two Pair  | Two Pair  | 1
9s5c | Jh7d | Ac 9h 2c 5s 8s | Two Pair  | High Card | 1
Jh7d | 5cAh | 9h 2c Kh Js 7h | Two Pair  | High Card | 1
5cAh | 7dAc | 2c Kh 5s 8s 2d | Two Pair  | One Pair  | 1
7dAc | Ah9h | Kh 5s Js 7h 7s | Trips     | One Pair  | 1
Ah9h | Ac2c | 5s Js 8s 2d 9c | One Pair  | One Pair  | 1
Ac2c | 9hKh | Js 8s 7h 7s 4d | One Pair  | One Pair  | 1
9hKh | 2c5s | 8s 7h 2d 9c Kd | Two Pair  | One Pair  | 1
2c5s | KhJs | 7h 2d 7s 4d 8h | Two Pair  | One Pair  | 1
KhJs | 5s8s | 2d 7s 9c Kd Qd | One Pair  | High Card | 1
5s8s | Js7h | 7s 9c 4d 8h 6d | Straight  | One Pair  | 1
Js7h | 8s2d | 9c 4d Kd Qd 3s | High Card | High Card | 1
8s2d | 7h7s | 4d Kd 8h 6d Ks | Two Pair  | Two Pair  | 1
7h7s | 2d9c | Kd 8h Qd 3s 5h | One Pair  | High Card | 1
2d9c | 7s4d | 8h Qd 6d Ks 2h | One Pair  | High Card | 1
7s4d | 9cKd | Qd 6d 3s 5h As | Straight  | High Card | 1
9cKd | 4d8h | 6d 3s Ks 2h 4s | One Pair  | One Pair  | 1
4d8h | KdQd | 3s Ks 5h As 2s | Straight  | One Pair  | 1
KdQd | 8h6d | Ks 5h 2h 4s Ts | One Pair  | High Card | 1
8h6d | Qd3s | 5h 2h As 2s 6s | Two Pair  | One Pair  | 1
Qd3s | 6dKs | 2h As 4s Ts Qh | One Pair  | High Card | 1
6dKs | 3s5h | As 4s 2s 6s Jd | Flush     | Flush     | 1
3s5h | Ks2h | 4s 2s Ts Qh 3h | One Pair  | One Pair  | 1
Ks2h | 5hAs | 2s Ts 6s Jd 4c | One Pair  | High Card | 1
5hAs | 2h4s | Ts 6s Qh 3h Td | One Pair  | One Pair  | 1
2h4s | As2s | 6s Qh Jd 4c 3d | One Pair  | High Card | 1
As2s | 4sTs | Qh Jd 3h Td Kc | Straight  | One Pair  | 1
4sTs | 2s6s | Jd 3h 4c 3d Th | Two Pair  | One Pair  | 1
2s6s | TsQh | 3h 4c Td Kc 5d | Straight  | One Pair  | 1
TsQh | 6sJd | 4c Td 3d Th 7c | Trips     | One Pair  | 1
6sJd | Qh3h | Td 3d Kc 5d Ad | Flush     | One Pair  | 1
Qh3h | Jd4c | 3d Kc Th 7c 3c | Trips     | One Pair  | 1
Jd4c | 3hTd | Kc Th 5d Ad Qs | Straight  | One Pair  | 1
3hTd | 4c3d | Th 5d 7c 3c Jc | Two Pair  | One Pair  | 1
4c3d | TdKc | 5d 7c Ad Qs 4h | One Pair  | High Card | 1
TdKc | 3dTh | 7c Ad 3c Jc 6c | Flush     | One Pair  | 1
3dTh | Kc5d | Ad 3c Qs 4h 8c | One Pair  | High Card | 1
Kc5d | Th7c | 3c Qs Jc 6c 9d | High Card | High Card | 1

The first two columns show the pocket cards dealt to player 1 and player 2, respectively, and the fourth and fifth columns give their final hands, again respectively.

Immediately, the reader will recognize that there are 52! possible stackings of the deck, a number with 68 digits. Searching them all, even at supercomputer speed, would easily take longer than the life of the universe (at a trillion stackings per second, the estimate is on the order of $10^{48}$ years). So we cannot possibly look through them all.

The idea behind the local search is simple, and we’ve seen it before on this blog when we looked at decrypting substitution ciphers. We use steepest ascent in the discrete space of all orderings of the deck, where the function to optimize is the number of cuts that win for the first player dealt. Neighbors consist of the set of stackings obtainable by swapping pairs of cards, and we use random restarts to avoid local maxima. Moreover, we parallelized the algorithm to run on a desktop supercomputer, access to which is graciously provided by the UIC mathematical computer science department. The result is lightning-fast searching.

In other words, our algorithm starts with a random shuffle of the deck. We compute the value of a deck by dealing out a game for each of the 52 cuts of the deck, and seeing how many times player 1 wins. Call that value $v$. Then we compare $v$ with the value of all neighboring decks, where a neighbor is obtained by swapping any two cards in the deck. If any such stacking has a higher value, we take the stacking with the highest value, and repeat the process. Once we get to the point where no such neighboring deck has a higher value, we stop and output the final stacking. Then we start the whole process over again with a new random deck, and stop when we find an optimal stacking (or continue searching for more).

Note that our choice of two-player Texas Hold ‘Em is arbitrary, as is our preference for player 1 to win. We have found a number of optimal stackings for player 2 as well, and it’s not hard to imagine there are optimal stackings for Hold ‘Em with any number of players. We leave that question for future research.

We chose to implement this algorithm in C++. Now this is one of the few times this author would willingly program in C/C++, because frankly the language is a jungle. A professor of mine once said “You can write programs fast, or you can write fast programs.” What he meant by that was you can write programs in a language that lets you express ideas concisely and clearly, or you can spend days laboriously working in C/C++ and end up with a program that will leave all other programs in the dust. Because at first we expected optimal stackings to be sparse, we imagined speed would be quite important, and that we’d have to analyze billions of different decks before we could find an optimal stacking. In hindsight this was probably unnecessary, but we do admittedly wallow in pride when quoting statistics about how fast the program runs.

All of the code we used is available for free on this blog’s Google Code page. We also include a text file containing the list of about 100,000 optimal stackings we found, and similar tables as above detailing the hands dealt for each cut.

In any case, we separated the code into three parts. First we implemented a general steepest-ascent framework for any problem. It’s general enough that the open-minded reader could easily use it to do steepest ascent for any problem, and one only needs implement the functions to generate neighbors and compute the value of a position. To give a pedantic (yet unrealistic) example, our posted code includes an instance of optimizing a polynomial function in one variable.

Second, we implemented a poker-hand evaluator and neighbor generator. In this part, we borrowed the lookup-table/perfect-hashing methods developed by Kevin Suffecool and augmented by Paul Senzee. We do note that we take the “slow” method of evaluating a seven-card hand: checking all 21 five-card sub-hands and choosing the best. Otherwise, enlarging the lookup table to cover seven-card hands would incur more than a 2,000-fold increase in memory use for only a 10- to 15-fold speedup. Considering that we’re running this on someone else’s personal supercomputer, we found it prudent to accept the smaller memory footprint. And even so, on the supercomputer we can evaluate about 7 million seven-card hands per second, down from about 87 million five-card hands per second. With the added memory footprint we could potentially reach 87 million seven-card hands per second, but as it turns out we can find plenty of optimal stackings with the slower version.

Third, we parallelized the program using OpenMP to run on 12 threads, reducing the runtime by a factor of 12 for sufficiently large problem sizes. For smaller problem sizes, we see a speedup of roughly 10x, for a reasonable 80% efficiency.

In the next section we detail the important aspects of the code.

## Enter the Jungle

Unfortunately, a primer on C++ is far beyond the scope of this blog. The best introductory text we know of is Prata’s C++ Primer Plus, but short of reading this behemoth and becoming a guru, the reader may safely skip ahead to the next section. An equally strong understanding of the algorithm itself can also be gained from our Python post on cryptanalysis with ngrams, though the program is admittedly much slower.

The general steepest ascent algorithm features a virtual C++ class called Position. A position has three functions: value(), neighbors(), and show(), the last of which returns a string containing a textual representation of a Position. Here are snippets of the code, taken from my hillclimb.h and hillclimb.cpp files:

class Position {
public:
   Position();
   virtual ~Position();

   virtual double value() = 0;
   virtual std::vector<Position *> *neighbors() = 0;
   virtual std::string show() = 0;
};

We also define the “hillclimb” function which operates on positions:

Position* hillclimb(Position* posn, const int numSteps);

One important note is that we require the caller of the neighbors function to assume ownership of all the pointers in the returned vector, along with the returned vector itself. This author has a minor suspicion that this isn’t the proper way to do things in C++, but nobody has remarked otherwise.

Then the hillclimb function need know nothing about poker to work:

Position* hillclimb(Position* posn, const int numSteps) {
   double value = posn->value(), nextValue;
   int i;
   bool foundBigger;
   Position *original = posn, *previous;

   vector<Position *> *nbrs;
   vector<Position *>::iterator itr;

   for (i = 0; i < numSteps; i++) {
      nbrs = posn->neighbors(); // allocates neighboring Positions
      previous = posn;

      foundBigger = false;
      for (itr = nbrs->begin(); itr != nbrs->end(); itr++) {
         nextValue = (*itr)->value();

         if (nextValue > value) {
            foundBigger = true;
            value = nextValue;
            posn = *itr;
         }
      }

      // free the computed neighbors, except the one we climbed to
      for (itr = nbrs->begin(); itr != nbrs->end(); itr++)
         if (*itr != posn)
            delete *itr;

      delete nbrs;

      // the position we climbed from was itself allocated by an earlier
      // neighbors() call, so free it too (but never the caller's input)
      if (foundBigger && previous != original)
         delete previous;

      if (!foundBigger)
         break;
   }

   return posn;
}

Note that in the parallelization step we had to avoid the use of iterators, but that only changes the code superficially by adding additional indices.

For a short example on how to use this framework, the reader should investigate our quarticopt.h and quarticopt.cpp to maximize a quartic polynomial, included on this blog’s Google Code page with the rest of the code for this post.

The poker part required a bit of preprocessing to understand, but it is essentially a simple idea. First, we note that even though there are a lot of poker hands, many are equivalent up to scoring. For example, there are 1,020 different hands of the form AKQJ9 which score “high card,” up to a choice of suits which does not make the hand a flush. But none of these hands is any better than another. In other words, there are only 7,462 distinct five-card poker hands, which is a lot smaller than the roughly 2.6 million possible five-card hands.

The poker scoring function hence takes as input a five-card hand and computes a number between 1 and 7,462 representing the absolute score of the hand. It does this through a combination of accessing a pre-generated lookup table for flushes and high-card hands, and a perfect hashing technique for the remaining hands. The interested reader can see Suffecool’s website for more details. It is quite novel, and results in a lightning-fast scoring algorithm.

Finally, we generate neighbors and score hands in the obvious (albeit tedious) way:

vector<Position *> *Deck::neighbors() {
   vector<Position *> *listOfNeighbors = new vector<Position *>;
   int i, j, temp;

   for (i = 0; i < DECK_SIZE - 1; i++) {
      for (j = i + 1; j < DECK_SIZE; j++) {
         // copy to a new Deck
         Deck *neighboringDeck = new Deck(*this);

         // swap a pair of cards
         temp = neighboringDeck->cards[i];
         neighboringDeck->cards[i] = neighboringDeck->cards[j];
         neighboringDeck->cards[j] = temp;

         listOfNeighbors->push_back(neighboringDeck);
      }
   }

   return listOfNeighbors;
}

double Deck::value() {
   int cutPosn, count = 0, score1, score2;
   int p1hand[7], p2hand[7];

   for (cutPosn = 0; cutPosn < DECK_SIZE; cutPosn++) {
      // player 1's pocket cards, then the five shared community cards;
      // offsets +4, +8, and +10 are the burned cards
      p1hand[0] = cards[cutPosn];
      p1hand[1] = cards[(cutPosn + 2) % DECK_SIZE];
      p1hand[2] = cards[(cutPosn + 5) % DECK_SIZE];
      p1hand[3] = cards[(cutPosn + 6) % DECK_SIZE];
      p1hand[4] = cards[(cutPosn + 7) % DECK_SIZE];
      p1hand[5] = cards[(cutPosn + 9) % DECK_SIZE];
      p1hand[6] = cards[(cutPosn + 11) % DECK_SIZE];

      // player 2's pocket cards and the same community cards
      p2hand[0] = cards[(cutPosn + 1) % DECK_SIZE];
      p2hand[1] = cards[(cutPosn + 3) % DECK_SIZE];
      p2hand[2] = cards[(cutPosn + 5) % DECK_SIZE];
      p2hand[3] = cards[(cutPosn + 6) % DECK_SIZE];
      p2hand[4] = cards[(cutPosn + 7) % DECK_SIZE];
      p2hand[5] = cards[(cutPosn + 9) % DECK_SIZE];
      p2hand[6] = cards[(cutPosn + 11) % DECK_SIZE];

      score1 = eval7Hand(p1hand);
      score2 = eval7Hand(p2hand);

      // smaller scores are better
      if (score1 < score2)
         count++;
   }

   return count;
}

Here “eval7Hand” evaluates a seven-card poker hand to find the best five-card sub-hand as described above (and smaller scores are better). We note that the next time we work on this problem, we will likely extend the code to work for any number of hands and any number of players, and replace the ugly assignment operators above with a few loops. Moreover, we leave out an additional optimization from this code snippet for brevity.

Running the main application gives an output similar to the large table above for each discovered optimal stacking.

## Parallelization

A thorough introduction to parallel computation is again (sadly) beyond the scope of this post. Luckily, we chose the arguably simplest possible library for parallelization: OpenMP. With OpenMP, the serial code looks almost identical to the parallel code, with the exception of a few sagely placed pragmas (code annotations). For instance, our main function parallelizes at the top-most level, so that each thread performs steepest ascent trials independently of the others. From this perspective, the steepest ascent algorithm is pleasingly parallel. The code for that is:

int main(int argc, char* argv[]) {
   [...]

   vector<Deck *> optimalStackings;
   #pragma omp parallel shared(optimalStackings)
   {
      Deck *bestSoFar = new Deck(), *result;
      int i;

      #pragma omp for schedule(static) nowait
      for (i = 0; i < numTrials; i++) {
         Deck startingDeck;
         result = (Deck *)hillclimb(&startingDeck, trialLength);

         if (result->value() == 52) {
            #pragma omp critical
            {
               cout << result->show() << "\n";
               optimalStackings.push_back(result);
            }

            if (bestSoFar->value() < 52)
               delete bestSoFar;

            bestSoFar = result;
         } else if (bestSoFar->value() < result->value()) {
            delete bestSoFar;
            bestSoFar = result;
         } else {
            delete result;
         }
      }
   }

   [...]

   return 0;
}

For an explanation of the pragmas and their syntax, see this reference/tutorial from Lawrence Livermore National Labs. In the code posted, we tried two different levels of parallelization. First, we parallelized at the top-most level, as above. The second way was at the finest level, so that all threads cooperate to evaluate the score of a single deck and compute neighbors. The former method turned out to be somewhat more efficient than the latter for large problem sizes: in 1,000 iterations of steepest ascent, it was 17% faster. The speed increase comes from significantly lower thread-scheduling overhead, and from the fact that evaluating a single deck is very fast, so in the fine-grained version the thread-synchronization time at the end of every loop begins to outweigh the deck-evaluation time. However, for smaller problem sizes the fine-grained version’s efficiency drops off more slowly. We include a summary table of all of these details in a file called “interesting-properties.txt” included in the code archive.

We also include the full source code for the fine-granularity parallelization, with the mindset that it may perform better in other application domains, and it serves as an example implementation.

## Results

Running the algorithm overnight on the supercomputer gave us a list of approximately 100,000 distinct optimal stackings for two-player Hold ‘Em. In particular, we discovered the following interesting properties of stackings:

• Optimal stackings are plentiful. In particular, about one in six randomly chosen decks can be improved to an optimal stacking.
• The number of steps in the hill-climbing process is small. Specifically, in all of our tests we never witnessed a deck which required more than 15 steps to reach a local maximum, though we do not have a proof that this upper bound is sharp. Moreover, some decks were optimized in as few as 5 steps.
• There are optimal stackings both for player 1 and player 2.

We include the list of about 100,000 optimal stackings for the reader’s viewing pleasure, along with the game tables for their corresponding cuts, in a text file on this blog’s Google Code page. It’s a relatively large file (compressed ~70MB, uncompressed ~450MB). We note the possibility that some of these decks are redundant, but the probability of that happening is decidedly small. For a smaller set of examples, one can download the source code and look for a list of a thousand optimal stackings in the “decks” directory. We encourage the reader to try one out at a friendly neighborhood poker game. And finally, here is a list of ten optimal stackings, so that casual readers can avoid downloading and navigating the directory.

But there is still much we can ask about Hold ‘Em. For instance, how hard is it to find optimal stackings for any given player when there are $n$ players? More specifically, how fast does the complexity grow when you add more players? Are there significantly more semi-optimal stackings, in which we allow ties for player 1? What sorts of interesting patterns occur in optimal or semi-optimal stackings? Many of these sorts of questions could be answered by minor modifications to the source code provided. We leave that as an exercise for the reader, until we find enough time ourselves to go through and find optimal stackings which we believe to exist for any reasonable number of players and any position of the winning player.