# Kolmogorov Complexity – A Primer

## The Complexity of Things

Previously on this blog (quite a while ago), we’ve investigated some simple ideas of using randomness in artistic design (psychedelic art, and earlier randomized css designs), and measuring the complexity of such constructions. Here we intend to give a more thorough and rigorous introduction to the study of the complexity of strings. This naturally falls into the realm of computability theory and complexity theory, and so we refer the novice reader to our other primers on the subject (Determinism and Finite Automata, Turing Machines, and Complexity Classes; but Turing machines will be the most critical to this discussion).

## The Problem with Randomness

What we would really love to do is be able to look at a string of binary digits and decide how “random” it is. On one hand, this is easy to do at a high level. A child would have no difficulty deciding which of the following two strings is more random:

10101010101010101010101010101010101010101010101010
00011101001000101101001000101111010100000100111101

And yet, by the immutable laws of probability, each string has an equal chance ($2^{-50}$) in being chosen at random from all sequences of 50 binary digits. So in a sense, the huge body of mathematics underlying probability has already failed us at this basic juncture because we cannot speak of how random one particular outcome of an experiment is. We need a new notion that overcomes this difficulty.

Definition: The Kolmogorov complexity of a string $w$, denoted $K(w)$ is the length of the shortest program which outputs $w$ given no input.

While this definition is not rigorous enough to be of any use (we will reformulate it later), we can easily see why the first of the two strings above is less random. We can write a very short Python program that outputs it:

print "01" * 25

On the other hand, it is not hard to see that the shortest program that produces the second string is

print "00011101001000101101001000101111010100000100111101"

The dutiful reader will cry out in protest. How do you know that’s the shortest program? Why restrict to Python? This whole discussion is so arbitrary!

Indeed, this is probably not the strictly shortest Python program that prints out the string. In the following we’ll work entirely in binary, so the picky reader should interpret this as a print command referring to a block of binary memory. We will reify these ideas in full rigor shortly, but first let us continue with this naivete to make the coming definitions a bit easier to parse.

If we abstract away the lengths of these strings, we see that the length of the first program is $O(1) + \log(n)$, since we need $\log(n)$ bits to represent the number $n/2$ in the string product. On the other hand, the second string has a program length of $O(1) + n$, as we require the entire output string as program text.

This intuition would lead us to define a sequence of length $n$ to be random if it has Kolmogorov complexity at least $n$. One way to interpret this is: a string is “random” if the shortest program that outputs the string basically encodes the entire string in its source code.

We can extend this idea to talk about relative complexity. Specifically, we can speak of Python programs which accept input, and compute their output based on that input. For instance, the first of the two strings above has the program:

n = input()
print "01" * n/2

With respect to the input “50”, we see that the first string has constant complexity (indeed, this is also true of many numbers, such as 25). In other words, the string “50” contains a lot of information about the string we wish to generate (it’s length).

On the other hand, the same technique still won’t work for the second string. Even though it has length 50, that is not enough information to determine the contents of the string, which vary wildly. So the shortest program is still (probably) the one which prints out the string verbatim.

In the future, we plan to revisit the idea of relative complexity in the context of machine learning and classification; briefly, two items are similar if one has low complexity relative to the other. But for now, let us turn to the more precise definitions.

## Actual Kolmogorov Complexity

We keep saying of the second string above that the shortest program is probably the one which prints the string verbatim. In fact, short of testing every single python program of shorter length, we will never know if this is true! Even if we did, our following definitions will make that discovery irrelevant. More generally, it’s an important fact that Kolmogorov complexity is an uncomputable function. That is, there is no Turing machine $M$ which accepts as input a word $w$ and produces its Kolmogorov complexity as output. In order to prove such miraculous things, we need to formalize the discussion in the context of Turing machines.

Let us fix a universal programming language $L$ and speak of the Kolmogorov complexity with respect to $L$:

Definition: The Kolmogorov complexity of a string $w$ with respect to $L$, denoted $K_L(w)$ is the shortest program written in the language $L$ which produces $w$ as output. The conditional Kolmogorov complexity with respect to a string $x$, denoted $K_L(w | x)$ (spoken $w$ given $x$, as in probability theory), is the length of the shortest program which, when given $x$ as input, outputs $w$.

Before we can prove that this definition is independent of $L$ (for all intents and purposes), we need a small lemma, which we have essentially proved already:

Lemma: For any strings $w,x$ and any language $L$, we have $K_L(w | x) \leq |w| + c$ for some constant $c$ independent of $w, x$, and $K_L(w) \leq |w| + c’$ for some constant $c’$ independent of $w$.

Proof. The program which trivially outputs the desired string has length $|w| + c$, for whatever constant number of letters $c$ is required to dictate that a string be given as output. This is clearly independent of the string and any input. $\square$

It is not hard to see that this definition is invariant under a choice of language $L$ up to a constant factor. In particular, let $w$ be a string and fix two languages $L, L’$. As long as both languages are universal, in the sense that they can simulate a universal Turing machine, we can relate the Kolmogorov complexity of $w$ with respect to both languages. Specifically, one can write an interpreter for $L$ in the language of $L’$ and vice versa. Readers should convince themselves that for any two reasonable programming languages, you can write a finite-length program in one language that interprets and executes programs written in the other language.

If we let $p$ be the shortest program written in $L$ that outputs $w$ given $x$, and $i$ be the interpreter for $L$ written in $L’$, then we can compute $w$ given the input $\left \langle p, x \right \rangle$ by way of the interpreter $i$. In other words, $K_L(w | px) \leq |i|$.

$K_{L’}(w | x) \leq |p| + c + |i| = K_L(w | x) + c + |i| = K_L(w | x) + O(1)$

Another easy way to convince oneself of this is to imagine one knows the language $L’$. Then the program would be something akin to:

input x
run the interpreter i on the program p with input x

Our inequalities above just describe the length of this program: we require the entire interpreter $i$, and the entire program $p$, to be part of the program text. After that, it’s just whatever fixed constant amount of code is required to initialize the interpreter on that input.

We call this result the invariance property of Kolmogorov complexity. And with this under our belt, we may speak of the Kolmogorov complexity of a string. We willingly ignore the additive constant difference which depends on the language chosen, and we may safely work exclusively in the context of some fixed universal Turing machine (or, say, Python, C, Racket, or Whitespace; pick your favorite). We will do this henceforth, denoting the Kolmogorov complexity of a string $K(w)$.

## Some Basic Facts

The two basic facts we will work with are the following:

• There are strings of arbitrarily large Kolmogorov complexity.
• Most strings have high complexity.

And we will use these facts to prove that $K$ is uncomputable.

The first point is that for any $n$, there is a string $w$ such that $K(w) \geq n$. To see this, we need only count the number of strings of smaller length. For those of us familiar with typical combinatorics, it’s just the Pigeonhole Principle applied to all strings of smaller length.

Specifically, there is at most one string of length zero, so there can only be one string with Kolmogorov complexity zero; i.e., there is only one program of length zero, and it can only have one output. Similarly, there are two strings of length one, so there are only two strings with Kolmogorov complexity equal to one. More generally, there are $2^i$ strings of length $i$, so there are at most $\sum_{i = 0}^{n-1} 2^i$ strings of Kolmogorov complexity less than $n$. However, as we have seen before, this sum is $2^n – 1$. So there are too many strings of length $n$ and not enough of smaller length, implying that at least one string of length $n$ has Kolmogorov complexity at least $n$.

The second point is simply viewing the above proof from a different angle: at least 1 string has complexity greater than $n$, more than half of the $2^n$ strings have complexity greater than $n-1$, more than three quarters of the strings have complexity greater than $n-2$, etc.

Rigorously, we will prove that the probability of picking a string with $K(x) \leq n – c – 1$ is smaller than $1/2^c$. Letting $S$ be the set of all such strings, we have an injection $S \hookrightarrow$ the set of all strings of length less than $n – c- 1$, and there are $2^{n-c} – 1$ such strings, so $|S| \leq 2^{n-c} – 1$, giving the inequality:

$\displaystyle \frac{|S|}{2^n} \leq \frac{2^{n-c} – 1}{2^n} = \frac{1}{2^c} – \frac{1}{2^n} < \frac{1}{2^c}$

In other words, the probability that a randomly chosen string $w$ of length $n$ has $K(w) \geq n-c$ is at least $1 – 1/2^c$.

The strings of high Kolmogorov complexity have special names.

Definition: We call a string $w$ such that $K(w) \geq |w|$ a Kolmogorov random string, or an incompressible string.

This name makes sense for two reasons. First, as we’ve already mentioned, randomly chosen strings are almost always Kolmogorov random strings, so the name “random” is appropriate. Second, Kolmogorov complexity is essentially the ideal compression technique. In a sense, strings with high Kolmogorov complexity can’t be described in any shorter language; such a language would necessarily correspond to a program that decodes the language, and if the compression is small, so is the program that decompresses it (these claims are informal, and asymptotic).

## Uncomputability

We will now prove the main theorem of this primer, that Kolmogorov complexity is uncomputable.

Theorem: The Kolmogorov complexity function $w \mapsto K(w)$ is uncomputable.

Proof. Suppose to the contrary that $K$ is computable, and that $M$ is a Turing machine which computes it. We will construct a new Turing machine $M’$ which computes strings of high complexity, but $M’$ will have a short description, giving a contradiction.

Specifically, $M’$ iterates over the set of all binary strings in lexicographic order. For each such string $w$, it computes $K(w)$, halting once it finds a $w$ such that $K(w) \geq |w| = n$. Then we have the following inequality:

$n \leq K(w) \leq \left | \left \langle M’, n \right \rangle \right | + c$

Here, the angle bracket notation represents a description of the tuple $(M’, n)$, and $c$ is a constant depending only on $M’$, which is fixed. The reason the inequality holds is just the invariance theorem: $\left \langle M’, n \right \rangle$ is a description of $w$ in the language of Turing machines. In other words, the universal Turing machine which simulates $M’$ on $n$ will output $w$, so the Kolmogorov complexity of $w$ is bounded by the length of this description (plus a constant).

Then the length of $\left \langle M’, n \right \rangle$ is at most $\log(n) + \left | \left \langle M’ \right \rangle \right | + c’$ for some constant $c’$, and this is in turn $\log(n) + c”$ for some constant $c”$ (as the description of $M’$ is constant). This gives the inequality

$n \leq \log(n) + c”$

But since $\log(n) = o(n)$, we may pick sufficiently large $n$ to achieve a contradiction. $\square$

We can interpret this philosophically: it is impossible to tell exactly how random something is. But perhaps more importantly, this is a genuinely different proof of uncomputability from our proof of the undecidability of the Halting problem. Up until now, the only way we could prove something is uncomputable is to reduce it to the Halting problem. Indeed, this lends itself nicely to many new kinds of undecidability-type theorems, as we will see below.

The reader may ask at this point, “Why bother talking about something we can’t even compute!?” Indeed, one might think that due to its uncomputability, Kolmogorov complexity can only provide insight into theoretical matters of no practical concern. In fact, there are many practical applications of Kolmogorov complexity, but in practice one usually gives a gross upper bound by use of the many industry-strength compression algorithms out there, such as Huffman codes. Our goal after this primer will be to convince you of its remarkable applicability, despite its uncomputability.

## Consequences of Uncomputability and Incompressibility

One immediate consequence of the existence of incompressible strings is the following: it is impossible to write a perfect lossless compression algorithm. No matter how crafty one might be, there will always be strings that cannot be compressed.

But there are a number of other, more unexpected places that Kolmogorov complexity fills a niche. In computational complexity, for instance, one can give a lower bound on the amount of time it takes a single-tape Turing machine to simulate a two-tape Turing machine. Also in the realm of lower bounds, one can prove that an incompressible string $w$ of length $2^n$, which can be interpreted as a boolean function on $n$ variables, requires $\Omega(2^n/n)$ gates to express as a circuit.

It also appears outside of theoretical computer science. In, for instance, the study of entropy in dynamical systems (specifically, thermodynamics), one can make the following pictorial remark:

In fact, the author of this image, Scott Aaronson (whom we’ve seen before in our exploration of the Complexity Zoo) even proposes an empirical test of this fact: simulate a discretized coffee cup and try to compress the data at each step, graphing the resulting lengths to see the trends. This even sounds like a good project for this blog!

The applications don’t end there, though. Researchers have used the theory of Kolmogorov complexity to tackle problems in machine learning, clustering, and classification. Potentially, any subject which is concerned with relating pieces of information could benefit from a Kolmogorov-theoretic analysis.

Finally, we give a proof that the existence of Kolmogorov complexity provides an infinite number of mathematical statements which are unprovable.

Theorem: Fix a formalization of mathematics in which the following three conditions hold:

• If a statement is provable then it is true.
• Given a proof and a statement, it is decidable whether the proof proves the statement.
• For every binary string $x$ and integer $k$, we can construct a statement $S(x,k)$ which is logically equivalent to “the Kolmogorov complexity of $x$ is at least $k$”.

Then there is some constant $t$ for which all statements $S(x,k)$ with $k > t$ are unprovable.

Proof. We construct an algorithm to find such proofs as follows:

On an input k,
Set m equal to 1.
Loop:
for all strings x of length at most m:
for all strings P of length at most m:
if P is a proof of S(x,k), output x and halt
m = m+1

Suppose to the contrary that for all $k$ there is an $x$ for which the statement $S(x,k)$ is provable. It is easy to see that the algorithm above will find such an $x$. On the other hand, for all such proofs $P$, we have the following inequality:

$k \leq K(x) \leq \left | \left \langle M, k \right \rangle \right | + c = \log(k) + c’$

Indeed, this algorithm is a description of $x$, so its length gives a bound on the complexity of $x$. Just as in the proof of the uncomputability of $K$, this inequality can only hold for finitely many $k$, a contradiction. $\square$

Note that this is a very different unprovability-type proof from Gödel’s Incompleteness Theorem. We can now construct with arbitrarily high probability arbitrarily many unprovable statements.

Look forward to our future posts on the applications of Kolmogorov complexity! We intend to explore some compression algorithms, and to use such algorithms to explore clustering problems, specifically for music recommendations.

Until then!

# Turing Machines – A Primer

We assume the reader is familiar with the concepts of determinism and finite automata, or has read the corresponding primer on this blog.

## The Mother of All Computers

Last time we saw some models for computation, and saw in turn how limited they were. Now, we open Pandrora’s hard drive:

Definition: A Turing machine is a tuple $(S, \Gamma, \Sigma, s_0, F, \tau)$, where

• $S$ is a set of states,
• $\Gamma$ is a set of tape symbols, including a special blank symbol $b$,
• $\Sigma \subset \Gamma$ is a set of input symbols, not including $b$,
• $s_0$ is the initial state,
• $A \subset S$ is a set of accepting states,
• $R \subset S$ is a set of rejecting states,
• $\tau: S – (A \cup R) \times \Gamma \to S \times \Gamma \times \left \{ L, R \right \}$ is a partial function called the transition function, where $L, R$ correspond to “left shift” and “right shift,” respectively.

There are a few extra components we must address to clearly see how a Turing machine operates. First, the Turing machine has a tape of cells, infinite in length, upon which the machine may read and write letters from $\Gamma$. The process of reading a letter, analogous to our encounter with pushdown automata, is encapsulated in the $\Gamma$ component of the domain of $\tau$. In other words, this machine no longer is “fed” input in sequence. Rather, input is initially written to the tape, and the Turing machine receives this input by reading the tape. The rest of the tape (the complement of the finitely many cells containing input) is filled with $b$. Similarly, the process of writing to the tape is encapsulated in the $\Gamma$ component of the codomain of $\tau$.

The “shifting” part of $\tau$ requires another explanation. First, we restrict the Turing machine to being able to see only one cell of the tape at a time. In order to better visualize this, we invent a read-write head for the machine, which can by construction only process one cell at a time. Hence, the sequence of state transition goes: read a symbol from the cell currently under the read-write head, transition from one state to another, write a symbol to the same cell, then shift the read-write head one cell to the left or right.

Finally, we only allow entry into an accept or reject state. Once the machine enters one such state, it halts and outputs its respective determination.

Now, we could provide a detailed example of a Turing machine, with every aspect of the above definition accounted for. However, that is an unnecessarily bloated endeavor, and we leave such obsequiousness to Wikipedia, instead focusing on the bigger picture at hand. We gratefully take the liberty to stand on the lemma-encrusted shoulders of giants, and simply describe algorithms that are provably encodable on a Turing machine. The nature of permissible algorithms will become clearer as we give more examples.

## The Halting Problem

We now find ourselves capable of performing a very important new operation: infinitely looping. Specifically, it is not hard to design a Turing machine which never enters an accepting or rejecting state. Simply have one non-accept/reject state, $s$, and if we are in state $s$, shift left and write a 1. Despite having a finite input, this operation will never cease, nor will it ever be in the same configuration twice. This was never possible with a DFA, NFA, or PDA, because computation always ended with the last input symbol!

We require some new terminology to deal with this:

Definition: If a Turing machine halts on a given input, either accepting or rejecting, then it decides the input. We call an acceptance problem decidable if there exists some Turing machine that halts on every input for that problem. If no such Turing machine exists, we call the problem undecidable over the class of Turing machines.

In particular, we may describe our algorithms as vaguely as we wish, as long as it is clear that each step is provably decidable. Further, we may now write algorithms which loop over some decidable condition:

while the number of '1's on the tape is even:
move the head to a blank cell
write 'x' to the tape

accept

Notice that the above algorithm halts if and only if the tape begins with an odd number of ‘1’s written to it, and it never rejects.

Now we are threatened with a very dangerous question: how can we know a Turing machine will halt, accepting or rejecting appropriately? Rather than tackle this hard question, we will use it to our advantage to prove some amazing things. But first, we need to build up more machinery.

A Turing machine may additionally simulate storage: it may box off arbitrarily large portions of the tape to contain data we wish to save, including bounded numbers, characters (or numerical representations of characters), and larger compound data structures.

Finally, and this requires a small leap of faith, we may encode within a Turing machine descriptions of other Turing machines, and then process them. Indeed, we must accept that these descriptions are finite, for any Turing machine with infinite description would be effectively useless. Then, we may develop some fixed process for encoding a Turing machine as a string of 1’s and 0’s (say, a collapsed table of its state transitions). This is a function from the set of Turing machines to the set of descriptions, and we denote the encoding of $T$ as $[T]$.

Before actually using Turing machines as inputs to other Turing machine, we glean some important information about encodings. Since the set of finite strings (encodings) over any fixed alphabet is countable, we conclude that there are only countably many possible Turing machines. However, the set of subsets (possibly infinite) of the same fixed alphabet is uncountably large. Since every Turing machine can only decide one problem, there must exist uncountably many problems which are undecidable by the class Turing machines! Now, if we really wish, we may encode Turing machines by a single natural number, with respect to a fixed bijection with $\mathbb{N}$. For a refresher on countability and uncountability, see our primer on the subject.

Since we may encode the logic of one Turing machine, say $T$, within another, say $U$ we may use the tape and head of $U$ to simulate $T$ on a given input! We leave it as an exercise to the reader to figure out how to manage the tape when it must contain an encoding of $T$ and still simulate the tape of $T$. We call $U$ a universal Turing machine, or UTM. Now we see that Turing machines can reason about other Turing machines. Brilliant!

But now that we’ve established the existence of undecidable problems, we are given the task of finding one. We do so by construction, and arrive at the famous halting problem.

We denote an encoding of a Turing machine $T$ and an input $w$ to $T$ together as a pair $[T,w]$. Then, we construct the set of halting machine-input pairs:

$H = \left \{ [T,w] | T \textup{ is a Turing machine, } w \textup{ is an input to } T, \textup{ and } T \textup{ halts on } w \right \}$

We conjecture that this problem is undecidable, and prove it so by contradiction. Proof. Suppose $U$ is a Turing machine which decides acceptance in $H$. Construct another Turing machine $V$ as follows.

On input [T] (T is a Turing machine):
run U on [T,T]
if U rejects, accept
if U accepts, loop infinitely

Before the crux of the proof, let us recall that $U$ simply determines whether $T$ halts on an input. Then, when we run $V$ on $[T]$, we have the sub-computation of deciding whether $T$ halts when run on its own description. In this case, $V$ accepts when $T$ loops infinitely when run on itself, and $V$ loops infinitely otherwise.

Now (the magic!) run $V$ on $[V]$. If $V$ accepts, that means $V$, when run on itself, does not halt (i.e. $U$ rejects $[V,V]$), a contradiction. On the other hand, if $V$ loops infinitely, then $U$ rejects $[V,V]$, implying $V$ accepts, a contradiction.

Thus, we have proven that $V$ both halts and does not halt when run on itself! This glaring contradiction implies that $V$ cannot exist. But we built $V$ up from $U$ without logical error, so we conclude that $U$ cannot exist, and the theorem is proved.

## Wrapping Up

The theory of computing goes much further than the halting problem. Indeed, most undecidable problems are proven so by reducing them to the halting problem (if one can decide problem $X$ then one can decide the halting problem, a contradiction). But beyond decidability, there is a large field of study in computational efficiency, in which all studied algorithms are run on a Turing machine. Further, studies of complexity and alternative computational models abound, including a persistent problem of classifying “how hard” problems are to compute. The interested reader should Google “P vs. NP” for more information. Unfortunately, an adequate description of the various time classes and problems therein is beyond the scope of this blog. All we require is a working knowledge of the terminology used in speaking of Turing machines, and an idea of what kinds of algorithms can be implemented on one.

That’s all for now!

# Determinism and Finite Automata – A Primer

This half of the theory of computing primer will cover the various finite automata, including deterministic, nondeterministic, and pushdown automata. We devote the second half [upcoming] entirely to Turing machines and the halting problem, but to facilitate the discussion of Turing machines we rely on the intuition and notation developed here.

## Defining Computation

The first step in studying the sorts of possible computations (and more interestingly, those things which cannot be computed) is to define exactly what we mean by a “computation.” At a high level, this is easy: a computation is simply a function. Given some input, produce the appropriate output.

Unfortunately this is much too general. For instance, we could define almost anything we want in terms of functions. Let $f$ be the function which accepts as input the date of California Super Lotto drawings, and returns the set of winning numbers for that date. Of course (assuming the winning numbers are chosen randomly), this is ridiculous and totally uninformative. We cannot learn anything interesting about computations if they are performed behind an opaque veil. Instead, we need to rigorously define the innards of our model for computation, and then see what kinds of things it can do.

Because our inputs may be long and complicated, our model should break an input into pieces and work on the pieces in sequence. This motivates the following two definitions:

Definition: An alphabet is a finite set, usually denoted $\Sigma$.

Definition: $\Sigma^*$ is the set of finite strings whose characters come from $\Sigma$. We include the empty string $\varepsilon$ as the string of zero length. This operation is called the Kleene star.

We will henceforth let inputs to our model be elements of $\Sigma^*$ for some fixed alphabet $\Sigma$, and operate on each character one at a time.

Since $\Sigma$ is finite, we may assign to each element a unique finite string of ones and zeros (to keep things simple, let each string have length equal to ceiling of $\log |\Sigma|)$. Then, we can simulate any alphabet by the appropriate subset of $\left \{0,1 \right \}^*$. So without loss of generality, we may always assume our input comes from $\left \{ 0,1 \right \}^*$, and thus we will henceforth have $\Sigma = \left \{ 0,1 \right \}$.

Now we need a clear objective for our model, and we will phrase it again in terms of $\Sigma^*$. Specifically, we pick a subset of $A \subset \Sigma^*$, and call it a language. Our computational model will accept an input, and output “accept” or “reject.” We say that our model recognizes $A$ if it accepts and only accepts those strings in $A$.

Notice that this is a bit different from our colloquial idea of “computation.” Specifically, all our model will do is determine presence of an element in a set which has a potentially complicated construction. While it at first seems that this does not include things like “adding two numbers,” it turns out that we may phrase such objectives in terms of acceptance and rejection. Let $A = \left \{ axbxc | a,b,c \in \mathbb{N}, a+b=c \right \}$, where $x$ is some fixed separator. If we fix $a$ and $b$, then we may run our computational model on successively higher values of $c$ (ordered by absolute value, or restricting everything to be positive) until finding acceptance.

Colloquially, we have defined “computation” as recognizing all inputs which satisfy a qestion. Moreover, we will see that very many complex problems can be recognized as recognition problems, including things that are logically impossible to compute within our model.

The reason we have been heretofore vague in naming our model is that we will actually define four different models of progressively stronger computational ability, and this framework will hold for all of them. So here our first try.

## Deterministic Finite Automata

This definition comes from the intuitive idea that a computation can be carried out via a set of states and transitions between those states. A very simple example is a light switch, which has the states ‘on’ and ‘off’, and which can accept as input ‘switch’ or ‘do nothing’. This model is rigorized here:

Definition: A deterministic finite automaton, abbreviated DFA, is a five-tuple $D = (S, \Sigma, \tau, s_0, F)$, where:

• $S$ is a set of states, with initial state $s_0$.
• $\Sigma$ is our working alphabet, and inputs come from $\Sigma^*$.
• $\tau$ is a transition function $S \times \Sigma \to S$, which maps the current state and next input character to the next appropriate state.
• $F \subset S$ is a set of final states. If the last input character lands us in any state $f \in F$, we accept.

Rigorously, let our input word $w \in \Sigma^*$ have length $n$ and characters indexed $w_k$. Call the sequence of states in the computation $s_k$, where $s_{k+1} = \tau(s_k, w_k)$. Then our DFA accepts $w$ if and only if $s_n \in F$. We will use this notation throughout our discussion of finite automata. We call the set of languages which a DFA can recognize the regular languages. We will soon see that this class is rather small.

Of course, we have some examples of DFAs. The simplest possible DFA has only one state, $s$, and the transition function $\tau(s,a) = s$ for all $a \in \Sigma$. Depending on whether $F$ is empty, this DFA will accept or reject all of $\Sigma^*$. We may visualize this DFA as a directed graph:

A trivial DFA, with one state.

Indeed, this visualization provides a convenient way for us to write out the transition function. We simply draw an edge originating at each state for each element of our alphabet. When two or more inputs have the same behavior, we label one edge with multiple input symbols, as above.

Here is a more interesting example: let $A$ be the set of binary strings which have an even number of zeros. We design a DFA to accept this as follows:

A DFA which accepts binary numbers with evenly many zeros.

The shaded state, $s_0$ is the only final state. It is obvious from tracing the diagram that this DFA accepts precisely the set of strings with evenly many zeros. Let’s try something harder:

Let $A = \left \{ 0^n1^n | n \in \mathbb{N} \right \}$. This is a simple enough language, so let us develop a DFA to recognize it. We can easily construct a DFA which recognizes $0^n1^n$ for a fixed $n$, and if we have two DFAs we can construct a DFA which recognizes their union quite easily (do so as an exercise!). However, due to the restriction that $S$ is finite, we cannot connect these pieces together indefinitely! While we might imagine some cleverly designed DFA exists to recognize $A$, we find it more likely that no such DFA exists.

Indeed, it has been proven that no such DFA exists. The key to the proof lies in an esoteric lemma called the Pumping Lemma. Being a notational beast of a lemma, we will not state it rigorously. Colloquially, the lemma says that if $A$ is a regular language, then any sufficiently large word $w \in A$ can be split up into three pieces $xyz$, such that the middle piece may be repeated arbitrarily many times, as $xyyyy \dots yz$, with the resulting word still being in $A$. Clearly this breaks the “balance” restraint of $A$ no matter where the splitting is done, and so $A$ cannot be regular.

Before we increase our model’s expressive power, let us make an “enhancement” to the DFA.

## Nondeterministic Finite Automata

Following the colloquial definition of nondeterminism, we can design our computational system to make state transitions “branch out.” This will be made clear by a slight modification of the DFA.

Definition: A nondeterministic finite automaton, abbreviated NFA, is a DFA with the following two modifications:

• Instead of having a single state $s_k$ at each step, we allow a set of possible states, which we denote $S_k \subset S$.
• Our initial state $s_0$ is replaced with an initial set of states $S_0 \subset S$.
• We include $\varepsilon$ in $\Sigma$, allowing for immediate transitions that require no input.
• The transition function $\tau$ now has signature $S \times \Sigma \to P(S)$, where $P(S)$ denotes the power set of $S$, the set of all subsets of $S$.
• At each step (with input character $w_k$), we map $\tau(-,w_k)$ over $S_k$ to get $S_{k+1}$, i.e. $S_{k+1} = \left \{ \tau(s, w_k) | s \in S_k \right \}$
• The NFA accepts if any state in $S_n$ is final.

In this way, our machine can get a character as input, and proceed to be in two separate states at the same time. If any of the branches of computation accept, then they all accept.

Extending our example from DFAs, here is an NFA which accepts the language of binary strings which have either an even number of zeros or an even number of ones (or both).

An NFA recognizing strings containing evenly many zeros or evenly many ones

Here, $S_0 = F = \left \{ s_0, s_2 \right \}$, and by tracing (a little more carefully this time), we can see that it accepts what we expect.

Now, with the added strength of nondeterminism, we might expect that NFAs can compute some things that DFAs cannot. Amazingly, this is not the case. We point the reader to a more detailed description, but trust that our quick explanation will give the necessary intuition to see the equivalence.

The key observation is that despite nondeterminism, there is still a finite number of possible states an NFA can be in. Specifically, there are at most $2^{|S|}$ possible subsets of $S$, so we can simply construct a DFA which has $2^{|S|}$ states, one corresponding to each subset of $S$, and build up a deterministic transition function that makes things work. This would be an appropriate exercise for the advanced reader, but the beginning student should see Sisper’s Introduction to the Theory of Computation, which contains a complete and thorough proof. (We hate to refer the reader to an expensive textbook, but we could only find sparse proofs of this equivalence elsewhere on the internet, and most assume a particular working set of notation. And Sisper’s book is much better than this terse post for a primer in Computing Theory. The interested reader who lacks mathematical maturity would find it very accessible.)

In other words, every NFA has a corresponding DFA which recognizes precisely the same language. On the other hand, every DFA is trivially an NFA. Thus, the class of languages which NFAs recognize is also the regular languages.

So we have discovered that regular languages, and hence the DFAs and NFAs which recognize them, are quite limited in expressive power. Now let’s truly beef up our model with some mathematical steroids.

## Pushdown Automata

Okay, so these steroids won’t be that strong. Ideally, we just want to add as little as possible to our model and still make it more expressive. This way we can better understand the nuances of computational power, thus strengthening our intuition. While “little” is a subjective notion, lots of “littler” things were tried before arriving at a model that was not equivalent in power to a DFA. We will refrain from explaining them here, except to say that DFAs are equivalent in power to the class of formal Regular Expressions (hence the name, regular languages), which most programmers are familiar with on an informal level.

Definition: A pushdown automaton, denoted PDA, is an NFA with two additional components: $\Gamma$ a set of stack symbols including $\varepsilon$, and a modified transition function $\tau$:

$\tau: S \times \Sigma \times \Gamma \to P(S) \times \Gamma^*$

Here, the left hand $\Gamma$ symbol represents the current top of the stack (which is always popped), and the right hand $\Gamma^*$ represents any modification to the stack during state transition. This modification is always a push, and can be an empty push, or a push of arbitrarily many symbols.

For brevity of a description of $\tau$, which would otherwise be infinite as there are infinitely many possible stacks, we allow $\tau$ to be a partial function, not defined for some inputs. Then a PDA automatically rejects any encountered undefined input. Alternatively, we could construct $\tau$ as a total function, if we add edges for each undefined input going to some terminal state which is inescapable and not final. For the sake of this post, we will accept a partial transition function.

Recalling our discussion of computational ability in our post on Conway’s Game of Life, we recognize that a PDA’s new power is in its ability to grow without bound. Specifically, the stack has no limit to its size, and we can remember important pieces of input which would otherwise be discarded.

This modification seems like it was made just for the $0^n1^n$ problem. Specifically, now we can just push the 0s we see on to the stack, and pop just 1’s until we see an empty stack. Here is a PDA which does just that (making heavy use of epsilon transitions):

A PDA which recognizes $0^n1^n$

Here $S_0 = \left \{ s_0 \right \}$, and $F = \left \{ s_2 \right \}$. For each state transition label, we include $w_k; a \to bc$ to mean, “upon receiving $w_k$ as input with $a$ on top of the stack, pop $a$ and replace it with $bc$, where $bc$ may be a single character or multiple characters or the empty character. The epsilon transitions allow us to move from $s_0$ to $s_1$ and from $s_1$ to $s_2$ seamlessly, but only when the stack is agreeable. As an exercise, modify the above diagram to instead recognize $0^{2n}1^n$.

Neato! It seems like we’ve got a huge boost in computational power now that we can store some information. Wrong. As great as a stack is, it’s not good enough. Here’s a language that a PDA cannot solve: $A = \left \{ 0^n1^n2^n | n \in \mathbb{N} \right \}$. This follows from an additional and more nuanced pumping lemma, and we will not dwell on it here. But look at that! This language is hardly any more complex than $0^n1^n$, and yet it stumps our PDA.

Time to up the steroid dosage. Surprisingly enough, it turns out that our solution will be equivalent to adding another stack! In doing so, we will achieve a level of computational expression so powerful that nothing we know of today can surpass it. We call this panacea of computational woes a Turing machine, and we will cover both it and the wild problems it cannot compute next time.

Until then!

Until next time!