# The Universal Properties of Map, Fold, and Filter

A lot of people who like functional programming often give the reason that the functional style is simply more elegant than the imperative style. When compelled or inspired to explain (as I did in my old post, How I Learned to Love Functional Programming), they often point to the three “higher-order” functions map, fold, and filter, as providing a unifying framework for writing and reasoning about programs. But how unifying are they, really? In this post we’ll give characterizations of these functions in terms of universal properties, and we’ll see that in fact fold is the “most” universal of the three, and its natural generalization gives a characterization of transformations of standard compound data types.

By being universal or having a universal property, we don’t mean that map, fold, and filter are somehow mystically connected to all programming paradigms. That might be true, but it’s not the point of this article. Rather, by saying something has a universal property we are making a precise mathematical statement that it is either an initial or final object (or the unique morphism determined by such) in some category.

That means that, as a fair warning to the reader, this post assumes some basic knowledge of category theory, but no more than what we’ve covered on this blog in the past. Of particular importance for the first half of this post is the concept of a universal property, and in the followup post we’ll need some basic knowledge of functors.

## Map, Filter, and Fold

Recalling their basic definitions, map is a function which accepts as input a list $L$ whose elements all have the same type (call it $A$), and a function $f$ which maps $A$ to another type $B$. Map then applies $f$ to every entry of $L$ to produce a new list whose entries all have type $B$.

In most languages, implementing the map function is an elementary exercise. Here is one possible definition in ML.

fun map(_, nil) = nil


Next, filter takes as input a boolean-valued predicate $p : A \to \textup{bool}$ on types $A$ and a list $L$ whose entries have type $A$, and produces a list of those entries of $L$ which satisfy the predicate. Again, it’s implementation in ML might be:

fun filter(_, nil) = nil
else filter(p, tail)


Finally, fold is a function which “reduces” a list $L$ of entries with type $A$ down to a single value of type $B$. It accepts as input a function $f : A \times B \to B$, and an initial value $v \in B$, and produces a value of type $B$ by recursively applying $f$ as follows:

fun fold(_, v, nil) = v


(If this definition is mysterious, you’re probably not ready for the rest of this post.)

One thing that’s nice about fold is that we can define map and filter in terms of it:

fun map(f, L) = fold((fn x, xs =&gt; (f(x):xs)), [], L)
fun filter(p, L) = fold((fn x, xs =&gt; if p(x) then x:xs else xs end), [], L)


We’ll see that this is no accident: fold happens to have the “most general” universality of the three functions, but as a warm-up and a reminder we first investigate the universality of map and filter.

## Free Monoids and Map

Map is the easiest function of the three to analyze, but to understand it we need to understand something about monoids. A monoid is simple to describe, it’s just a set $M$ with an associative binary operation denoted by multiplication and an identity element for that operation.

monoid homomoprhism between two monoids $M, N$ is a function $f: M \to N$ such that $f(xy) = f(x)f(y)$. Here the multiplication on the left hand side of the equality is the operation from $M$ and on the right it’s the one from $N$ (this is the same definition as a group homomorphism). As is to be expected, monoids with monoid homomorphisms form a category.

We encounter simple monoids every day in programming which elucidate this definition. Strings with the operation of string concatenation form a monoid, and the empty string acts as an identity because concatenating a string to the empty string has no effect. Similarly, lists with list concatenation form a monoid, where the identity is the empty list. A nice example of a monoid homomorphism is the length function. We know it’s a homomorphism because the length of a concatenation of two lists is just the sum of the lengths of the two lists.

Integers also form a monoid, with addition as the operation and zero as the identity element. However, the list and string monoids have an extra special property that integers do not. For a number $n$ you can always find $-n$ so that $n + (-n) = 0$ is the identity element. But for lists, the only way to concatenate two lists and get the empty list is if both of the lists were empty to begin with. A monoid with this property is called free, and to fully understand it we should make a definition which won’t seem to mean anything at first.

Definition: Let $\mathbf{C}$ be a category. Given a set $A$, the free object over $A$, denoted $F(A)$, is an object of $\mathbf{C}$ which is universal with respect to set-maps $A \to B$ for any object $B$ in $\mathbf{C}$.

As usual with a definition by universal property, we should elaborate as to exactly what’s going on. Let $\mathbf{C}$ be a category whose objects are sets and let $A$ a set, possibly not in this category. We can construct a new category, $\mathbf{C}_{A \to X}$ whose objects are set-maps

$\displaystyle f: A \to X$

and whose morphisms are commutative diagrams of the form

where $\varphi$ is a morphism in $\mathbf{C}$. In other words, we simply require that $\varphi(f_1(a)) = f_2(a)$ for every $a \in A$.

By saying that the free monoid on $A$ satisfies this universal property for the category of monoids, we really mean that it is initial in this category of set-maps and commutative diagrams. That is, there is a monoid $F(A)$ and a set-map $i_A: A \to F(A)$ so that for every monoid $Y$ and set-map $f: A \to Y$ there is a unique monoid homomorphism from $i_A$ to $f$. In other words, there is a unique monoid homomorphism $\varphi$ making the following diagram commute:

For example, if $A$ is the set of all unicode characters, then $F(A)$ is precisely the monoid of all strings on those characters. The map $i_A$ is then the map taking a character $a$ to the single-character string “$a$”. More generally, if $T$ is any type in the category of types and computable functions, then $F(T)$ is the type of homogeneous lists whose elements have type $T$.

We will now restrict our attention to lists and types, and we will denote the free (list) monoid on a type $A$ as $[A]$. The universal property of map is then easy to describe, it’s just a specific instance of this more general universal property for the free monoids. Specifically, we work in the sub-category of homogeneous list monoids (we only allow objects which are $[T]$ for some type $T$). The morphism $i_A: A \to [A]$ just takes a value $a$ of type $A$ and spits out a single-item list $[a]$. We might hope that for any mapping $A \to [B]$, there is a unique monoid homomorphism $[A] \to [B]$ making the following diagram commute.

And while this is true, the diagram lies because “map” does not achieve what $\varphi$ does. The map $f$ might send all of $A$ to the empty list, and this would cause $\varphi$ to be the trivial map. But if we decomposed $f$ further to require it to send $a$ to a single $b$, such as

then things work out nicely. In particular, specifying a function $f: A \to B$ uniquely defines how a list homomorphism operates on lists. In particular, the diagram forces the behavior for single-element lists: $[a] \mapsto [f(a)]$. And the property of $\varphi$ being a monoid homomorphism forces $\varphi$ to preserve order.

Indeed, we’ve learned from this analysis that the structure of the list involved is crucial in forming the universal property. Map is far from the only computable function making the diagram commute, but it is clearly the only monoid homomorphism. As we’ll see, specifying the order of operations is more generally described by fold, and we’ll be able to restate the universal property of map without relying on monoid homomorphisms.

## Filter

The filter function is a small step up in complexity from map, but its universal property is almost exactly the same. Again filter can be viewed through the lens of a universal property for list monoids, because filter itself takes a predicate and produces a list monoid. We know this because filtering two lists by the same predicate and concatenating them is the same as concatenating them first and then filtering them. Indeed, the only difference here is that the diagram now looks like this

where $p$ is our predicate, and $j_A$ is defined by $(a, T) \mapsto [a]$ and $(a,F) \mapsto []$. The composition $j_A \circ p$ gives the map $A \to [A]$ that we can extend uniquely to a monoid homomorphism $[A] \to [A]$. We won’t say any more on it, except to mention that again maintaining the order of the list is better described via fold.

## Fold, and its Universal Property

Fold is different from map and filter in a very concrete way. Even though map and filter do specify that the order of the list should be preserved, it’s not an important part of their definition: filter should filter, and map should map, but no interaction occurs between different entries during the computation. In a sense, we got lucky that having everything be monoid homomorphisms resulted in preserving order without our specific intention.

Because it doesn’t necessarily produce lists and the operation can be quite weird, fold cannot exist without an explicit description of the order of operations. Let’s recall the definition of fold, and stick to the “right associative” order of operations sometimes specified by calling the function “foldr.” We will use foldr and fold interchangeably.

fun foldr(_, v, nil) = v


Even more, we can’t talk about foldr in terms of monoids. Even after fixing $f$ and $v$, foldr need not produce a monoid homomorphism. So if we’re going to find a universal property of foldr, we’ll need a more general categorical picture.

One first try would be to view foldr as a morphism of sets

$\displaystyle B \times \textup{Hom}(A \times B, B) \to \textup{Hom}([A], B)$

The mapping is just $f,b \mapsto \textup{foldr } f \textup{ } b$, and this is just the code definition we gave above. One might hope that this mapping defines an isomorphism in the category of types and programs, but it’s easy to see that it does not. For example, let

$A = \left \{ 1 \right \}, B = \left \{ 1,2 \right \}$

Then the left hand side of the mapping above is a set of size 8 (there are eight ways to combine a choice of element in $B$ with a map from $A \times B \to B$). But the right hand size is clearly infinite. In fact, it’s uncountably infinite, though not all possible mappings are realizable in programs of a reasonable length (in fact, very few are). So the morphism can’t possibly be a surjective and hence is not an isomorphism.

So what can we say about fold? The answer is a bit abstract, but it works out nicely.

Say we fix a type for our lists, $A$. We can define a category $\mathbf{C}_A$ which has as objects the following morphisms

By $1$ we mean the type with one value (null, if you wish), and $f$ is a morphism from a coproduct (i.e. there are implicit parentheses around $A \times B$). As we saw in our post on universal properties, a morphism from a coproduct is the same thing as a pair of functions which operates on each piece. Here one operates on $1$ and the other on $A \times B$. Since morphisms from $1$ are specified by the image of the sole value $f(1) = b$, we will often write such a function as $b \amalg h$, where $h: A \times B \to B$.

Now the morphisms in $\mathbf{C}_A$ are the pairs of morphisms $\varphi, \overline{\varphi}$ which make the following diagram commute:

Here we write $\overline{\varphi}$ because it is determined by $\varphi$ in a canonical way. It maps $1 \to 1$ in the only way one can, and it maps $(a,b) \mapsto (a, \varphi(b))$. So we’re really specifying $\varphi$ alone, but as we’ll see it’s necessary to include $\overline{\varphi}$ as well; it will provide the “inductive step” in the definition of fold.

Now it turns out that fold satisfies the universal property of being initial in this category. Well, not quite. We’re saying first that the following object $\mathscr{A}$ is initial,

Where cons is the list constructor $A \times [A] \to [A]$ which sends $(a_0, [a_1, \dots, a_n]) \mapsto [a_0, a_1, \dots, a_n]$. By being initial, we mean that for any object $\mathscr{B}$ given by the morphism $b \amalg g: 1 \coprod A \times B \to B$, there is a unique morphism from $\mathscr{A} \to \mathscr{B}$. The “fold” is precisely this unique morphism, which we denote by $(\textup{fold g, b})$. We implicitly know its “barred” counterpart making the following diagram commute.

This diagram has a lot going on in it, so let’s go ahead and recap. The left column represents an object $\mathscr{A}$ we’re claiming is initial in this crazy category we’ve made. The right hand side is an arbitrary object $\mathscr{B}$, which is equivalently the data of an element $b \in B$ and a mapping $g: A \times B \to B$. This is precisely the data needed to define fold. The dashed lines represent the unique morphisms making the diagram commute, whose existence and uniqueness is the defining property of what it means for an object to be initial in this category. Finally, we’re claiming that foldr, when we fix $g$ and $b$, makes this diagram commute, and is hence the very same unique morphism we seek.

To prove all of this, we need to first show that the object $\mathscr{A}$ is initial. That is, that any two morphisms we pick from $\mathscr{A} \to \mathscr{B}$ must be equal. The first thing to notice is that the two objects $1 \coprod A \times [A]$ and $[A]$ are really the same thing. That is, $\textup{nil} \amalg \textup{cons}$ has a two-sided inverse which makes it an isomorphism in the category of types. Specifically, the inverse is the map $\textup{split}$ sending

$\textup{nil} \mapsto \textup{nil}$
$[a_1, \dots, a_n] \mapsto (a_1, [a_2, \dots, a_n])$

So if we have a morphism $\varphi, \overline{\varphi}$ from $\mathscr{A} \to \mathscr{B}$, and the diagram commutes, we can see that $\varphi = (b \amalg g) \circ \overline{\varphi} \circ \textup{split}$. We’re just going the long way around the diagram.

Supposing that we have two such morphisms $\varphi_1, \varphi_2$, we can prove they are equal by induction on the length of the input list. It is trivial to see that they both must send the empty list to $b$. Now suppose that for lists of length $n-1$ the two are equal. Given a list $[a_1, \dots, a_n]$ we see that

$\displaystyle \varphi_1([a_1, \dots, a_n]) = \varphi_1 \circ \textup{cons} (a_1, [a_2, \dots, a_n]) = g \circ \overline{\varphi_1} (a_1, [a_2, \dots, a_n])$

where the last equality holds by the commutativity of the diagram. Continuing,

$\displaystyle g \circ \overline{\varphi_1} (a_1, [a_2, \dots, a_n]) = g (a_1, \varphi_1([a_2, \dots, a_n])) = g(a_1, \varphi_2(a_2, \dots, a_n))$

where the last equality holds by the inductive hypothesis. From here, we can reverse the equalities using $\varphi_2$ and it’s “barred” version to get back to $\varphi_2([a_1, \dots, a_n])$, proving the equality.

To show that fold actually makes the diagram commute is even simpler. In order to make the diagram commute we need to send the empty list to $b$, and we need to inductively send $[a_1, \dots, a_n]$ to $g(a_1, (\textup{fold g b})([a_2, \dots, a_n]))$, but this is the very definition of foldr!

So we’ve established that fold has this universal property. It’s easy now to see how map and filter fit into the picture. For mapping types $A$ to $C$ via $f$, just use the type $[C]$ in place of $B$ above, and have $g(a, L) = \textup{cons}(f(a), L)$, and have $b$ be the empty list. Filter similarly just needs a special definition for $g$.

A skeptical reader might ask: what does all of this give us? It’s a good question, and one that shouldn’t be taken lightly because I don’t have an immediate answer. I do believe that with some extra work we could use universal properties to give a trivial proof of the third homomorphism theorem for lists, which says that any function expressible as both a foldr and a foldl can be expressed as a list homomorphism. The proof would involve formulating a universal property for foldl, which is very similar to the one for foldr, and attaching the diagrams in a clever way to give the universal property of a monoid homomorphism for lists. Caveat emptor: this author has not written such a proof, but it seems likely that it would work.

More generally, any time we can represent the requirements of a list computation by an object like $\mathscr{B}$, we can represent the computation as a foldr. What good does this do us? Well, it might already be obvious when you can and can’t use fold. But in addition to proving when it’s possible to use fold, this new perspective generalizes very nicely to give us a characterization of arbitrary computations on compound data types. One might want to know when to perform fold-like operations on trees, or other sufficiently complicated beasts, and the universal property gives us a way to see when such a computation is possible.

That’s right, I said it: there’s more to the world than lists. Shun me if you must, but I will continue dream of great things.

In an effort to make the egregiously long posts on this blog slightly shorter, we’ll postpone our generalization of the universal property of fold until next time. There we’ll define “initial algebras” and show how to characterize “fold-like” computations any compound data type.

Until then!

# A Sample of Standard ML, the TreeSort Algorithm, and Monoids

In this post we will assume the reader has a passing familiarity with some of the basic concepts of functional programming (the map, fold, and filter functions). We introduce these topics in our Racket primer, but the average reader will understand the majority of this primer without expertise in functional programming.

Follow-ups to this post can be found in the Computational Category Theory section of the Main Content page.

## Preface: ML for Category Theory

A few of my readers have been asking for more posts about functional languages and algorithms written in functional languages. While I do have a personal appreciation for the Haskell programming language (and I plan to do a separate primer for it), I have wanted to explore category theory within the context of programming for quite a while now. From what I can tell, ML is a better choice than Haskell for this.

Part of the reason is that, while many Haskell enthusiasts claim it to be a direct implementation of category theory, Haskell actually tweaks category theoretic concepts in certain ways. I rest assured that the designers of Haskell (who are by assumption infinitely better at everything than I am) have very good reasons for doing this. But rather than sort through the details of the Haskell language specification to clarify the finer details, we would learn a lot more by implementing category theory by hand in a programming language that doesn’t have such concepts already.

And so we turn to ML.

ML, which stands for MetaLanguage, apparently has historical precedents for being a first in many things. One of these is an explicit recognition of parametric polymorphism, which is the idea that an operation can have the same functionality regardless of the types of the data involved; the types can, in effect, be considered variables. Another ground-breaking aspect of ML is an explicit type inference system. Similar to Haskell, an ML program will not run unless the compiler can directly prove that the program produces the correct types in every step of the computation.

Both of these features are benefits for the student of category theory. Most of our time in category theory will be spent working with very general assumptions on the capabilities of our data involved, and parametric polymorphism will be our main tool for describing what these assumptions are and for laying out function signatures.

As a side note, I’ve noticed through my ever-growing teaching experiences that one of the main things new programming students struggle with (specifically, after mastering the syntax and semantics of basic language constructs) is keeping their types straight. This is especially prominent in a language like Python (which is what I teach), where duck-typing is so convenient that it lulls the students into a false sense of security. Sooner as opposed to later they’ll add strings to numbers with the blind confidence that Python will simply get it. Around this time in their first semester of programming, I would estimate that type errors lie at the heart of 75% of the bugs my students face and fail to resolve before asking me for help. So one benefit of programming in ML for pedagogy is that it is literally impossible to make type errors. The second you try to run a program with bad types, the compiler points out what the expected type is and what the given (incorrect) type was. It takes a while to get used to type variables (and appeasing the type checker when you want to play fast and loose). But once you do you’ll find the only bugs that remain in your code are conceptual ones, which are of course much more rewarding and important bugs to fix.

So enough of this preamble. Let’s learn some ML!

## Warming Up: Basic Arithmetic and Conditionals

As one would expect, ML has variables and arithmetic which work in much the same way as other languages. Each variable declaration is prefixed by the word “val,” as below

val x = 7;
val y = 2;

This statements modify the global environment (the list of which variable names are associated to which values). Semicolons are required to terminate variable declarations at the global level. We can declare multiple variables in a single line using the “and” keyword

val x = 7 and y = 2;

As a precaution, “and” is only used in ML for syntactic conjunctions of variable/function declarations, and is only necessary when the two defined variable names are mutually defined in terms of each other (this can happen naturally for recursive function definitions). We will see in a moment that the logical and operation is denoted “andalso.”

We can also use pattern matching to bind these two variables in one line, much the same way as it might work in Python:

val (x,y) = (7,2);

We note that while ML does not require us to specify the type of a variable, the type is known and ever present under the surface. If we run the above code through the sml compiler (which after running the contents of a file opens a REPL to further evaluate commands), we see the following output

[opening vars.sml]
val x = 7 : int
val n = 2 : int

The environment is printed out to the user, and it displays that the two types are “int.”

Arithmetic is defined for integers, and the standard ones we will use are the expected +, -, *, and (not a slash, but) div. Here are some examples, and here is a list of all basic operations on ints. A few things to note: the unary negation operator is a tilde (~), and the semicolons are only used terminate statements in the REPL, which tells the compiler we’re ready for it to evaluate our code. Semicolons can also be used to place multiple statements on a single line of code. The “it” variable is a REPL construct which saves the most recent unbound expression evaluation.

- 3 + 6;
val it = 9 : int
- 6 div 3;
val it = 2 : int
- 2 * 9;
val it = 18 : int
- 2 - 9;
val it = ~7 : int
- ~9;
val it = ~9 : int

ML also has floating point arithmetic (in ML this type is called “real”), but treats it in a very prudish manner. Specifically (and this is a taste of the type checker doing its job too well), ML does not coerce types for you. If you want to multiply a real number and an integer, you have to first convert the int to a real and then multiply. An error will occur if you do not:

- val x = 4.0;
val x = 4.0 : real
- val y = 7;
val y = 7 : int
- x * y;
stdIn:5.1-5.6 Error: operator and operand don't agree [tycon mismatch]
operator domain: real * real
operand:         real * int
in expression:
x * y
- x * Real.fromInt(y);
val it = 28.0 : real

Here is a list of all operations on reals. We don’t anticipate using reals much, but it’s good to know that ML fervently separates them.

It seems odd that we’re talking so much about statements, because often enough we will be either binding function names (and tests) to the global environment, or restricting ourselves to local variable declarations. The latter has a slightly more complicated syntax, simply surrounding your variable declarations and evaluated code in a “let … in … end” expression. This will be a much more common construction for us.

let
val x = 7
val y = 9
in
(x + 2*y) div 3
end

The “in” expression is run with the combined variables from the ambient environment (the variables declared outside of the let) and those defined inside the let. The variables defined in the let leave scope after the “in” expression is evaluated, and the entire let expression as a whole evaluates to the result of evaluating the “in” expression. Clearly and example shows what is going on much more directly than words.

The last major basic expression form are the boolean expressions and operations. The type for booleans in ML is called “bool,” and the two possible values are “true,” and “false.” They have the usual unary and binary operations, but the names are a bit weird. Binary conjunction is called “andalso,” while binary disjunction is called “orelse.”

val a = true and b = false;
val c = (a andalso b) orelse ((not a) andalso (not b));

But aside from that, boolean expressions work largely as one would expect. There are the six standard numerical comparison functions, where testing for equality is given by a single equals sign (in most languages, comparison for equality is ==), and inequality is given by the diamond operator <>. The others are, as usual, <, <=, >, >=.

- 6 = 7;
val it = false : bool
- 6 = 6;
val it = true : bool
- 6 < 7;
val it = true : bool
- 7 <= 6;
val it = false : bool
- 6 <> 7;
val it = true : bool

ML also has the standard if statement, which has the following syntax, which is more or less the same as most languages:

- val x = if 6 < 7 then ~1 else 4;
val x = ~1 : int

ML gives the programmer more or less complete freedom with whitespace, so any of these expressions can be spread out across multiple lines if the writer desires.

val x = if 6 < 7
then
~1
else
4

This can sometimes be helpful when defining things inside of let expressions inside of function definitions (inside of other function definitions, inside of …).

So far the basics are easy, if perhaps syntactically different from most other languages we’re familiar with. So let’s move on to the true heart of ML and all functional programming languages: functions.

## Functions and cases, recursion

Now that we have basic types and conditions, we can start to define some simple functions. In the global environment, functions are defined the same way as values, using the word “fun” in the place of “val.” For instance, here is a function that adds 3 to a number.

fun add3(x) = x+3

The left hand side is the function signature, and the right hand side is the body expression. As in Racket, and distinct from most imperative languages, a function evaluates to whatever the body expression evaluates to. Calling functions has two possible syntaxes:

add3(5)
(add3 5)

In other words, if the function application is unambiguous, parentheses aren’t required. Otherwise, one can specify precedence using parentheses in either Racket (Lisp) style or in standard mathematical style.

The most significant difference between ML and most other programming languages, is that ML’s functions have case-checking. That is, we can specify what action is to be taken based on the argument, and these actions are completely disjoint in the function definition (no if statements are needed).

For instance, we could define an add3 function which nefariously does the wrong thing when the user inputs 7.

fun add3(7) = 2
| add3(x) = x+3

The vertical bar is read “or,” and the idea is that the possible cases for the function definition must be written in most-specific to least-specific order. For example, interchanging the orders of the add3 function cases gives the following error:

- fun add3(x) = x+3
stdIn:13.5-14.16 Error: match redundant
x => ...
-->   7 => ...

Functions can call themselves recursively, and this is the main way to implement loops in ML. For instance (and this is quite an inefficient example), I could define a function to check whether a number is even as follows.

fun even(0) = true
| even(n) = not(even(n-1))

Don’t cringe too visibly; we will see recursion used in less horrifying ways in a moment.

Functions with multiple arguments are similarly easy, but there are two semantic possibilities for how to define the arguments. The first, and simplest is what we would expect from a typical language: put commas in between the arguments.

fun add(x,y) = x+y

When one calls the add function, one is forced to supply both arguments immediately. This is usually how programs are written, but often times it can be convenient to only supply one argument, and defer the second argument until later.

If this sounds like black magic, you can thank mathematicians for it. The technique is called currying, and the idea stems from the lambda calculus, in which we can model all computation using just functions (with a single argument) as objects, and function application. Numbers, arithmetic, lists, all of these things are modeled in terms of functions and function calls; the amazing thing is that everything can be done with just these two tools. If readers are interested, we could do a post or two on the lambda calculus to see exactly how these techniques work; the fun part would be that we can actually write programs to prove the theorems.

Function currying is built-in to Standard ML, and to get it requires a minor change in syntax. Here is the add function rewritten in a curried style.

fun add(x)(y) = x+y

Now we can, for instance, define the add3 function in terms of add as follows:

val add3 = add(3)

And we can curry the second argument by defining a new function which defers the first argument appropriately.

fun add6(x) = add(x)(6)

Of course, in this example addition is commutative so which argument you pick is useless.

We should also note that we can define anonymous functions as values (for instance, in a let expression) using this syntax:

val f = (fn x => x+3)

The “fn x => x+3” is just like a “lambda x: x+3” expression in Python, or a “(lambda (x) (+ x 3))” in Racket. Note that one can also define functions using the “fun” syntax in a let expression, so this is truly only for use-once function arguments.

## Tuples, Lists, and Types

As we’ve discovered already, ML figures out the types of our expressions for us. That is, if we define the function “add” as above (in the REPL),

- fun add(x,y) = x+y;
val add = fn : int * int -> int

then ML is smart enough to know that “add” is to accept a list of two ints (we’ll get to what the asterisk means in a moment) and returns an int.

The curried version is similarly intuited:

- fun add(x)(y) = x+y;
val add = fn : int -> int -> int

The parentheses are implied here: int -> (int -> int). So that this is a function which accepts as input an int, and produces as output another function which accepts an int and returns an int.

But, what if we’d like to use “add” to add real numbers? ML simply won’t allow us to; it will complain that we’re providing the wrong types. In order to make things work, we can tell ML that the arguments are reals, and it will deduce that “+” here means addition on reals and not addition on ints. This is one awkward thing about ML; while the compiler is usually able to determine the most general possible type for our functions, it has no general type for elements of a field, and instead defaults to int whenever it encounters arithmetic operations. In any case, this is how to force a function to have a type signature involving reals:

- fun add(x:real, y:real) = x+y;
val add = fn : real * real -> real

If we’re going to talk about types, we need to know all of ML’s syntax for its types. Of course there are the basics (int, real, bool). Then there are function types: int -> int is the type of a function which accepts one int and returns an int.

We’ll see two new types in this section, and the first is the tuple. In ML, tuples are heterogeneous, so we can mix types in them. For instance,

- val tup = (true, 7, 4.4);
val tup = (true,7,4.4) : bool * int * real

Here the asterisk denotes the tuple type, and one familiar with set theory can think of a tuple as an element of the product of sets, in this case

$\displaystyle \left \{ \textup{true}, \textup{false} \right \} \times \mathbb{Z} \times \mathbb{R}$

Indeed, there is a distinct type for each possible kind of tuple. A tuple of three ints (int * int * int) is a distint type from a tuple of three booleans (bool * bool * bool). When we define a function that has multiple arguments using the comma syntax, we are really defining a function which accepts as input a single argument which is a tuple. This parallels exactly how functions on multiple arguments work in classical mathematics.

The second kind of compound type we’ll use quite often is the list. Lists are distinct from tuples in ML in that lists must be homogenous. So a list of integers (which has type “int list”) is different from a list of booleans.

The operations on lists are almost identical as in Haskell. To construct explicit lists use square brackets with comma-delimited elements. To construct them one piece at a time, use the :: list constructor operation. For those readers who haven’t had much experience with lists in functional programming: all lists are linked lists, and the :: operation is the operation of appending a single value to the beginning of a given list. Here are some examples.

- val L = [1,2,3];
val L = [1,2,3] : int list

- val L = 1::(2::(3::nil));
val L = [1,2,3] : int list

The “nil” expression is the empty list, as is the empty-square-brackets expression “[]”.

There is a third kind of compound type called the record, and it is analogous to a C struct, where each field is named. We will mention this in the future once we have a need for it.

The most important thing about lists and tuples in ML is that functions which operate on them don’t always have an obvious type signature. For instance, here is a function which takes in a tuple of two elements and returns the first element.

fun first(x,y) = x

What should the type of this function be? We want it to be able to work with any tuple, no matter the type. As it turns out, ML was one of the first programming language to allow this sort of construction to work. The formal name is parametric polymorphism. It means that a function can operate on many different kinds of types (because the actual computation is the same regardless of the types involved) and the full type signature can be deduced for each function call based on the parameters for that particular call.

The other kind of polymorphism is called ad-hoc polymorphism. Essentially this means that multiple (very different) operations can have the same name. For instance, addition of integers and addition of floating point numbers require two very different sets of instructions, but they both go under the name of +.

What ML does to make its type system understand parametric polymorphism is introduce so-called type variables. A type variable is any string prefixed by a single quote, e.g. ‘a, and they can represent any type. When ML encounters a function with an ambiguous type signature, it decides what the most general possible type is (which usually involves a lot of type variables), and then uses that type.

So to complete our example, the first function has the type

'a * 'b -> 'a

As a side note, the “first” function is in a sense the “canonical” operation that has this type signature. If nothing is known about the types, then no other action can happen besides the projection. There are some more interesting things to be said about such canonical operations (for instance, could we get away with not having to even define them at all).

The analogous version for lists is as follows. Note that in order to decompose a list into its first element and the tail list, we need to use pattern matching.

fun listFirst([x]) = x
| listFirst(head::tail) = head

And this function has the type signature ‘a list -> ‘a. As a slightly more complicated example (where we need recursion), we can write a function to test for list membership.

fun member(x, nil) = false
else member(x, tail)

If you run this program and see some interesting warning messages, see this StackOverflow question for a clarification.

## Defining New Types

The simplest way to define a new type is to just enumerate all possibilities. For instance, here is an enumerated datatype with four possibilities.

datatype maths = algebra | analysis | logic | computation

Then we can define functions which operate on those types using pattern matching.

fun awesome(algebra) = true
| awesome(analysis) = false
| awesome(logic) = false
| awesome(computation) = true

And this function has type maths -> bool (don’t take it too seriously :-)). We can also define data types whose constructors require arguments.

datatype language = functional of string*bool
| imperative of string
| other

Here we define a language to be functional or imperative. The functional type consists of a name and a boolean representing whether it is purely functional, while the imperative type just consists of a name. We can then construct these types by treating the type constructors as if they were functions.

val haskell = functional("Haskell", true) and
java = imperative("Java") and
prolog = other;

Perhaps more useful than this is to define types using type variables. A running example we will use for the remainder of this post is a binary tree of homogeneous elements at each node. Defining such types is easy: all one needs to do is place the appropriate type (with parentheses to clarify when the description of a type starts or ends) after the “of” keyword.

datatype 'a Tree = empty
| leaf of 'a
| node of (('a Tree) * 'a * ('a Tree))

We can create instances of an integer tree as expected:

val t2 = node(node(leaf(2), 3, leaf(4)), 6, leaf(8))

## The TreeSort Algorithm

We can define a host of useful operations on trees (of any type). For instance, below we compute the breadth of a tree (the total number of leaves), the depth of a tree (the maximal length of a path from the root to a leaf), and the ability to flatten a tree (traverse the tree in order and place all of the values into a list). These first two are a nice warm-up.

fun breadth(empty) = 0
| breadth(node(left, _, right)) = breadth(left) + breadth(right)

Here the underscore indicates a pattern match with a variable we won’t actually use. This is more space efficient for the compiler; it can avoid adding extra values to the current environment in a potentially deep recursion.

fun depth(empty) = 0
| depth(leaf(_)) = 1
| depth(node(left, _, right)) =
let
val (lDepth, rDepth) = (1 + depth(left), 1 + depth(right))
in
if lDepth > rDepth then lDepth else rDepth
end

This function should be self explanatory.

fun flatten(empty) = []
| flatten(leaf(x)) = [x]
| flatten(node(left, x, right)) =
flatten(left) @ (x :: flatten(right))

Here the @ symbol is list concatenation. This is not quite the most efficient way to do this (we are going to write a forthcoming post about tail-call optimization, and there we will see why), but it is certainly the clearest. In the final recursive call, we traverse the left subtree first, flattening it into a list in order. Then we flatten the right hand side, and put the current node’s element in between the two flattened parts.

Note that if our tree is ordered, then flatten will produce a strictly increasing list of the elements in the tree. For those readers unfamiliar with ordered binary trees, for all intents and purposes this is an “int Tree” and we can compare the values at different nodes. Then an ordered binary tree is a tree which satisfies the following property for each node: all of the values in the left child’s subtree are strictly smaller than the current node’s value, and all of the values in the right child’s subtree are greater than or equal to the current node’s value.

Indeed, we can use the flatten function as part of a simple algorithm to sort lists of numbers. First we insert the numbers in the unsorted list into a tree (in a way that preserves the ordered property at each step), and then we flatten the tree into a sorted list. This algorithm is called TreeSort, and the insert function is simple as well.

fun insert(x, empty) = leaf(x)
| insert(y, leaf(x)) = if x <= y
then node(empty, x, leaf(y))
else node(leaf(y), x, empty)
| insert(y, node(left, x, right)) =
if x <= y
then node(left, x, insert(y, right))
else node(insert(y, left), x, right)

If we’re at a nonempty node, then we just recursively insert into the appropriate subtree (taking care to create new interior nodes if we’re at a leaf). Note that we do not do any additional computation to ensure that the tree is balanced (that each node has approximately as many children in its left subtree as in its right). Doing so would digress from the point of this primer, but rest assured that the problem of keeping trees balanced has long been solved.

Now the process of calling insert on every element of a list is just a simple application of fold. ML does have the standard map, foldl, foldr, and filter functions, although apparently filter is not available in the standard library (one needs to reference the List module via List.filter).

In any case, foldl is written in the currying style and building the tree is a simple application of it. As we said, the full sorting algorithm is just the result of flattening the resulting tree with our in-order traversal.

fun sort(L) = flatten(foldl(insert)(empty)(L))

So there you have it! One of the simplest (efficient) sorting algorithms I can think of in about twelve lines of code.

## Free Monoid Homomorphisms: A More Advanced Example

Just to get a quick taste of what our series on category theory will entail, let’s write a program with a slightly more complicated type signature. The idea hinges on the idea that lists form a what’s called a free monoid. In particular,

Definition: monoid is a set $X$ equipped with an associative binary operation $\cdot: X \times X \to X$ and an identity element $1 \in X$ for which $x1 = 1x = x$ for all $x \in X$.

Those readers who have been following our series on group theory will recognize a monoid as a group with less structure (there is no guarantee of inverses for the $\cdot$ operation). The salient fact for this example is that the set of ML values of type $\textup{‘a list}$ forms a monoid. The operation is list concatenation, and the identity element is the empty list. Call the empty list nil and the append operation @, as it is in ML.

More than that, $\textup{‘a list}$ forms a free monoid, and the idea of freeness has multiple ways of realization. One sort of elementary way to understand freeness is that the only way to use the binary operation to get to the identity is to have the two summands both be the identity element. In terms of lists, the only way to concatenate two lists to get the empty list is to have both pieces be empty to begin with.

Another, more mature (more category-theoretical) way to think about freeness is to say is satisfies the following “universal” property. Call $A$ the set of values of type ‘a, and $[A]$ the set of values of type $\textup{‘a list}$, and suppose that $(B, \cdot_B, 1_B)$ is the datum of an arbitrary monoid. The universal property says that if we are given a function $f: A \to B$, and we take the canonical map $g: A \to [A]$ which maps an element $a \in A$ to the single-entry list $[a] \in [A]$, then there is a unique way to extend $f$ to a monoid homomorphism $f^* : [A] \to B$ on lists, so that $f^*g = f$. We have mentioned monoid homomorphisms on this blog before in the context of string metrics, but briefly a monoid homomorphism respects the monoid structure in the sense that (for this example) $f^*(a \textup{ @ } b) = f^*(a) \cdot_B f^*(b)$ no matter what $a, b$ are.

This was quite a mouthful, but the data is often written in terms of a so-called “commutative diagram,” whose general definition we will defer to a future post. The diagram for this example looks like:

The dashed line says we are asserting the existence of $f^*$, and the symbol $\exists !$ says this function exists and is uniquely determined by $f, g$. The diagram “commutes” in the sense that traveling from $A$ to $B$ along $f$ gives you the same computational result as traveling by $g$ and then $f^*$. The reason for the word “universal” will become clear in future posts, but vaguely it’s because the set $[A]$ is a unique “starting place” in a special category.

If this talk is too mysterious, we can just go ahead and prove that $f^*$ exists by writing a program that computes the function transforming $f \mapsto f^*$. We call the function “listMonoidLift” because it “lifts” the function $f$ from just operating on $A$ to the status of a monoid homomorphism. Very regal, indeed.

Part of the beauty of this function is that a number of different list operations (list length, list sum, member tests, map, etc.), when viewed under this lens, all become special cases of this theorem! By thinking about things in terms of monoids, we write less code, and more importantly we recognize that these functions all have the same structural signature. Perhaps one can think of it like parametric polymorphism on steroids.

fun listMonoidLift(f:('a->'b), (combine:(('b * 'b) -> 'b), id:'b)) =
let
fun f'(nil) = id
in
f'
end

Here we specified the types of the input arguments to be completely clear what’s what. The first argument is our function $f$ as in the above diagram, and the second two arguments together form the data of our monoid (the set $B$ is implicitly the collection of types $‘b$ determined at the time of the function call). Now let’s see how the list summation and list length functions can be written in terms of the listMonoidLift function.

fun plus(x, y) = x + y
val sum = listMonoidLift((fn x => x), (plus, 0))
val length = listMonoidLift((fn x => 1), (plus, 0))

The plus function on integers with zero as the identity is the monoid $B$ in both cases (and also happens to be $A$ by coincidence), but in the summation case the function $f$ is the identity and for length it is the constant $1$ function.

As a more interesting example, see how list membership is a lift.

fun member(x) = listMonoidLift((fn y => y = x),
((fn (a,b) => a orelse b), false))

Here the member function is curried; it has type ‘a -> ‘a list -> bool (though it’s a bit convoluted since listMonoidLift is what’s returning the ‘a list -> bool part of the type signature). Here the $B$ monoid is the monoid of boolean values, where the operation is logical “or” and the identity is false. It is a coincidence and a simple exercise to prove that $B$ is a free monoid as well.

Now the mapping $f(y)$ is the test to see if $y$ is the same object as $x$. The lift to the list monoid will compute the logical “or” of all evaluations of $f$ on the values.

Indeed, (although this author hates bringing up too many buzzwords where they aren’t entirely welcome) the monoid lifting operation we’ve just described is closely related to the MapReduce framework (without all the distributed computing parts). Part of the benefit of MapReduce is that the programmer need only define the Map() and Reduce() functions (the heart of the computation) and MapReduce does the rest. What this example shows is that defining the Map() function can be even simpler: one only needs define the function $f$, and Map() is computed as $f^*$ automatically. The Reduce() part is simply the definition of the target monoid $B$.

Just to drive this point home, we give the reader a special exercise: write map as a special case of listMonoidLift. The result (the map function) should have one of the two type signatures:

map : ('a -> 'b) * ('a list) -> 'b list
map : 'a -> 'b -> 'a list -> b' list

As a hint, the target monoid should also be a list monoid.

Part of why this author is hesitant to bring up contemporary software in discussing these ideas is because the ideas themselves are far from contemporary. Burstall and Landin’s, 1969 text Programs and Their Proofs (which this author would love to find a copy of) details this exact reduction and other versions of this idea in a more specific kind of structure called “free algebras.” So MapReduce (minus the computer programs and distributed systems) was a well-thought-out idea long before the internet or massively distributed systems were commonplace.

In any case, we’ll start the next post in this series right off with the mathematical concept of a category. We’ll start slow and detail many examples of categories that show up both in (elementary) mathematics and computing, and then move on to universal properties, functors, natural transformations, and more, implementing the ideas all along the way.

Until then!

# Metrics on Words

We are about to begin a series where we analyze large corpora of English words. In particular, we will use a probabilistic analysis of Google’s ngrams to solve various tasks such as spelling correction, word segmentation, on-line typing prediction, and decoding substitution ciphers. This will hopefully take us on a wonderful journey through elementary probability, dynamic programming algorithms, and optimization.

As usual, the code implemented in this post is available from this blog’s Github page, and we encourage the reader to use the code to implement our suggested exercises. But before we get there, we should investigate some properties of our domain: the set of all finite strings of letters.

## Words, Words, Words.

If we consider a fixed alphabet $\Sigma$ (any set, really, but for our purposes a finite one), we may consider the set $\Sigma^*$ of all finite strings of elements from $\Sigma$, also called words. For example, given $\Sigma = \left \{ u,v,w \right \}$, we have the word $u \cdot w \cdot w \cdot v \cdot u \in \Sigma^*$. Also, we allow the empty word $\varepsilon$ to be a string of length zero. The formal name for the “star” operation is the Kleene star. Most of our work here will be done over the English alphabet of letters $\Sigma = \left \{ a, b, c, \dots , z \right \}$.

As usual, we are looking for some sort of underlying structure. Here the structure is that two words can be concatenated to make a larger string. In the parlance of abstract algebra, $\Sigma^*$ is a monoid with respect to the concatenation operation. If we denote the operation by (pretending it is) multiplication, we write $u \cdot v = uv$, and the monoid structure just means two things. First, the $\cdot$ operation is associative, so that any three words $r,s,t$ satisfy $(r \cdot s) \cdot t = r \cdot (s \cdot t)$. Second, it has an identity element (here the empty word), so that $\varepsilon w = w \varepsilon$ for all words $w$. For computer scientists, these are just natural properties of functions like C’s strcat(), but in mathematics they define the structure of the space of all words. To be completely clear, these two properties (a set with an associative binary operation and an identity element) define a monoid.

We make a few abuses of notation here. In every monoid the operation is a pretend multiplication, so in general we will call it multiplication. We will also write strings (abusively, “products”) as $a^4b^2$, which would formally be $a \cdot a \cdot a \cdot a \cdot b \cdot b$. We lose nothing by the abuses, but gain brevity.

The Kleene starred monoid $\Sigma^*$ has an additional property; it is the free monoid generated by $\Sigma$. This won’t mean anything to the reader who isn’t familiar with universal properties, but it essentially tells us that any word $w \in \Sigma^*$ is uniquely written as a product of letters in $\Sigma$.

Now, the structure we’ve described here is not particularly rich. In fact, free objects are the algebraic objects which are usually “completely understood.” For our purposes the language of abstract algebra is just a mature setting for our discussion, but the concepts we introduce will give an extra perspective on the topic. In other words, as we don’t plan to use any monoids more complicated than the English alphabet free monoid described above, we have no interesting (general) theorems to apply to our activities.

Before we turn to a more concrete setting, we have one more definition. A monoid homomorphism between two monoids $M_1, M_2$ is a function $f : M_1 \to M_2$ which respects the multiplication operations and preserves the identity element. Rigorously, we have that all words $u,v \in M_1$ satisfy $f(uv) = f(u)f(v)$, where the operation on the left side of the equality (before the application of $f$) is multiplication in $M_1$, and the one on the right hand side is multiplication in $M_2$.

One easy example of a monoid homomorphism from our English alphabet free monoid is the length homomorphism. Rigorously, the set of natural numbers $\mathbb{N}$ is a monoid under addition, and the function $\textup{length} : \Sigma^* \to \mathbb{N}$ which assigns to each word its length is a homomorphism of monoids. This is intuitively clear: the length of a concatenation of two words is the sum of the lengths of the pieces.

A more complex example which shows up in functional programming has to do with lists. Let $X, Y$ be two classes of objects of some fixed types, then we may consider $X^*$ as the set of all lists of objects in $X$. This is again a free monoid over $X$ with the operation of list appending and the empty list as the identity. Note that $X$ sits inside $X^*$ in a natural way: each element of $X$ can be considered a list of length one. With this understanding, we may be sloppy in calling the “product” of $x,y \in X$ the list $xy \in X^*$ (note, $X$ itself need not have any operations).

Now for any fixed operation $g : X \to Y$, we may form the map homomorphism $\mu_g: X^* \to Y^*$ inductively as follows:

$\mu_g(\varepsilon) = \varepsilon$
$\mu_g(x_1 \dots x_n) = g(x_1) \mu_g(x_2 \dots x_n))$

This is precisely the map operation defined in our primer on functional programming. We encourage the reader to investigate how to phrase the other two functions (filter and fold) as monoid homomorphisms, or prove it cannot be done (thanks to Matt for pointing out this author’s mistake with regards to that).

Metrics, and String Comparisons

Since our goal is to do things like spelling correction, we are quite interested in strings of letters which are not actually words. We want to be able to tell someone that the word “beleive” is probably a misspelling of the word “believe.” So let us fix our alphabet $\Sigma = \left \{ a, b, c, \dots , z \right \}$ and consider the free monoid $\Sigma^*$. As we have noted, this is the set of all words one could type with the lowercase English alphabet, so it includes all of our egregious typos. It is a simplistic model, since we ignore punctuation, capitalization, and stylistic marks that convey meaning. But it is as good a place as any to start our investigation.

To mathematically describe what it means for a misspelled word to be “almost” the intended word, we need to bring in the concept of a metric. In other words, we want to view our set $\Sigma^*$ as a metric space in which we can measure the distance between any two words (when viewed as a metric space, we will call them points). Then the set of all valid English words $E \subset \Sigma^*$ is a subspace. To correct a misspelled word $w \in \Sigma^*$, we can simply use the closest point in $E$ with respect to the metric.

Of course, the hard part is describing the right metric. But before we get there, we must define a metric so we know what properties to aim for in constructing a metric on words.

Definition: A metric $d : X \times X \to \mathbb{R}$ is a function on a set $X$ which has the following three properties for all $x,y,z \in X$

• $d(x,y) \geq 0$, and $d(x,y) = 0$ if and only if $x = y$.
• $d(x,y) = d(y,x)$
• $d(x,y) + d(y,z) \geq d(x,z)$ (the triangle inequality)

A space $X$ equipped with a fixed metric $d$ is said to be a metric space.

There are plenty of interesting examples of metrics, and we refer the interested reader to Wikipedia, or to any introductory topology text (or the end of a real analysis text). We will focus on the Levenshtein metric.

If we think for a minute we can come up with a list of ways that people make typing mistakes. Sometimes we omit letters (as in diferent), sometimes we add too many letters (e.g., committment), and sometimes we substitute one letter for another (missussippi could be a phonetic error, or a slip of the finger on a qwerty keyboard). Furthermore, we can traverse from one word to another by a sequence of such operations (at worst, delete all letters and then insert the right letters). So it would make sense to take the distance between two words to be the smallest number of such transformations required to turn one word into another.

More rigorously, let $u = u_1 \dots u_k$ be the unique way to write $u$ as a product of letters, and let $v = v_1 \dots v_j$ be the same for $v$. An elementary edit of $u$ is one of the following:

• a deletion: the transformation $u_1 \dots u_i \dots u_k \to u_1 \dots \widehat{u_i} \dots u_k$ for some $1 \leq i \leq k$, where the hat omits omission in the $i$-th spot.
• an insertion: the transformation $u_1 \dots u_k \to u_1 \dots u_i x \dots u_k$ for some $1 \leq i \leq k$, and some letter $x \in \Sigma$.
• a substitution: the transformation $u_1 \dots u_i \dots u_k \to u_1 \dots u_{i-1}xu_{i+1} \dots u_k$ for some $1 \leq i \leq k-1$ and some letter $x$.

Then an edit from $u$ to $v$ is a sequence of elementary edits which begins with $u= u_1 \dots u_k$ and ends in $v= v_1 \dots v_j$. The length of an edit is the number of elementary edits in the sequence. Finally, we define the edit distance between $u$ and $v$, denoted $d(u,v)$, as the length of the shortest edit from $u$ to $v$.

To verify this is a metric, we note that all edits have non-negative length, and the only edit of length zero is the edit which does nothing, so if $d(x,y) = 0$ it follows that $x = y$. Second, we note that edits are symmetric inherently, in that if we have an edit from $x$ to $y$, we may simply reverse the sequence and we have a valid edit from $y$ to $x$. Clearly, the property of being the shortest edit is not altered by reversal.

Last, we must verify the triangle inequality. Let $x,y,z$ be words; we want to show $d(x,z) \leq d(x,y) + d(y,z)$. Take two shortest edits between $x,y$ and $y,z$, and note that their composition is a valid edit from $x$ to $z$. Following our definition, by “compose” we mean combine the two sequences of operations into one sequence in the obvious way. Since this is an edit, its length can be no smaller than the shortest edit from $x$ to $z$, proving the claim.

So $d$ is in fact a metric, and historically it is called Levenshtein’s metric.

## A Topological Aside

Before we get to implementing this metric, we have a few observations to make. First, we note that the shortest edit between two words is far from unique. In particular, the needed substitutions, insertions, and deletions often commute (i.e. the order of operations is irrelevant). Furthermore, instead of simply counting the number of operations required, we could assign each operation a cost, and call the total cost of an edit the sum of the costs of each elementary edit. This yields a large class of different metrics, and one could conceivably think of new operations (combinations of elementary operations) to assign lower costs. Indeed, we will do just that soon enough.

Second, and more interestingly, this metric provides quite a bit of structure on our space. It is a well known fact that every metric induces a topology. In other words, there is a topology generated by the open balls $\left \{ x : d(x,y) < r \right \}$ for all possible radii $r \in \mathbb{R}$ and all centers $y$. We can also characterize the topology from another viewpoint: consider the infinite graph $G$ where each vertex is a word in $\Sigma^*$ and two words have a connecting edge if there exists an elementary edit between them. Then edit distance in $\Sigma^*$ is just the length of a shortest path in $G$, and so the spaces are isometric, and hence homeomorphic (they have identical topologies). Indeed, this is often generalized to the word metric on a group, which is beyond the scope of this post (indeed, we haven’t gotten anywhere close to group theory yet on this blog!).

For those of us unfamiliar with topology or graph theory, we can still imagine geometric notions that get to the intuitive heart of what “induced topology” means for words. For example, we can describe a circle of radius $r$ centered at a word $w$ quite easily: it is just the set of all words whose edit distance from $w$ is exactly $r$. As a concrete example, the circle of radius 1 centered at the word $a$ is

$\left \{ \varepsilon, b, c, \dots , z, aa, ab, ac, \dots , az, ba, ca, \dots , za \right \}$

In fact, any geometric construction that can be phrased entirely in terms of distance has an interpretation in this setting. We encourage the reader to think of more.

## Python Implementation, and a Peek at Dynamic Programming

Of course, what use are these theoretical concerns to us if we can’t use it to write a spell-checker? To actually implement the damn thing, we need a nontrivial algorithm. So now let’s turn to Python.

Our first observation is that we don’t actually care what the edits are, we just care about the number of edits. Since the edits only operate on single characters, we can define the behavior recursively. Specifically, suppose we have two words $u = u_1 \dots u_k$ and $v_1 \dots v_j$. If $u_k = v_j$, we can leave the last characters the same and inductively work with the remaining letters. If not, we find the shortest edit between $u_1 \dots u_{k-1}$ and $v_1 \dots v_{j}$, as if our last operation were a deletion of $u_k$. Similarly, we can inductively find the shortest distance between $u_1 \dots u_k$ and $v_1 \dots v_{j-1}$, as if our last move were an insertion of $v_j$ to the end of $u$. Finally, we could find the shortest distance between $u_1 \dots u_{k-1}$ and $v_1 \dots v_{j-1}$, as if our last move were a substitution of $u_k$ for $v_j$. For the base case, if any word is empty, then the only possible edit is inserting/deleting all the letters in the other word.

Here is precisely that algorithm, written in Python:

def dist(word1, word2):
if not word1 or not word2:
return max(len(word1), len(word2))
elif word1[-1] == word2[-1]:
return dist(word1[:-1], word2[:-1])
else:
return 1 + min(dist(word1[:-1], word2),
dist(word1, word2[:-1]),
dist(word1[:-1], word2[:-1]))

Here the [:-1] syntax indicates a slice of the first $n-1$ characters of an $n$ character string. Note again that as we don’t actually care what the operations are, we can simply assume we’re doing the correct transformation, and just add 1 to our recursive calls. For a proof of correctness, we refer the reader to Wikipedia (Sorry! It’s just a ton of case-checking). We also note that recursion in Python can be extremely slow for large inputs. There is of course a method of building up a cost matrix from scratch which would perform better, but we feel this code is more legible, and leave the performance tuning as an exercise to the reader. For more information on dynamic programming, see this blog’s primer on the subject.

The cautious programmer will note the above algorithm is terribly wasteful! For instance, suppose we’re investigating the distance between $foo$ and $bar$. Through our recursive calls, we’ll first investigate the distance between $fo$ and $bar$, during which we recursively investigate $fo$ versus $ba$. Once that’s finished, we go ahead and investigate the other branch, $foo$ versus $ba$, during which we look at $fo$ versus $ba$ once more, even though we already computed it in the first branch! What’s worse, is that we have a third branch that computes $fo$ versus $ba$ again! Doing a bit of algorithm analysis, we realize that this algorithm is $O(3^{\min(n,m)})$, where $m, n$ are the lengths of the two compared words. Unacceptable!

To fix this, we need to keep track of prior computations. The technical term is memoized recursion, and essentially we want to save old computations in a lookup table for later reference. In mostly-Python:

cache = {}
def memoizedFunction(args):
if args not in cache:
cache[args] = doTheComputation(args)
return cache[args]

To actually implement this, it turns out we don’t need to change the above code at all. Instead, we will use a decorator to modify the function as we wish. Here’s the code, which is essentially an extra layer of indirection applied to the above pseudocode.

def memoize(f):
cache = {}

def memoizedFunction(*args):
if args not in cache:
cache[args] = f(*args)
return cache[args]

memoizedFunction.cache = cache
return memoizedFunction

Here the function memoize() will accept our distance function, and return a new function which encapsulates the memo behavior. To use it, we simply use

def f(x):
...

equivalentButMemoizedFunction = memoize(f)

But luckily, Python gives a nice preprocessor macro to avoid writing this for every function we wish to memoize. Instead, we may simply write

@memoize
def f(x):
...

And Python will make the appropriate replacements of calls to f with the appropriate calls to the memoized function. Convenient! For further discussion, see our post on this technique in the program gallery.

Applying this to our Levenshtein metric, we see an impressive speedup, and a quick analysis shows the algorithm takes $O(nm)$, where $n, m$ are the lengths of the two words being compared. Indeed, we are comparing (at worst) all possible prefixes of the two words, and for each of the $n$ prefixes of one word, we compute a distance to all $m$ prefixes of the other word. The memoization prevents us from doing any computation twice.

To this author, this approach is the most natural implementation, but there are other approaches worth investigating. In particular, Python limits the recursion depth to a few hundred. If we try to compare, say, two DNA sequences, this algorithm will quickly overflow. There are a number of ways to fix this, the most appropriate of which would be tail call optimization (in this author’s humble opinion). Unfortunately, we’d need to tweak the algorithm a bit to put the recursive call in tail position, Python does not support tail call optimization, and manually putting things in continuation-passing style is annoying, obfuscating, and quite ugly. If we decide in the future to do DNA sequence analysis, we will return to this problem.

In the future, we plan to provide another Python primer, with a focus on dynamic algorithms. Other methods for solving this problem will arise there. Indeed, I’m teaching an introductory Python programming course next semester, so this will be a good refresher.

## Transpositions, and Other Enhancements

One other significant kind of typo is a transposition. Often times we type the correct letters in a word, but jumble the order of two words. In a spell checker, we want the word $thier$ to be closer to the word $their$ than it is to the word $cheer$, but with the Levenshtein metric the two pairs have equal distance (two substitutions each). We can enhance the metric by making transpositions have a cost of 1. Historically, this extended metric is called the Damerau-Levenshtein metric. Indeed, Damerau himself gave evidence that transpositions, along with the other three elementary edits, account for over 85% of human typing errors. Then again, that was back in the sixties, and typing has changed in many ways since then (not the least of which is a change in a typist’s vocabulary).

Adding transpositions to the algorithm above seems straightforward, but there are some nontrivial details to consider. For instance, we may first transpose two letters and then insert a new letter between them, as in the transformation from $ta$ to $act$. If we are not careful, we might prohibit such legal transformations in our algorithm. Here is an implementation, which again uses the memoization decorator.

@memoize
def dist2(word1, word2):
if not word1 or not word2:
return max(len(word1), len(word2))
elif word1[-1] == word2[-1]:
return dist2(word1[:-1], word2[:-1])
else:
minDist = 1 + min(dist2(word1[:-1], word2),
dist2(word1, word2[:-1]),
dist2(word1[:-1], word2[:-1]))
# transpositions
if len(word1) &gt; 1 and len(word2) &gt; 1:
if word1[-2] == word2[-1]:
transposedWord1 = word1[:-2] + word1[-1] + word1[-2]
minDist = min(minDist, dist2(transposedWord1[:-1], word2))
if word2[-2] == word1[-1]:
transposedWord2 = word2[:-2] + word2[-1] + word2[-2]
minDist = min(minDist, dist2(word1, transposedWord2[:-1]))
return minDist

Indeed, we must inspect both possible transpositions, and the symmetry of the example above shows the need for both branches. The proof that this extended metric is still a metric and the proof of algorithmic correctness are nearly identical to the plain Levenshtein metric.

So that was fun. Here are some other ideas we leave as exercises to the reader. First, if we allow ourselves to fix a keyboard layout (for many languages with Latin-based alphabets, the standard is qwerty with minor substitutions), we could factor that in to our analysis of letter substitutions and incorrect insertions. For instance, the word $ribies$ is just as close to $rabies$ as it is to $rubies$, but it is less likely the user meant to type the first word, since $u$ is physically closer to $i$ than $a$ is. To implement this, we can modify the above algorithm to accept a look-up table of physical distances (approximations) between keys. Instead of adding 1 in the relevant branches, we can add a cost according to the look-up table. At the coarsest, we could construct a graph with vertices representing letters, edges representing physical adjacencies, and use the shortest graph path in place of physical key distance.

We also note (from our mature vantage point) that this algorithm is not restricted to strings, but can be performed on any free monoid. This includes the example we mentioned earlier of lists. So we could generalize the algorithm to operate on any piece of data which has such an identity element and binary operation, and satisfies the freedom condition. My knowledge of Python is still somewhat limited, but the method for achieving this generalization comes in many names in many languages: in Java it’s interfaces, in C++ it’s templating, in Haskell it’s a typeclass. In Python, there is a fancy thing called duck-typing, and we leave this for our next Python primer.

Next time, we’ll crack open some data files with actual English dictionaries in them, and see what we can do about solving interesting problems with them. Until then!

# A Taste of Racket

## or, How I Learned to Love Functional Programming

We recognize that not every reader has an appreciation for functional programming. Yet here on this blog, we’ve done most of our work in languages teeming with functional paradigms. It’s time for us to take a stand and shout from the digital mountaintops, “I love functional programming!” In fact, functional programming was part of this author’s inspiration for Math ∩ Programming.

And so, to help the reader discover the joys of functional programming, we present an introduction to programming in Racket, with a focus on why functional programming is amazing, and a functional solution to a problem on Project Euler. As usual, the entire source code for the examples in this post is available on this blog’s Github page.

## Lists

Lists are the datatype of choice for functional programmers. In particular, a list is either the empty list or a pair of objects:

list = empty
| (x list)

Here () denotes a pair, the first thing in the pair, “x”, is any object, and the second thing is another list. In Racket, all lists have this form, and they are constructed with a special function called “cons”. For instance, the following program outputs the list containing 7 and 8, in that order.

(cons 7 (cons 8 empty))

The reader will soon get used to the syntax: every function application looks like (function arguments…), including the arithmetic operators.

This paradigm is familiar; in imperative (“normal”) programming, this is called a linked list, and is generally perceived as slower than lists based on array structures. We will see why shortly, but in general this is only true for some uncommon operations.

Of course, Racket has a shorthand for constructing lists, which doesn’t require one to write “cons” a hundred times:

(list 7 9)

gives us the same list as before, without the nested parentheses. Now, if we wanted to add a single element to the front of this list, “cons” is still the best tool for the job. If the variable “my-list” is bound to a list, we would call

(cons 6 my-list)

to add the new element. For lists of things which are just numbers, symbols, or strings , there is an additional shorthand, using the “quote” macro:

'(1 2 3 4 "hello" my-symbol!)

Note that Racket does not require such lists are homogeneous, and it automatically converts the proper types for us.

To access the elements of a list, we only need two functions: first and rest. The “first” function accepts a list and returns the first element in it, while “rest” returns the tail of the list excluding the first element. This naturally fits with the “cons” structure of a list. For instance, if “my-list” is a variable containing (cons 8 (cons 9 empty)), then first and rest act as follows:

> (first my-list)
8
> (rest my-list)
'(9)
> (first (rest my-list))
9

In particular, we can access any element of a list with sufficiently many calls to first and rest. But for most problems this is unnecessary. We are about to discover far more elegant methods for working with lists. This is where functional programming truly shines.

## Double-List, Triple-List, and Sub-List

Suppose we want to take a list of numbers and double each number. If we just have what we know now about lists, we can write a function to do this. The general function definition looks like:

(define (function-name arg1 arg2 ...)
body-expression)

To be completely clear, the return value of a Racket function is whatever the body expression evaluates to, and we are allowed to recursively call the function. Indeed, this is the only way we will ever loop in Racket (although it has some looping constructs, we frown upon their use).

And so the definition for doubling a list is naturally:

(define (double-list my-list)
(if (empty? my-list) empty
(cons (* 2 (first my-list))
(double-list (rest my-list)))))

In words, we have two cases: if my-list is empty, then there is nothing to double, so we return a new empty list (well, all empty lists are equal, so it’s the same empty list). This is often referred to as the “base case.” If my-list is nonempty, we construct a new list by doubling the first element, and then recursively  doubling the remaining list. Eventually this algorithm will hit the end of the list, and ultimately it will return a new list with each element of my-list doubled.

Of course, the mathematicians will immediately recognize induction at work. If the program handles the base case and the induction step correctly, then it will correctly operate on a list of any length!

Indeed, we may test double-list:

> (double-list empty)
'()
> (double-list '(1 2 3 5))
'(2 4 6 10)

And we are confident that it works. Now say we wanted to instead triple the elements of the list. We could rewrite this function with but a single change:

(define (triple-list my-list)
(if (empty? my-list) empty
(cons (* 3 (first my-list))
(triple-list (rest my-list)))))

It’s painfully obvious that coding up two such functions is a waste of time. In fact, at 136 characters, I’m repeating more than 93% of the code (eight characters changing “doub” to “trip” and one character changing 2 to 3). What a waste! We should instead refactor this code to accept a new argument: the number we want to multiply by.

(define (multiply-list my-list n)
(if (empty? my-list) empty
(cons (* n (first my-list))
(multiply-list (rest my-list) n))))

This is much better, but now instead of multiplying the elements of our list by some fixed number, we want to subtract 7 from each element (arbitrary, I know, but we’re going somewhere). Now I have to write a whole new function to subtract things!

(define (sub-list my-list n)
(if (empty? my-list) empty
(cons (- (first my-list) n)
(sub-list (rest my-list) n))))

Of course, we have the insight to make it generic and accept any subtraction argument, but this is the problem we tried to avoid by writing multiply-list! We obviously need to step things up a notch.

## Map

In all of this work above, we’ve only been changing one thing: the operation applied to each element of the list. Let’s create a new function, which accepts as input a list and a function which operates on each element of the list. This special operation is called map, and it is only possible to make because Racket treats functions like any other kind of value (they can be passed as arguments to functions, and returned as values; functions are first class objects).

The implementation of map should look very familiar by now:

(define (map f my-list)
(if (empty? my-list) empty
(cons (f (first my-list))
(map f (rest my-list)))))

In particular, we may now define all of our old functions in terms of map! For instance,

(define (double x) (* 2 x))
(define (double-list2 my-list) (map double my-list))

Or, even better, we may take advantage of Racket’s anonymous functions, which are also called “lambda expressions,” to implement the body of double-list in a single line:

(map (λ (x) (* 2 x)) my-list)

Here the λ character has special binding in the Dr. Racket programming environment (Alt-\), and one can alternatively write the string “lambda” in its place.

With map we have opened quite a large door. Given any function at all, we may extend that function to operate on a list of values using map. Consider the imperative equivalent:

for (i = 0; i < list.length; i++):
list.set(i, f(list.get(i)))

Every time we want to loop over a list, we have to deal with all of this indexing nonsense (not to mention the extra code needed to make a new list if we don’t want to mutate the existing list, and that iterator shortcuts prohibit mutation). And the Racket haters will have to concede, the imperative method has just as many parentheses 🙂

Of course, we must note that map always creates a new list. In fact, in languages that are so-called “purely” functional, it is impossible to change the value of a variable (i.e., there is no such thing as mutation). The advantages and disadvantages of this approach are beyond the scope of this post, but we will likely cover them in the future.

## Fold and Filter

Of course, map is just one kind of loop we might want. For instance, what if we wanted to sum up all of the numbers in a list, or pick out the positive ones? This is where fold and filter come into play.

Fold is a function which reduces a list to a single value. To do this, it accepts a list, an initial value, and a function which accepts precisely two arguments and outputs a single value. It then uses this to combine the elements of a list. It’s implementation is not that different from map:

(define (fold f val my-list)
(if (empty? my-list) val
(fold f
(f val (first my-list))
(rest my-list))))

Here the base case is to simply return “val”, while the induction step is to combine “val” with the first element of the list using “f”, and then recursively apply fold to the remainder of the list. Now we may implement our desired summing function as

(define (sum my-list) (fold + 0 my-list))

And similarly, make a multiplying function:

(define (prod my-list) (fold * 1 my-list))

Notice now that we’ve extracted the essential pieces of the problem: what operation to apply, and what the base value is. In fact, this is the only relevant information to the summing and multiplying functions. In other words, we couldn’t possibly make our code any simpler!

Finally, filter is a function which selects specific values from a list. It accepts a selection function, which accepts one argument and returns true or false, and the list to select from. It’s implementation is again straightforward induction:

(define (filter select? my-list)
(if (empty? my-list) empty
(let ([elt (first my-list)]
[the-rest (rest my-list)])
(if (select? elt)
(cons elt (filter select? the-rest))
(filter select? the-rest)))))

To avoid superfluous calls to “first” and “rest”, we use Racket’s “let” form to bind some variables. Logically, the base case is again to return an empty list, while the inductive step depends on the result of applying “select?” to the first element in our list. If the result is true, we include it in the resulting list, recursively calling filter to look for other elements. If the result is false, we simply skip it, recursively calling filter to continue our search.

Now, to find the positive numbers in a list, we may simply use filter:

(filter (λ (x) (&gt; x 0)) my-list)

Again, the only relevant pieces of this algorithm are the selection function and the list to search through.

There is one important variant of fold, in particular, the function we’re using to fold may depend on the order in which it’s applied to elements of the list. We might require that folding begin at the end of a list instead of the beginning. Fold functions are usually distinguished as left- or right-folds. Of course, Racket has map, fold, and filter built in, but the fold functions are renamed “foldl” and “foldr”. We have implemented foldl, and we leave foldr as an exercise to the reader.

## A Brighter, More Productive World

Any loop we ever want to implement can be done with the appropriate calls to map, fold, and filter. We will illustrate this by solving a Project Euler problem, specifically problem 67. For those too lazy to click a link, the problem is to find the maximal sum of paths from the apex to the base of a triangular grid of numbers. For example, consider the following triangle:

   3
7 4
2 4 6
8 5 9 3

Here the maximal path is 3,7,4,9, whose sum is 23. In the problem, we are provided with a file containing a triangle with 100 rows ($2^{99}$ paths!) and are asked to find the maximal path.

First, we recognize a trick. Suppose that by travelling the optimal route in the triangle above, we arrived in the second to last row at the number 2. Then we would know precisely how to continue: simply choose the larger of the two next values. We may reduce this triangle to the following:

      3
7     4
10   13   15

where we replace the second-to-last row with the sum of the entries and the larger of the two possible subsequent steps. Now, performing the reduction again on this reduced triangle, we get

   3
20   19

And performing the reduction one last time, we arrive at our final, maximal answer of 23.

All we need to do now is translate this into a sequence of maps, folds, and filters.

First, we need to be able to select the “maximum of pairs” of a given row. To do this, we convert a row into a list of successive pairs. Considering the intended audience, this is a rather complicated fold operation, and certainly the hardest part of the whole problem. We will let the reader investigate the code to understand it.

;; row-&gt;pairs: list of numbers -&gt; list of successive pairs of numbers
(define (row-&gt;pairs row)
(if (equal? (length row) 1)
(list row)
(let ([first-pair (list (list (first row) (second row)))]
[remaining-row (rest (rest row))]
[fold-function (λ (so-far next)
(let ([prev-pair (first so-far)])
(cons (list (second prev-pair) next) so-far)))])
(reverse (fold fold-function first-pair remaining-row)))))

All we will say is that we need to change the base case so that it is the first pair we want, and then apply the fold to the remaining elements of the row. This turns a list like ‘(1 2 3 4) into ‘((1 2) (2 3) (3 4)).

Next, we need to be able to determine which of these pairs in a given row are maximal. This is a simple map, which we extend to work not just with pairs, but with lists of any size:

;; maxes: list of lists of numbers -&gt; list of maxes of each list
(define (maxes pairs)
(map (λ (lst) (apply max lst)) pairs))

Next, we combine the two operations into a “reduce” function:

;; reduce: combine maxes with row-&gt;pairs
(define (reduce row) (maxes (row-&gt;pairs row)))

Finally, we have the bulk of the algorithm. We need a helper “zip” function:

;; zip: list of lists -&gt; list
;; turn something like '((1 2 3) (5 6 7)) into '((1 5) (2 6) (3 7))
(define (zip lists)
(if (empty? (first lists)) empty
(cons (map first lists)
(zip (map rest lists))))) 

and the function which computes the maximal path, which is just a nice fold. The main bit of logic is highlighted.

;; max-path: list of rows -&gt; number
;; takes the upside-down triangle and returns the max path
(define (max-path triangle)
(fold (λ (cumulative-maxes new-row)
(reduce (map sum (zip (list new-row cumulative-maxes)))))
(make-list (length (first triangle)) 0)
triangle))

In particular, given the previously computed maxima, and a new row, we combine the two rows by adding the two lists together element-wise (mapping sum over a zipped list), and then we reduce the result. The initial value provided to fold is an appropriately-sized list of zeros, and the rest falls through.

With an extra bit of code to read in the big input file, we compute the answer to be 7273, and eat a cookie.

Of course, we split this problem up into much smaller pieces just to show how map and fold are used. Functions like zip are usually built in to languages (though I haven’t found an analogue in Racket through the docs), and the maxes function would likely not be separated from the rest. The point is: the code is short without sacrificing modularity, readability, or extensibility. In fact, most of the algorithm we came up with translates directly to code! If we were to try the same thing in an imperative language, we would likely store everything in an array with pointers floating around, and our heads would spin with index shenanigans. Yet none of that has anything to do with the actual algorithm!

And that is why functional programming is beautiful.

As usual, the entire source code for the examples in this post is available on this blog’s Github page.

Until next time!

Addendum: Consider, if you will, the following solutions from others who solved the same problem on Project Euler:

C/C++:

#define numlines 100

int main()
{

int** lines = new int*[numlines];
int linelen[numlines];

for(int i=0; i&lt;numlines; i++) linelen[i] = i+1;

ifstream in_triangle(&quot;triangle.txt&quot;);

for (int i=0;i&lt;numlines;i++)
{
lines[i] = new int[i+1];
linelen[i] = i+1;
for (int j=0; j&lt;i+1; j++)
{
in_triangle&gt;&gt;lines[i][j];
}
in_triangle.ignore(1,'\n');
}

int routes1[numlines];
int routes2[numlines];

for (int i=0;i&lt;numlines;i++) routes1[i] = lines[numlines-1][i];

//find the best way
for (int i=numlines-2;i&gt;=0;i--)
{
for(int j=0;j&lt;i+1;j++)
{
if(routes1[j] &gt; routes1[j+1])
{
routes2[j] = routes1[j] + lines[i][j];
} else
{
routes2[j] = routes1[j+1] + lines[i][j];
}
}
for (int i=0;i&lt;numlines;i++) routes1[i] = routes2[i];
}
cout&lt;&lt;&quot;the sum is: &quot;&lt;&lt;routes1[0]&lt;&lt;endl;
}

PHP:

&lt;?php
$cont = file_get_contents(&quot;triangle.txt&quot;);$lines = explode(&quot;\n&quot;,$cont);$bRow = explode(&quot; &quot;,$lines[count($lines)-1]);
for($i=count($lines)-1; $i&gt;0;$i--)
{
$tRow = explode(&quot; &quot;,$lines[$i-1]); for($j=0; $j&lt;count($tRow); $j++) { if($bRow[$j] &gt;$bRow[$j+1])$tRow[$j] +=$bRow[$j]; else$tRow[$j] +=$bRow[$j+1]; } if($i==1)
echo $tRow[0];$bRow = $tRow; } ?&gt; J (We admit to have no idea what is going in programs written in J): {. (+ 0: ,~ 2: &gt;./\ ])/ m Python: import sys l = [[0] + [int(x) for x in line.split()] + [0] for line in sys.stdin] for i in range(1, len(l)): for j in range(1, i+2): l[i][j] += max(l[i-1][j-1], l[i-1][j]) print max(l[-1]) Haskell:  module Main where import IO main = do triangle &lt;- openFile &quot;triangle.txt&quot; ReadMode &gt;&gt;= hGetContents print . foldr1 reduce . map ((map read) . words) . lines$ triangle
where reduce a b = zipWith (+) a (zipWith max b (tail b)) 

Ah, foldr! map! zip! Haskell is clearly functional 🙂 Now there is a lot more going on here (currying, function composition, IO monads) that is far beyond the scope of this post, and admittedly it could be written more clearly, but the algorithm is essentially the same as what we have.