# A Sample of Standard ML, the TreeSort Algorithm, and Monoids

In this post we will assume the reader has a passing familiarity with some of the basic concepts of functional programming (the map, fold, and filter functions). We introduce these topics in our Racket primer, but the average reader will understand the majority of this primer without expertise in functional programming.

## Preface: ML for Category Theory

A few of my readers have been asking for more posts about functional languages and algorithms written in functional languages. While I do have a personal appreciation for the Haskell programming language (and I plan to do a separate primer for it), I have wanted to explore category theory within the context of programming for quite a while now. From what I can tell, ML is a better choice than Haskell for this.

Part of the reason is that, while many Haskell enthusiasts claim it to be a direct implementation of category theory, Haskell actually tweaks category theoretic concepts in certain ways. I rest assured that the designers of Haskell (who are by assumption infinitely better at everything than I am) have very good reasons for doing this. But rather than sort through the details of the Haskell language specification to clarify the finer details, we would learn a lot more by implementing category theory by hand in a programming language that doesn’t have such concepts already.

And so we turn to ML.

ML, which stands for MetaLanguage, apparently has historical precedents for being a first in many things. One of these is an explicit recognition of parametric polymorphism, which is the idea that an operation can have the same functionality regardless of the types of the data involved; the types can, in effect, be considered variables. Another ground-breaking aspect of ML is an explicit type inference system. Similar to Haskell, an ML program will not run unless the compiler can directly prove that the program produces the correct types in every step of the computation.

Both of these features are benefits for the student of category theory. Most of our time in category theory will be spent working with very general assumptions on the capabilities of our data involved, and parametric polymorphism will be our main tool for describing what these assumptions are and for laying out function signatures.

As a side note, I’ve noticed through my ever-growing teaching experiences that one of the main things new programming students struggle with (specifically, after mastering the syntax and semantics of basic language constructs) is keeping their types straight. This is especially prominent in a language like Python (which is what I teach), where duck-typing is so convenient that it lulls the students into a false sense of security. Sooner as opposed to later they’ll add strings to numbers with the blind confidence that Python will simply get it. Around this time in their first semester of programming, I would estimate that type errors lie at the heart of 75% of the bugs my students face and fail to resolve before asking me for help. So one benefit of programming in ML for pedagogy is that it is literally impossible to make type errors. The second you try to run a program with bad types, the compiler points out what the expected type is and what the given (incorrect) type was. It takes a while to get used to type variables (and appeasing the type checker when you want to play fast and loose). But once you do you’ll find the only bugs that remain in your code are conceptual ones, which are of course much more rewarding and important bugs to fix.

So enough of this preamble. Let’s learn some ML!

## Warming Up: Basic Arithmetic and Conditionals

As one would expect, ML has variables and arithmetic which work in much the same way as other languages. Each variable declaration is prefixed by the word “val,” as below

val x = 7;
val y = 2;

This statements modify the global environment (the list of which variable names are associated to which values). Semicolons are required to terminate variable declarations at the global level. We can declare multiple variables in a single line using the “and” keyword

val x = 7 and y = 2;

As a precaution, “and” is only used in ML for syntactic conjunctions of variable/function declarations, and is only necessary when the two defined variable names are mutually defined in terms of each other (this can happen naturally for recursive function definitions). We will see in a moment that the logical and operation is denoted “andalso.”

We can also use pattern matching to bind these two variables in one line, much the same way as it might work in Python:

val (x,y) = (7,2);

We note that while ML does not require us to specify the type of a variable, the type is known and ever present under the surface. If we run the above code through the sml compiler (which after running the contents of a file opens a REPL to further evaluate commands), we see the following output

[opening vars.sml]
val x = 7 : int
val n = 2 : int

The environment is printed out to the user, and it displays that the two types are “int.”

Arithmetic is defined for integers, and the standard ones we will use are the expected +, -, *, and (not a slash, but) div. Here are some examples, and here is a list of all basic operations on ints. A few things to note: the unary negation operator is a tilde (~), and the semicolons are only used terminate statements in the REPL, which tells the compiler we’re ready for it to evaluate our code. Semicolons can also be used to place multiple statements on a single line of code. The “it” variable is a REPL construct which saves the most recent unbound expression evaluation.

- 3 + 6;
val it = 9 : int
- 6 div 3;
val it = 2 : int
- 2 * 9;
val it = 18 : int
- 2 - 9;
val it = ~7 : int
- ~9;
val it = ~9 : int

ML also has floating point arithmetic (in ML this type is called “real”), but treats it in a very prudish manner. Specifically (and this is a taste of the type checker doing its job too well), ML does not coerce types for you. If you want to multiply a real number and an integer, you have to first convert the int to a real and then multiply. An error will occur if you do not:

- val x = 4.0;
val x = 4.0 : real
- val y = 7;
val y = 7 : int
- x * y;
stdIn:5.1-5.6 Error: operator and operand don't agree [tycon mismatch]
operator domain: real * real
operand:         real * int
in expression:
x * y
- x * Real.fromInt(y);
val it = 28.0 : real

Here is a list of all operations on reals. We don’t anticipate using reals much, but it’s good to know that ML fervently separates them.

It seems odd that we’re talking so much about statements, because often enough we will be either binding function names (and tests) to the global environment, or restricting ourselves to local variable declarations. The latter has a slightly more complicated syntax, simply surrounding your variable declarations and evaluated code in a “let … in … end” expression. This will be a much more common construction for us.

let
val x = 7
val y = 9
in
(x + 2*y) div 3
end

The “in” expression is run with the combined variables from the ambient environment (the variables declared outside of the let) and those defined inside the let. The variables defined in the let leave scope after the “in” expression is evaluated, and the entire let expression as a whole evaluates to the result of evaluating the “in” expression. Clearly and example shows what is going on much more directly than words.

The last major basic expression form are the boolean expressions and operations. The type for booleans in ML is called “bool,” and the two possible values are “true,” and “false.” They have the usual unary and binary operations, but the names are a bit weird. Binary conjunction is called “andalso,” while binary disjunction is called “orelse.”

val a = true and b = false;
val c = (a andalso b) orelse ((not a) andalso (not b));

But aside from that, boolean expressions work largely as one would expect. There are the six standard numerical comparison functions, where testing for equality is given by a single equals sign (in most languages, comparison for equality is ==), and inequality is given by the diamond operator <>. The others are, as usual, <, <=, >, >=.

- 6 = 7;
val it = false : bool
- 6 = 6;
val it = true : bool
- 6 < 7;
val it = true : bool
- 7 <= 6;
val it = false : bool
- 6 <> 7;
val it = true : bool

ML also has the standard if statement, which has the following syntax, which is more or less the same as most languages:

- val x = if 6 < 7 then ~1 else 4;
val x = ~1 : int

ML gives the programmer more or less complete freedom with whitespace, so any of these expressions can be spread out across multiple lines if the writer desires.

val x = if 6 < 7
then
~1
else
4

This can sometimes be helpful when defining things inside of let expressions inside of function definitions (inside of other function definitions, inside of …).

So far the basics are easy, if perhaps syntactically different from most other languages we’re familiar with. So let’s move on to the true heart of ML and all functional programming languages: functions.

## Functions and cases, recursion

Now that we have basic types and conditions, we can start to define some simple functions. In the global environment, functions are defined the same way as values, using the word “fun” in the place of “val.” For instance, here is a function that adds 3 to a number.

fun add3(x) = x+3

The left hand side is the function signature, and the right hand side is the body expression. As in Racket, and distinct from most imperative languages, a function evaluates to whatever the body expression evaluates to. Calling functions has two possible syntaxes:

add3(5)
(add3 5)

In other words, if the function application is unambiguous, parentheses aren’t required. Otherwise, one can specify precedence using parentheses in either Racket (Lisp) style or in standard mathematical style.

The most significant difference between ML and most other programming languages, is that ML’s functions have case-checking. That is, we can specify what action is to be taken based on the argument, and these actions are completely disjoint in the function definition (no if statements are needed).

For instance, we could define an add3 function which nefariously does the wrong thing when the user inputs 7.

fun add3(7) = 2
| add3(x) = x+3

The vertical bar is read “or,” and the idea is that the possible cases for the function definition must be written in most-specific to least-specific order. For example, interchanging the orders of the add3 function cases gives the following error:

- fun add3(x) = x+3
stdIn:13.5-14.16 Error: match redundant
x => ...
-->   7 => ...

Functions can call themselves recursively, and this is the main way to implement loops in ML. For instance (and this is quite an inefficient example), I could define a function to check whether a number is even as follows.

fun even(0) = true
| even(n) = not(even(n-1))

Don’t cringe too visibly; we will see recursion used in less horrifying ways in a moment.

Functions with multiple arguments are similarly easy, but there are two semantic possibilities for how to define the arguments. The first, and simplest is what we would expect from a typical language: put commas in between the arguments.

fun add(x,y) = x+y

When one calls the add function, one is forced to supply both arguments immediately. This is usually how programs are written, but often times it can be convenient to only supply one argument, and defer the second argument until later.

If this sounds like black magic, you can thank mathematicians for it. The technique is called currying, and the idea stems from the lambda calculus, in which we can model all computation using just functions (with a single argument) as objects, and function application. Numbers, arithmetic, lists, all of these things are modeled in terms of functions and function calls; the amazing thing is that everything can be done with just these two tools. If readers are interested, we could do a post or two on the lambda calculus to see exactly how these techniques work; the fun part would be that we can actually write programs to prove the theorems.

Function currying is built-in to Standard ML, and to get it requires a minor change in syntax. Here is the add function rewritten in a curried style.

fun add(x)(y) = x+y

Now we can, for instance, define the add3 function in terms of add as follows:

val add3 = add(3)

And we can curry the second argument by defining a new function which defers the first argument appropriately.

fun add6(x) = add(x)(6)

Of course, in this example addition is commutative so which argument you pick is useless.

We should also note that we can define anonymous functions as values (for instance, in a let expression) using this syntax:

val f = (fn x => x+3)

The “fn x => x+3″ is just like a “lambda x: x+3″ expression in Python, or a “(lambda (x) (+ x 3))” in Racket. Note that one can also define functions using the “fun” syntax in a let expression, so this is truly only for use-once function arguments.

## Tuples, Lists, and Types

As we’ve discovered already, ML figures out the types of our expressions for us. That is, if we define the function “add” as above (in the REPL),

- fun add(x,y) = x+y;
val add = fn : int * int -> int

then ML is smart enough to know that “add” is to accept a list of two ints (we’ll get to what the asterisk means in a moment) and returns an int.

The curried version is similarly intuited:

- fun add(x)(y) = x+y;
val add = fn : int -> int -> int

The parentheses are implied here: int -> (int -> int). So that this is a function which accepts as input an int, and produces as output another function which accepts an int and returns an int.

But, what if we’d like to use “add” to add real numbers? ML simply won’t allow us to; it will complain that we’re providing the wrong types. In order to make things work, we can tell ML that the arguments are reals, and it will deduce that “+” here means addition on reals and not addition on ints. This is one awkward thing about ML; while the compiler is usually able to determine the most general possible type for our functions, it has no general type for elements of a field, and instead defaults to int whenever it encounters arithmetic operations. In any case, this is how to force a function to have a type signature involving reals:

- fun add(x:real, y:real) = x+y;
val add = fn : real * real -> real

If we’re going to talk about types, we need to know all of ML’s syntax for its types. Of course there are the basics (int, real, bool). Then there are function types: int -> int is the type of a function which accepts one int and returns an int.

We’ll see two new types in this section, and the first is the tuple. In ML, tuples are heterogeneous, so we can mix types in them. For instance,

- val tup = (true, 7, 4.4);
val tup = (true,7,4.4) : bool * int * real

Here the asterisk denotes the tuple type, and one familiar with set theory can think of a tuple as an element of the product of sets, in this case

$\displaystyle \left \{ \textup{true}, \textup{false} \right \} \times \mathbb{Z} \times \mathbb{R}$

Indeed, there is a distinct type for each possible kind of tuple. A tuple of three ints (int * int * int) is a distint type from a tuple of three booleans (bool * bool * bool). When we define a function that has multiple arguments using the comma syntax, we are really defining a function which accepts as input a single argument which is a tuple. This parallels exactly how functions on multiple arguments work in classical mathematics.

The second kind of compound type we’ll use quite often is the list. Lists are distinct from tuples in ML in that lists must be homogenous. So a list of integers (which has type “int list”) is different from a list of booleans.

The operations on lists are almost identical as in Haskell. To construct explicit lists use square brackets with comma-delimited elements. To construct them one piece at a time, use the :: list constructor operation. For those readers who haven’t had much experience with lists in functional programming: all lists are linked lists, and the :: operation is the operation of appending a single value to the beginning of a given list. Here are some examples.

- val L = [1,2,3];
val L = [1,2,3] : int list

- val L = 1::(2::(3::nil));
val L = [1,2,3] : int list

The “nil” expression is the empty list, as is the empty-square-brackets expression “[]”.

There is a third kind of compound type called the record, and it is analogous to a C struct, where each field is named. We will mention this in the future once we have a need for it.

The most important thing about lists and tuples in ML is that functions which operate on them don’t always have an obvious type signature. For instance, here is a function which takes in a tuple of two elements and returns the first element.

fun first(x,y) = x

What should the type of this function be? We want it to be able to work with any tuple, no matter the type. As it turns out, ML was one of the first programming language to allow this sort of construction to work. The formal name is parametric polymorphism. It means that a function can operate on many different kinds of types (because the actual computation is the same regardless of the types involved) and the full type signature can be deduced for each function call based on the parameters for that particular call.

The other kind of polymorphism is called ad-hoc polymorphism. Essentially this means that multiple (very different) operations can have the same name. For instance, addition of integers and addition of floating point numbers require two very different sets of instructions, but they both go under the name of +.

What ML does to make its type system understand parametric polymorphism is introduce so-called type variables. A type variable is any string prefixed by a single quote, e.g. ‘a, and they can represent any type. When ML encounters a function with an ambiguous type signature, it decides what the most general possible type is (which usually involves a lot of type variables), and then uses that type.

So to complete our example, the first function has the type

'a * 'b -> 'a

As a side note, the “first” function is in a sense the “canonical” operation that has this type signature. If nothing is known about the types, then no other action can happen besides the projection. There are some more interesting things to be said about such canonical operations (for instance, could we get away with not having to even define them at all).

The analogous version for lists is as follows. Note that in order to decompose a list into its first element and the tail list, we need to use pattern matching.

fun listFirst([x]) = x
| listFirst(head::tail) = head

And this function has the type signature ‘a list -> ‘a. As a slightly more complicated example (where we need recursion), we can write a function to test for list membership.

fun member(x, nil) = false
else member(x, tail)

If you run this program and see some interesting warning messages, see this StackOverflow question for a clarification.

## Defining New Types

The simplest way to define a new type is to just enumerate all possibilities. For instance, here is an enumerated datatype with four possibilities.

datatype maths = algebra | analysis | logic | computation

Then we can define functions which operate on those types using pattern matching.

fun awesome(algebra) = true
| awesome(analysis) = false
| awesome(logic) = false
| awesome(computation) = true

And this function has type maths -> bool (don’t take it too seriously :-)). We can also define data types whose constructors require arguments.

datatype language = functional of string*bool
| imperative of string
| other

Here we define a language to be functional or imperative. The functional type consists of a name and a boolean representing whether it is purely functional, while the imperative type just consists of a name. We can then construct these types by treating the type constructors as if they were functions.

val haskell = functional("Haskell", true) and
java = imperative("Java") and
prolog = other;

Perhaps more useful than this is to define types using type variables. A running example we will use for the remainder of this post is a binary tree of homogeneous elements at each node. Defining such types is easy: all one needs to do is place the appropriate type (with parentheses to clarify when the description of a type starts or ends) after the “of” keyword.

datatype 'a Tree = empty
| leaf of 'a
| node of (('a Tree) * 'a * ('a Tree))

We can create instances of an integer tree as expected:

val t2 = node(node(leaf(2), 3, leaf(4)), 6, leaf(8))

## The TreeSort Algorithm

We can define a host of useful operations on trees (of any type). For instance, below we compute the breadth of a tree (the total number of leaves), the depth of a tree (the maximal length of a path from the root to a leaf), and the ability to flatten a tree (traverse the tree in order and place all of the values into a list). These first two are a nice warm-up.

fun breadth(empty) = 0
| breadth(node(left, _, right)) = breadth(left) + breadth(right)

Here the underscore indicates a pattern match with a variable we won’t actually use. This is more space efficient for the compiler; it can avoid adding extra values to the current environment in a potentially deep recursion.

fun depth(empty) = 0
| depth(leaf(_)) = 1
| depth(node(left, _, right)) =
let
val (lDepth, rDepth) = (1 + depth(left), 1 + depth(right))
in
if lDepth > rDepth then lDepth else rDepth
end

This function should be self explanatory.

fun flatten(empty) = []
| flatten(leaf(x)) = [x]
| flatten(node(left, x, right)) =
flatten(left) @ (x :: flatten(right))

Here the @ symbol is list concatenation. This is not quite the most efficient way to do this (we are going to write a forthcoming post about tail-call optimization, and there we will see why), but it is certainly the clearest. In the final recursive call, we traverse the left subtree first, flattening it into a list in order. Then we flatten the right hand side, and put the current node’s element in between the two flattened parts.

Note that if our tree is ordered, then flatten will produce a strictly increasing list of the elements in the tree. For those readers unfamiliar with ordered binary trees, for all intents and purposes this is an “int Tree” and we can compare the values at different nodes. Then an ordered binary tree is a tree which satisfies the following property for each node: all of the values in the left child’s subtree are strictly smaller than the current node’s value, and all of the values in the right child’s subtree are greater than or equal to the current node’s value.

Indeed, we can use the flatten function as part of a simple algorithm to sort lists of numbers. First we insert the numbers in the unsorted list into a tree (in a way that preserves the ordered property at each step), and then we flatten the tree into a sorted list. This algorithm is called TreeSort, and the insert function is simple as well.

fun insert(x, empty) = leaf(x)
| insert(y, leaf(x)) = if x <= y
then node(empty, x, leaf(y))
else node(leaf(y), x, empty)
| insert(y, node(left, x, right)) =
if x <= y
then node(left, x, insert(y, right))
else node(insert(y, left), x, right)

If we’re at a nonempty node, then we just recursively insert into the appropriate subtree (taking care to create new interior nodes if we’re at a leaf). Note that we do not do any additional computation to ensure that the tree is balanced (that each node has approximately as many children in its left subtree as in its right). Doing so would digress from the point of this primer, but rest assured that the problem of keeping trees balanced has long been solved.

Now the process of calling insert on every element of a list is just a simple application of fold. ML does have the standard map, foldl, foldr, and filter functions, although apparently filter is not available in the standard library (one needs to reference the List module via List.filter).

In any case, foldl is written in the currying style and building the tree is a simple application of it. As we said, the full sorting algorithm is just the result of flattening the resulting tree with our in-order traversal.

fun sort(L) = flatten(foldl(insert)(empty)(L))

So there you have it! One of the simplest (efficient) sorting algorithms I can think of in about twelve lines of code.

## Free Monoid Homomorphisms: A More Advanced Example

Just to get a quick taste of what our series on category theory will entail, let’s write a program with a slightly more complicated type signature. The idea hinges on the idea that lists form a what’s called a free monoid. In particular,

Definition: monoid is a set $X$ equipped with an associative binary operation $\cdot: X \times X \to X$ and an identity element $1 \in X$ for which $x1 = 1x = x$ for all $x \in X$.

Those readers who have been following our series on group theory will recognize a monoid as a group with less structure (there is no guarantee of inverses for the $\cdot$ operation). The salient fact for this example is that the set of ML values of type $\textup{'a list}$ forms a monoid. The operation is list concatenation, and the identity element is the empty list. Call the empty list nil and the append operation @, as it is in ML.

More than that, $\textup{'a list}$ forms a free monoid, and the idea of freeness has multiple ways of realization. One sort of elementary way to understand freeness is that the only way to use the binary operation to get to the identity is to have the two summands both be the identity element. In terms of lists, the only way to concatenate two lists to get the empty list is to have both pieces be empty to begin with.

Another, more mature (more category-theoretical) way to think about freeness is to say is satisfies the following “universal” property. Call $A$ the set of values of type ‘a, and $[A]$ the set of values of type $\textup{'a list}$, and suppose that $(B, \cdot_B, 1_B)$ is the datum of an arbitrary monoid. The universal property says that if we are given a function $f: A \to B$, and we take the canonical map $g: A \to [A]$ which maps an element $a \in A$ to the single-entry list $[a] \in [A]$, then there is a unique way to extend $f$ to a monoid homomorphism $f^* : [A] \to B$ on lists, so that $f^*g = f$. We have mentioned monoid homomorphisms on this blog before in the context of string metrics, but briefly a monoid homomorphism respects the monoid structure in the sense that (for this example) $f^*(a \textup{ @ } b) = f^*(a) \cdot_B f^*(b)$ no matter what $a, b$ are.

This was quite a mouthful, but the data is often written in terms of a so-called “commutative diagram,” whose general definition we will defer to a future post. The diagram for this example looks like:

The dashed line says we are asserting the existence of $f^*$, and the symbol $\exists !$ says this function exists and is uniquely determined by $f, g$. The diagram “commutes” in the sense that traveling from $A$ to $B$ along $f$ gives you the same computational result as traveling by $g$ and then $f^*$. The reason for the word “universal” will become clear in future posts, but vaguely it’s because the set $[A]$ is a unique “starting place” in a special category.

If this talk is too mysterious, we can just go ahead and prove that $f^*$ exists by writing a program that computes the function transforming $f \mapsto f^*$. We call the function “listMonoidLift” because it “lifts” the function $f$ from just operating on $A$ to the status of a monoid homomorphism. Very regal, indeed.

Part of the beauty of this function is that a number of different list operations (list length, list sum, member tests, map, etc.), when viewed under this lens, all become special cases of this theorem! By thinking about things in terms of monoids, we write less code, and more importantly we recognize that these functions all have the same structural signature. Perhaps one can think of it like parametric polymorphism on steroids.

fun listMonoidLift(f:('a->'b), (combine:(('b * 'b) -> 'b), id:'b)) =
let
fun f'(nil) = id
in
f'
end

Here we specified the types of the input arguments to be completely clear what’s what. The first argument is our function $f$ as in the above diagram, and the second two arguments together form the data of our monoid (the set $B$ is implicitly the collection of types $'b$ determined at the time of the function call). Now let’s see how the list summation and list length functions can be written in terms of the listMonoidLift function.

fun plus(x, y) = x + y
val sum = listMonoidLift((fn x => x), (plus, 0))
val length = listMonoidLift((fn x => 1), (plus, 0))

The plus function on integers with zero as the identity is the monoid $B$ in both cases (and also happens to be $A$ by coincidence), but in the summation case the function $f$ is the identity and for length it is the constant $1$ function.

As a more interesting example, see how list membership is a lift.

fun member(x) = listMonoidLift((fn y => y = x),
((fn (a,b) => a orelse b), false))

Here the member function is curried; it has type ‘a -> ‘a list -> bool (though it’s a bit convoluted since listMonoidLift is what’s returning the ‘a list -> bool part of the type signature). Here the $B$ monoid is the monoid of boolean values, where the operation is logical “or” and the identity is false. It is a coincidence and a simple exercise to prove that $B$ is a free monoid as well.

Now the mapping $f(y)$ is the test to see if $y$ is the same object as $x$. The lift to the list monoid will compute the logical “or” of all evaluations of $f$ on the values.

Indeed, (although this author hates bringing up too many buzzwords where they aren’t entirely welcome) the monoid lifting operation we’ve just described is closely related to the MapReduce framework (without all the distributed computing parts). Part of the benefit of MapReduce is that the programmer need only define the Map() and Reduce() functions (the heart of the computation) and MapReduce does the rest. What this example shows is that defining the Map() function can be even simpler: one only needs define the function $f$, and Map() is computed as $f^*$ automatically. The Reduce() part is simply the definition of the target monoid $B$.

Just to drive this point home, we give the reader a special exercise: write map as a special case of listMonoidLift. The result (the map function) should have one of the two type signatures:

map : ('a -> 'b) * ('a list) -> 'b list
map : 'a -> 'b -> 'a list -> b' list

As a hint, the target monoid should also be a list monoid.

Part of why this author is hesitant to bring up contemporary software in discussing these ideas is because the ideas themselves are far from contemporary. Burstall and Landin’s, 1969 text Programs and Their Proofs (which this author would love to find a copy of) details this exact reduction and other versions of this idea in a more specific kind of structure called “free algebras.” So MapReduce (minus the computer programs and distributed systems) was a well-thought-out idea long before the internet or massively distributed systems were commonplace.

In any case, we’ll start the next post in this series right off with the mathematical concept of a category. We’ll start slow and detail many examples of categories that show up both in (elementary) mathematics and computing, and then move on to universal properties, functors, natural transformations, and more, implementing the ideas all along the way.

Until then!

# Word Segmentation, or Makingsenseofthis

## A First Look at Google’s N-Gram Corpus

In this post we will focus on the problem of finding the appropriate word boundaries in strings like “homebuiltairplanes”, as is common in web URLs like www.homebuiltairplanes.com. This is an interesting problem because humans do it so easily, but there is no obvious programmatic solution. We will begin this article by addressing the complexity of this problem, continue by implementing a simple model using a subset of Google’s n-gram corpus, and finish by describing our future plans to enhance the model. As usual, all of the code and data used in this post is available from this blog’s Github page.

## Word Segmentation

We just claimed word segmentation is a hard problem, but in fact the segmentation part is quite easy! We’ll give a quick overview of the segmentation algorithm which assumes that we can evaluate a segmentation for optimality.

First, we note that by an elementary combinatorial argument there are $2^{n-1}$ segmentations of a word with $n$ letters. To see this, imagine writing a segmentation of “homebuiltairplanes” with vertical bars separating the letters, as in “home | built | ai |  rpla | nes”. The maximum number of vertical bars we could place is one less than the number of letters, and every segmentation can be represented by describing which of the $n-1$ gaps contain vertical bars and which do not. We can hence count up all segmentations by counting up the number of ways to place the bars. A computer scientist should immediately recognize that these “bars” can represent digits in a binary number, and hence all binary numbers with $n-1$ digits correspond to valid segmentations, and these range from 0 to $2^{n-1} - 1$, giving $2^{n-1}$ total numbers.

One can also prove this by fixing the number $k$ of segments the word is broken into and counting all the segmentations for each $k$. We leave it as an exercise to the reader.

The key fact to glean from that counting problem is that enumerating all possible segmentations is unfeasible. The right way to approach this problem is to mimic a human’s analysis. This will result in a dynamic program, and readers who have followed along will recognize the approach from our post on the Levenshtein metric. In addition we have a Python primer which doubles as a primer on dynamic programming.

A human who looks at the word “homebuiltairplanes” performs a mental scanning of the word. First we check if “h” is a good word, and then move to “ho”, “hom”, and finally decide “home” is likely the right word. From there we split the word into “home” and “builtairplanes”, and segment the remainder appropriately. In addition, one does not consider any of the segmentations whose first word is “h”, and neither one which begins with “homebu,” and the reason is that these are not words (nor are they likely to mean anything even if they aren’t words; we will discuss this nuance soon). Hence, all of the $2^{15}$ segmentations that make the first word “h” can be disregarded, as can the $2^{11}$ whose first word is “homebu”.

However, following these directions too strictly (and programs are quite strict in following directions) will get us into a number of traps. The mathematical mind will come up with many examples for which this reasoning is ambiguous or flat-out faulty. Here are two such suspects. The first is “lovesnails”. If we prematurely pick the first word as “love”, then we will be forced to choose the segmentation “love snails”, when it might truly be “loves nails”. Regardless of which segmentation is more likely, we simply need to consider them both in whatever judgement system we come up with. The second example is “bbcamerica”, as in www.bbcamerica.com. In particular, if we expect to find dictionary words in our segmentation, we’ll be quite disappointed, because no word begins with “bb”. Our solution then is to break free of the dictionary, and to allow potentially odd segmentations, being confident that our criterion for judging segmentation optimality will successfully discard the bad ones.

Before we liberate ourselves from the confines of Oxford’s vocabulary, let’s implement the procedure that performs the segmentation, abstracting out the optimality criterion. We note that (as usual with a dynamic program) we will need to memoize our recursion, and one should refer to our post on the Levenshtein metric and the Python primer on dynamic programming for an explanation of the “@memoize” decorator.

First, we implement the “splitPairs” function, which accepts a string $s$ as input and returns a list containing all possible split pairs $(u,v)$ where $s = uv$. We achieve this by a simple list comprehension (gotta love list comprehensions!) combined with string slicing.

def splitPairs(word):
return [(word[:i+1], word[i+1:]) for i in range(len(word))]

Indeed, “word[a:b]” computes a substring of “word” including the indices from $a$ to $b-1$, where blank entries for $a$ and $b$ denote the beginning and end of the string, respectively. For example, on the input string “hello”, this function computes:

>>> splitPairs("hello")
[('h', 'ello'), ('he', 'llo'), ('hel', 'lo'),
('hell', 'o'), ('hello', '')]

Note that the last entry in this list is crucial, because we may not want to segment the input word at all, and in the following we assume that “splitPairs” returns all of our possible choices of action. Next we define the “segment” function, which computes the optimal segmentation of a given word. In particular, we assume there is a global function called “wordSeqFitness” which reliably computes the fitness of a given sequence of words, with respect to whether or not it’s probably the correct segmentation.

def segment(word):
if not word: return []
allSegmentations = [[first] + segment(rest)
for (first, rest) in splitPairs(word)]
return max(allSegmentations, key = wordSeqFitness)

In particular, we are working by induction on the length of a word. Assuming we know the optimal segmentations for all substrings not including the first letter, we can construct the best segmentation which includes the first letter. The first line is the base case (there is only one choice for the empty string), and the second line is the meat of the computation. Here, we look at all possible split pairs, including the one which considers the entire word as a good segmentation, and we find the maximum of those segmentations with respect to “wordSeqFitness”. By induction we know that “segment” returns the optimal segmentation on every call, since each “rest” variable contains a strictly smaller substring which does not include the first letter. Hence, we have covered all possible segmentations, and the algorithm is correct.

Again, note how simple this algorithm was. Call it Python’s syntactic sugar if you will, but the entire segmentation was a mere four lines of logic. In other words, the majority of our work will be in figuring out exactly how to judge the fitness of a segmented word. For that, we turn to Google.

The biggest obstacle in “word segmentation” turns out to be identifying what words are, or rather what strings are most likely to have meaning. If we stick to a naive definition of what a word is, i.e., anything found in the Oxford English Dictionary, we end up with a very weak program. It wouldn’t be able to tell that “bbc” should be a word, as in our above example, and further it wouldn’t be able to tell which words are more likely than others (the OED has some pretty obscure words in it, but no mention of which are more common).

The solution here is to create a probabilistic model of the English language. Enter Google. To date Google has provided two major data corpora: the first, released in 2006, contains the frequency counts of language data from over a trillion words from a random sample of the entire internet. They gathered all n-grams, sequences of $n$ tokens (a “token” was defined arbitrarily by the researchers at Google), and reported the counts of those which occurred above a certain threshold. Unfortunately this corpus is over a hundred gigabytes of uncompressed data (24GB compressed), and consequently it can’t be found online (you have to order it on 6 DVDs from UPenn instead).

The second corpus is more specifically limited to language data from scanned books. Again organized as n-grams, this data set has some advantages and disadvantages. In particular, it was scanned directly from books, so optical character recognition was used to determine what image segments correspond to words, and what words they are. Due to deterioration or printing errors in the scanned books (for older years, these cover all books printed), this results in a lot of misinterpreted tokens. On the other hand, all of these sources are proofread for mistakes. Considering the banality of the web, we can probably assume this corpus has a higher data quality. Furthermore, tokens that appear commonly on the web (abbreviations like “lol” and “LotR” come to mind) are unlikely to occur in a book. For better or worse, if we use the book data set, we probably won’t have an easy time segmenting words which use such terms.

In the future we plan to obtain and process through the book data set (unless we can find the web data set somewhere…), but for now we use a convenient alternative. A researcher and Stanford professor Peter Norvig modified and released a subset of Google’s web corpus. He made all of the entries case insensitive, and he removed any tokens not consisting entirely of alphabet letters. Our first investigation will use his 1-gram file, which contains the 333,333 most commonly used single words. Perhaps unsurprisingly, this covers more than 98% of all 1-grams found, and the entire corpus easily fits into memory at a mere 5MB. Even better, we will design our program so that it works with data in any form, so the data sets can be swapped out at will.

Each line in Norvig’s data file has the form $word \left \langle tab \right \rangle n$. The first few lines are:

the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
is	4705743816
on	3750423199
that	3400031103
by	3350048871
this	3228469771
with	3183110675
i	3086225277
you	2996181025

While the last few are (quite uselessly):

goolld	12711
goolh	12711
goolgee	12711
googook	12711
googllr	12711
googlal	12711
googgoo	12711
googgol	12711
goofel	12711
gooek	12711
gooddg	12711
gooblle	12711
gollgo	12711
golgw	12711

The important aspect here is that the words have associated counts, and we can use them to probabilistically compare the fitness of two segmentations of a given word. In particular, we need to define a probabilistic model that evaluates the likelihood of a given sequence of words being drawn from a theoretical distribution of all sequences of words.

## Naive Bayes, the Reckless Teenager of Probability Models

The simplest useful model we could create is one in which we simply take the probability of each word in the segmentation and multiply them together. This is Bayes’s Rule, with the added assumption that each event (the occurrence of a word) is independent of the others. In common parlance, this is the equivalent to saying that the probability of a sequence of words occurring is the probability that the first word occurs (anywhere in the sequence) AND the second word occurs AND the third word occurs, and so on. Obviously the independence assumption is very strong. For a good counterexample, the sequence “united states of america” is far more common than the sequence “states america of united”, even though in this model they have the same probability.

The simplicity of this model makes it easy to implement, and hence it comes up often in applications, and so it has a special name: naive Bayes. In essence, a piece of data has a number of features (here, the features are the words in the sequence), and we make the assumption that the features occur independently of one another. The idea has been applied to a cornucopia of classification problems, but it usually serves a stepping stone to more advanced models which don’t ignore the underlying structure of the data. It can also be an indicator of whether more advanced models are really needed, or a measure of problem complexity.

For word segmentation, however, it’s a sensible simplification. Indeed, it is quite unlikely that a sequence of words will show up in a different order (as the complexity of the words in the segmentation increases, it’s probably downright impossible), so our “united states of america” example doesn’t concern us. On the other hand, we still have the example “lovesnails”, which the naive Bayes model might not sensibly handle. For a more contrived example, consider the segmentations of “endear”. Clearly “end” and “ear” are common words, but they are very rarely written in that order. Much more common is the single word “endear”, and an optimal model would take this into account, as well as three, four, and five-word sequences in our segmentation.

For a general model not pertaining just to segmentation, we’d truly love it if we could take into account $n$-grams for any $n$, because indeed some (very long) texts include words which have high probability of being placed together, but are far apart. Take, for instance, a text about sports in the UK. It is quite unlikely to see the word “soccer”, and more likely to see the word “wimbledon,” if the word United Kingdom shows up anywhere in the text (or, say, words like “Manchester”). There are some models which try to account for this (for instance, see the work on the “sequence memoizer” done at Columbia). However, here we expect relatively few tokens, so once we get there, we can limit our concerns to 5-grams at most. And, as most mathematicians and programmers agree, the best solution is often the simplest one.

## Implementation

Our naive Bayes probability model will end up being a class in Python. Upon instantiating the class, we will read in the data file, organize the word counts, and construct a way to estimate the probability of a word occurring, given our smaller data set.

In particular, our class will inherit the functionality of a dictionary. The relevant line is

class OneGramDist(dict):

This allows us to treat any instance of the class as if it were a dictionary. We can further override the different methods of a dictionary, extending the functionality in any way we wish, but for this application we won’t need to do that.

The relevant methods we need to implement are an initialization, which reads the data in, and a “call” method (the method called when one treats the object as a function; this will be clear from the examples that follow), which returns the probability of a given word.

   def __init__(self):
self.gramCount = 0
for line in open('one-grams.txt'):
(word, count) = line[:-1].split('\t')
self[word] = int(count)
self.gramCount += self[word]

def __call__(self, word):
if word in self:
return float(self[word]) / self.gramCount
else:
return 1.0 / self.gramCount

In the init function, we take each line in the “one-grams.txt” file, which has the form discussed above, and extract the word and count from it. We then store this data into the “self” object (which, remember, is also a dictionary), and then tally up all of the counts in the variable “gramCount”. In the call function, we simply return the number of times the word was counted, divided by the total “gramCount”. If a word has not been seen before, we assume it has a count of 1, and return the appropriate probability. We also note that this decision will come back to bite us soon, and we will need to enhance our model to give a better guess on unknown probabilities.

To plug this in to our segmentation code from earlier, we instantiate the distribution and implement “wordSeqFitness()” as follows:

singleWordProb = OneGramDist()
def wordSeqFitness(words):
return functools.reduce(lambda x,y: x+y,
(math.log10(singleWordProb(w)) for w in words))

First, this requires us to import both the “math” and “functools” libraries. The “reduce” function is the same as the “fold” function (see our primer on Racket and functional programming), and it combines a list according to a given function. Here the function is an anonymous function, and the syntax is

lambda arg1, arg2, ..., argN: expression

Here our lambda is simply adding two numbers. The list we give simply computes the probabilities of each word, and takes their logarithms. This actually does compute what we want to compute, but in logarithmic coordinates. In other words, we use the simple observation that $\log(ab) = \log(a) + \log(b)$. The reason for this is that each single word probability is on the order of $10^{-10}$, and so taking a product of around forty words will give a probability of near $10^{-400}$, which is smaller than the smallest floating point number Python allows. Python would report a probability of zero for such a product. In order to remedy this, we use this logarithm trick, and we leave it as an exercise to the reader to compute how ridiculously small the numbers we can represent in this coordinate system are.

## Drawbacks, and Improvements

Now let us see how this model fares with a few easy inputs.

\$ python -i segment.py
>>> segment("hellothere")
['hello', 'there']

So far so good.

>>> segment("himynameisjeremy")
['himynameisjeremy']

What gives?! Let’s fiddle with this a bit:

>>> segment("hi")
['hi']
>>> segment("himy")
['hi', 'my']
>>> segment("himyname")
['hi', 'my', 'name']
>>> segment("himynameis")
['himynameis']
>>> wordSeqFitness(['himynameis'])
-11.76946906492656
>>> wordSeqFitness(['hi', 'my', 'name', 'is'])
-11.88328049583244

There we have it. Even though we recognize that the four words above are common, their combined probability is lower than the probability of a single long word! In fact, this is likely to happen for any long string. To fix this, we need to incorporate information about the frequency of words based on their length. In particular, assume we have an unknown word $w$. It is unlikely to be a word if it’s longer than, say, ten characters long. There are some well-known distributions describing the frequency of actual words by their length, and it roughly fits the curve:

$f = aL^bc^L; a = 0.16, b = 2.33, c = 0.5$

Unfortunately this doesn’t help us. We want to know among all strings of a given length what proportion of them are words. With that, we can better model the probability of an unseen word actually being a word.

Luckily, we aren’t helpless nontechnical lambs. We are programmers! And we have a huge dictionary of words from Google’s n-gram corpus sitting right here! A quick python script (after loading in the dictionary using our existing code above) gives us the necessary numbers:

>>> for i in range(1,15):
...    words = [x for x in singleWordProb if len(x) == i]
...    print(len(words)/(26.0 ** i))
...
1.0
1.0
0.738336367774
0.0681436224222
0.00336097435179
0.000158748771704
6.23981380309e-06
2.13339205291e-07
6.52858936946e-09
1.84887277585e-10
4.6420710809e-12
1.11224101901e-13
2.54478475236e-15
5.68594239398e-17

Looking at the exponents, we see it’s roughly exponential with a base of $1/10$ for each length after 2, and we verify this by looking at a log plot in Mathematica:

The linearity of the picture tells us quite immediately that it’s exponential. And so we update the unknown word probability estimation as follows:

   def __call__(self, word):
if word in self:
return float(self[word]) / self.gramCount
else:
return 1.0 / (self.gramCount * 10**(len(key) - 2))

And indeed, our model now fares much better:

>>> segment("homebuiltairplanes")
['homebuilt', 'airplanes']
>>> segment("bbcamerica")
['bbc', 'america']
>>> segment("nevermind")
['nevermind']
>>> segment("icanhascheezburger")
['i', 'can', 'has', 'cheez', 'burger']
>>> segment("lmaorofllolwtfpwned")
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
>>> segment("wheninthecourseofhumanevents")
['when', 'in', 'the', 'course', 'of', 'human', 'events']
>>> segment("themostmercifulthingintheworldithinkistheinabilityofthehumanmindtocorrelateallitscontentsweliveonaplacidislandofignoranceinthemidstofblackseasofinfinityanditwasnotmeantthatweshouldvoyagefar")
['the', 'most', 'merciful', 'thing', 'in', 'the', 'world', 'i', 'think', 'is', 'the', 'inability', 'of', 'the', 'human', 'mind', 'to', 'correlate', 'all', 'its', 'contents', 'we', 'live', 'on', 'a', 'placid', 'island', 'of', 'ignorance', 'in', 'the', 'midst', 'of', 'black', 'seas', 'of', 'infinity', 'and', 'it', 'was', 'not', 'meant', 'that', 'we', 'should', 'voyage', 'far']
>>> s = 'Diffusepanbronchiolitisdpbisaninflammatorylungdiseaseofunknowncauseitisasevereprogressiveformofbronchiolitisaninflammatoryconditionofthebronchiolessmallairpassagesinthelungsthetermdiffusesignifiesthat'
>>> segment(s)
['diffuse', 'pan', 'bronchiolitis', 'dpb', 'is', 'an', 'inflammatory', 'lung', 'disease', 'of', 'unknown', 'cause', 'it', 'is', 'a', 'severe', 'progressive', 'form', 'of', 'bronchiolitis', 'an', 'inflammatory', 'condition', 'of', 'the', 'bronchioles', 'small', 'air', 'passages', 'in', 'the', 'lungs', 'the', 'term', 'diffuse', 'signifies', 'that']

Brilliant! The last bit is from the wikipedia page on diffuse panbronchiolitis, and it only mistakes the “pan” part of the admittedly obscure technical term. However, we are much more impressed by how our model embraces internet slang :). We can further verify intended behavior by deliberately misspelling a long word to see how the model fares:

>>> segment("antidisestablishmentarianism")
['antidisestablishmentarianism']
>>> segment("antidisestablishmentarianasm")
['anti', 'disestablishment', 'ariana', 'sm']

That second segmentation is certainly more likely than the original misspelled word (well, if we assume that no words are misspelled before segmentation).

Perhaps the most beautiful thing here is that this entire program is independent of the data used. If we wanted to instead write a program to segment, say, German words, all we’d need is a data file with the German counts (which Google also provides with its book corpus, along with Russian, Chinese, French, Spanish, and Hebrew). So we’ve written a much more useful program than we originally intended. Now compare this to the idea of a program which hard codes rules for a particular language, which was common practice until the 1980’s. Of course, it’s obvious now how ugly that method is, but apparently it’s still sometimes used to augment statistical methods for language-specific tasks.

Of course, we still have one flaw: the data is sloppy. For instance, the segmentation of “helloworld” is just “helloworld.” It turns out that this token appears on the internet in various forms, and commonly enough to outweigh the product of “hello” and “world” alone. Unfortunately, we can’t fix this by fiddling with the data we already have. Instead, we would need to extend the model to look at frequency counts of sequences of words (here, sequences of length 2). Google provides sequence counts of up to length 5, but they quickly grow far too large to fit in memory. One possible solution, which we postpone for a future post, would be to set up a database containing all 1-gram and 2-gram data, and therein bypass the need to store a big dictionary in memory. Indeed, then we could avoid truncating the 1-gram data as we did in this post.

## The Next Steps

Next time, we will look at another application of our truncated corpus: cryptanalysis. In particular, we will make an effort to break substitution ciphers via a local search method. In the not so near future, we also plan to investigate some other aspects of information representation. One idea which seems promising is to model a document as a point in some high dimensional space (perhaps each dimension corresponds to a count of a particular 1-gram), and then use our familiar friends from geometry to compare document similarity, filter through information, and determining the topic of a document via clustering.  The idea is called the vector space model for information retrieval. In addition, we can use the corpus along with our friendly Levenshtein metric to implement a spell-checker. Finally, we could try searching for phonetically similar words using a Levenshteinish metric on letter $n$-grams (perhaps $n$ is between 2 and 4).

Until next time!

P.S., in this model, “love snails” is chosen as the best segmentation over “loves nails.” Oh the interpretations…

# A Spoonful of Python (and Dynamic Programming)

This primer is a third look at Python, and is admittedly selective in which features we investigate (for instance, we don’t use classes, as in our second primer on random psychedelic images). We do assume some familiarity with the syntax and basic concepts of the language. For a first primer on Python, see A Dash of Python. We’ll investigate some of Python’s useful built-in types, including lists, tuples, and dictionaries, and we use them to computing Fibonacci numbers and “optimal” coin change.

## Fibonacci Numbers (the Wrong Way)

There’s an interesting disconnect between the mathematical descriptions of things and a useful programmatic implementation. Take for instance, the Fibonacci numbers $f_n$.

$f_1 = f_2 = 1$
$f_n = f_{n-1} + f_{n-2}$ for $n > 2$

A direct Python implementation of this definition is essentially useless. Here it is:

def fib(n):
if n <= 2:
return 1
else:
return fib(n-1) + fib(n-2)

Recalling our first Python primer, we recognize that this is a very different kind of “for” loop. Indeed, this is a recursive loop, which achieves the looping idea by having a function call itself with different arguments. For more examples of this (albeit, in a different language), see our primer A Taste of Racket.

In one mindset that will do us good later in the post, we are simplifying the problem of computing large Fibonacci numbers by breaking it up into a number of sub-problems. The sub-problems require us to compute smaller and smaller Fibonacci numbers, and we build up the sub-problems in some way to arrive at a final solution. This is the essence of dynamic programming, and we’ll see a more complicated example shortly.

Unfortunately, nobody in their right mind would use this piece of code. Trying to compute even $f_{100}$, this algorithm seems to never terminate! Some might throw their hands in the air and claim recursion is totally useless. Indeed, before we get to a working recursive solution (which this author considers more elegant), we will take the indignant view and look for a solution that is totally recursionless.

## Listomaina

Python has two built-in types which are list-like. The first is called a tuple, and the second is called a list. They function almost exactly the same way, but the first is immutable (that means you can’t change it after it’s created). We covered plain ‘ol lists in our first Python primer, but here are a few syntactic reminders, and show off some new list methods.

myList = ['a', 2, "gamma", ["wtf another list!", 5]]
myList[-1] = "replacing the list by a string"
myList.append(myList[2:4])
secondList = [2*i for i in range(10)]

As in the second line, we can use negative indices to access the elements in reverse order. In the third, we can access slices of the list using the colon notation. And in the last line, we have the generally awesome and useful list comprehensions. This follows more or less closely with the usual notation for constructing lists (of course, using Pythonic syntax). However, one must be careful, because each “for” in a list comprehension literally corresponds to an actual for loop. In other words, the following two bits of code are equivalent:

myList = [x*y*z for x in range(100) for y in range(100)
for z in range(100)]
myList = []
for x in range(100):
for y in range(100):
for z in range(100):
myList.append(x*y*z)

On the other hand, tuples have almost precisely the same syntax, except they are constructed with parentheses, and tuple comprehensions have an extra bit of syntax. Here are the same examples as above, except with tuples.

myTuple = ('a', 2, "gamma", ("wtf another tuple!", 5))
myTuple[-1] = 2 # not allowed to assign!
myTuple.append(myTuple[2:4]) # not allowed to append!
secondTuple = tuple(2*i for i in range(10))

The middle two lines are invalid. Tuples support neither assignment nor appending, and any sort of alteration of a tuple is forbidden. All of the fuss comes from the fact that when you pass a list as a function argument, the function can change the contents of the list. If you’re on a large software team and you don’t look at what the function does (or use some function whose code you can’t view), then you might not not anticipate this. Converting a list to a tuple before passing it is a safety measure.

Indeed, the conversion is in the last line but we could have first create a list, and then converted it to a tuple with “tuple()”.

With lists in our pocket, we can come up with a way to compute Fibonacci numbers quickly: we just save the old computations in a list (and do away with all that recursion nonsense).

def fib(n):
fibValues = [0,1]
for i in range(2,n+1):
fibValues.append(fibValues[i-1] + fibValues[i-2])
return fibValues[n]

One difference here is that lists index staring at zero, so we have to go up to $n+1$. In fact, some authors set $f_0 = 0$, so this works out just fine. In fact, we can even compute such huge Fibonacci numbers as $f_{100,000}$, and it only takes about four seconds. Compare this to the infinitude of time it took our naive algorithm to compute $f_{100}$ and you’ll see why some forsake recursion so quickly.

Before we go on to a more general discussion and a more interesting problem, we note that trick to compute Fibonacci numbers with a single recursive call. We leave this as an exercise to the reader (hint: instead of the argument being a single number $n$, pass a tuple containing two useful values). Of course, this method can just as well be done with a loop instead of recursion. The difference is that there is no growing list as we go along, so we say this algorithm takes a constant amount of space (or just constant space). And, for completeness, depending on the language certain kinds of recursion do “take up space” in a sense, and Python happens to be a language that does that. For more on this, see tail call optimization.

## Dynamic Programming

The feat we just accomplished in computing Fibonacci numbers quickly does generalize to more interesting problems and much harder problems. So hard, in fact, that the method has its own name: dynamic programming. It’s hard to give a precise (and concise) definition for when dynamic programming applies. This author likes to think of it as “the method you need when it’s easy to phrase a problem using multiple branches of recursion, but it ends up taking forever since you compute the same old crap way too many times.” According to Wikipedia, this means we have “overlapping sub-problems” which are just slightly smaller than the goal. In the case of Fibonacci numbers, this was that we could compute $f_n$ from $f_{n-1}$ and $f_{n-2}$, but their computations in turn had overlapping sub-computations (both use the value of $f_{n-3}$).

We avoided the problem of computing stuff multiple times by storing our necessary values in a list, and then building the final solution from the bottom up. This turns out to work for a plethora of problems, and we presently turn to one with some real world applications.

## Coin Change

Most readers will have made change for a certain amount of money using a fixed number of coins, and one can ask, “what is the smallest number of coins I can use to make exact change?” The method of picking the largest coins first (and only taking smaller coins when you need to) happens to give the optimal solution in many cases (U.S. coins are one example). However, with an unusual set of coins, this is not always true. For instance, the best way to make change for thirty cents if you have quarters, dimes, and pennies, would be to use three dimes, as opposed to a quarter and five pennies.

This example shows us that the so-called “greedy” solution to this problem doesn’t work. So let’s try a dynamic algorithm. As with Fibonacci numbers, we need to compute some sub-problems and build off of them to get an optimal solution for the whole problem.

In particular, denote our coin values $v_1, \dots, v_k$ and the amount of change we need to make $n$. Let us further force $v_1 = 1$, so that there is guaranteed to be a solution. Our sub-problems are to find the minimal number of coins needed to make change if we only use the first $i$ coin values, and we make change for $j$ cents. We will eventually store these values in a two-dimensional array, and call it $\textup{minCoins}[i,j]$. In other words, this is a two-dimensional dynamic program, because we have sub-problems that depend on two parameters.

The base cases are easy: how many coins do I need to make change for zero cents? Zero! And how many pennies do I need to make $j$ cents? Exactly $j$. Now, suppose we know $\textup{minCoins}[i-x,j]$ and $\textup{minCoins}[i,j-y]$ for all relevant $x, y$. To determine $\textup{minCoins}[i,j]$, we have two choices. First, we can use a coin of value $v_i$, and add 1 to the array entry $\textup{minCoins}[i, j - v_i]$, which contains the best solution if we have $j - v_i$ cents. Second, we can ignore the fact that we have coins of value $v_i$ available, and we can use the solution $\textup{minCoins}[i-1,j]$, which gives us the same amount of money either.

In words, the algorithm says we can suppose our last coin is the newly introduced coin, and if that doesn’t improve on our best solution yet (above, $\textup{minCoins}[i,j-v_i]$) then we don’t use the new coin, and stick with our best solution whose last coin is not the new one (above, $\textup{minCoins}[i-1,j]$).

Let’s implement this idea:

def coinChange(centsNeeded, coinValues):
minCoins = [[0 for j in range(centsNeeded + 1)]
for i in range(len(coinValues))]
minCoins[0] = range(centsNeeded + 1)

for i in range(1,len(coinValues)):
for j in range(0, centsNeeded + 1):
if j < coinValues[i]:
minCoins[i][j] = minCoins[i-1][j]
else:
minCoins[i][j] = min(minCoins[i-1][j],
1 + minCoins[i][j-coinValues[i]])

return minCoins[-1][-1]

The first two lines of the function initialize the array with zeros, and fill the first row with the relevant penny values. The nested for loop runs down each row, updating the coin values as described above. The only difference is that we check to see if the coin value is too large to allow us to choose it (j < coinValues[i], or equivalently j – coinValues < 0). In that case, we default to not using that coin, and there is no other way to go. Finally, at the end we return the last entry in the last row of the matrix.

We encourage the reader to run this code on lots of examples, and improve upon it. For instance, we still assume that the first entry in coinValues is 1.

## Memoized Recursion

Essentially all of dynamic programming boils down to remembering the solutions to sub-problems. One might ask: do we really need all of this array indexing junk to do that? Isn’t there a way that hides all of those details so we can focus on the core algorithm? Of course, the answer to that question is yes, but it requires some more knowledge about Python’s built in data types.

In the fill justify problem above, we uniquely identified a sub-problem using two integers, and indexing the rows and columns of a two-dimensional array (a list of lists). However, the fact that it was a two-dimensional array is wholly irrelevant. We could use a one-dimensional list, as long as we had some way to uniquely identify a sub-problem via a key we have access to. The official name for such a look-up table is a hash table.

In most standard computer science courses, this is the point at which we might delve off into a long discussion about hash tables (hashing functions, collision strategies, resizing methods, etc.). However for the sake of brevity we recognize that most languages, including Python, have pretty good hash tables built in to them. The details are optimized for our benefit. In Python, they are called dictionaries, and they are just as easy to use as lists.

dict = {}
dict[1] = 1
dict[2] = 1
dict[3] = 2
dict[(1,2,3,4)] = "weird"
dict["odd"] = 876

if 2 in dict:
print("2 is a key!")
print("its value is" + str(dict[2]))

dict2 = {1:1, 2:1, 3:2} # key:value pairs

The curly-brace notation is reserved for dictionaries (almost), and it comes with a number of useful functions. As expected, “in” tests for key membership, but there are also a number of useful functions for extracting other kinds of information. We won’t need any of them here, however. Finally, the last line shows a quick way of creating new tables where you know the values ahead of time.

We can use a dictionary to save old computations in a function call, replacing the egregious extra computations by lightning fast dictionary look-ups. For our original Fibonacci problem, we can do this as follows:

fibTable = {1:1, 2:1}
def fib(n):
if n <= 2:
return 1
if n in fibTable:
return fibTable[n]
else:
fibTable[n] = fib(n-1) + fib(n-2)
return fibTable[n]

We simply check to see if the key in question has already been computed, and if so, we just look the value up in the dictionary. And now we can compute large values of $f_n$ quickly! (Unfortunately, due to Python’s limit on recursion depth, we still can’t go as large as the non-recursive solution, but that is a different issue altogether). However we can do better in terms of removing the gunk from our algorithm. Remember, we want to completely hide the fact that we’re using dictionaries. This brings us to a very neat Python feature called decorators.

## Decorators

A decorator is a way to hide pre- or post-processing to a function. For example, say we had a function which returns some text:

def getText():
return "Here is some text!"

And we want to take this function and for whatever text it might return, we want to display it on a webpage, so we need to surround it with the HTML paragraph tags <p></p>. We might even have a hundred such functions which all return text, so we want a way to get all of them at once without typing the same thing over and over again. This problem is just begging for a decorator.

A decorator accepts a function $f$ as input, and returns a function which potentially does something else, but perhaps uses $f$ in an interesting way. Here is the paragraph example:

def paragraphize(inputFunction):
def newFunction():
return "<p>" + inputFunction() + "</p>"
return newFunction

@paragraphize
def getText():
return "Here is some text!"

“paragraphize” is the decorator here, and this function must accept a function as input, and return a function as output. The output function can essentially do whatever it wants, and it will replace the input function. Indeed, the macro “@paragraphize” precedes whatever function we wish to decorate, and it replaces all calls to “getText()” by calls to “newFunction(getText)”. In other words, it is shorthand for the following statement

getText = paragraphize(getText)

Moreover, the function that paragraphize outputs must accept the same number of arguments as the input function accepts. This is to be expected, and the way we signify multiple arguments in Python is with the * notation. Here is an example:

def printArgsFirst(f):
def newFunction(*args):
print(args)
return f(*args)

return newFunction

@printArgsFirst
def randomFunction(x, y, z):
return x + y + z

The “*args” notation above represents multiple comma-separated values, and when we refer to “args” without the star, it is simply a tuple containing the argument values. If we include the *, as we do in the call to f(*args), it unpacks the values from the tuple and inserts them into the call as if we had called f(x,y,z), as opposed to f((x,y,z)), which calls f with a single argument (a tuple).

Applying this to our memoization problem, we arrive at the following decorator.

def memoize(f):
cache = {}

def memoizedFunction(*args):
if args not in cache:
cache[args] = f(*args)
return cache[args]

memoizedFunction.cache = cache
return memoizedFunction

We have a saved cache, and update it whenever we encounter a new call to f(). The second to last line simply allows us to access the cache from other parts of the code by attaching it to memoizedFunction as an attribute (what attributes are requires another discussion about objects, but essentially we’re making the dictionary a piece of data that is part of memoizedFunction).

And now we come full circle, back to our original (easy to read) implementation of Fibonacci numbers, with one tiny modification:

@memoize
def fib(n):
if n <= 2:
return 1
else:
return fib(n-1) + fib(n-2)

Again, it is unfortunate that we still cannot compute ridiculously huge values of $f_n$, but one has to admit this is the cleanest way to write the algorithm. We encourage the reader re-solve the fill justify problem using this memoized recursion (and make it easy to read!).

For the adventurous readers, here is a list of interesting problems that are solved via dynamic programming. One common feature that requires it is the justify alignment in word processors (this author has a messy Java implementation of it: 300 lines not including tests! I admit I was a boyish, reckless programmer). All dynamic programs follow the same basic pattern, so with the tools we have now one should be able to go implement them all!

Until next time!

# Random (Psychedelic) Art

## And a Pinch of Python

Next semester I am a lab TA for an introductory programming course, and it’s taught in Python. My Python experience has a number of gaps in it, so we’ll have the opportunity for a few more Python primers, and small exercises to go along with it. This time, we’ll be investigating the basics of objects and classes, and have some fun with image construction using the Python Imaging Library. Disappointingly, the folks who maintain the PIL are slow to update it for any relatively recent version of Python (it’s been a few years since 3.x, honestly!), so this post requires one use Python 2.x (we’re using 2.7). As usual, the full source code for this post is available on this blog’s Github page, and we encourage the reader to follow along and create his own randomized pieces of art! Finally, we include a gallery of generated pictures at the end of this post. Enjoy!

## How to Construct the Images

An image is a two-dimensional grid of pixels, and each pixel is a tiny dot of color displayed on the screen. In a computer, one represents each pixel as a triple of numbers $(r,g,b)$, where $r$ represents the red content, $g$ the green content, and $b$ the blue content. Each of these is a nonnegative integer between 0 and 255. Note that this gives us a total of $256^3 = 2^{24}$ distinct colors, which is nearly 17 million. Some estimates of how much color the eye can see range as high as 10 million (depending on the definition of color) but usually stick around 2.4 million, so it’s generally agreed that we don’t need more.

The general idea behind our random psychedelic art is that we will generate three randomized functions $(f,g,h)$ each with domain and codomain $[-1,1] \times [-1,1]$, and at each pixel $(x,y)$ we will determine the color at that pixel by the triple $(f(x,y), g(x,y), h(x,y))$. This will require some translation between pixel coordinates, but we’ll get to that soon enough. As an example, if our colors are defined by the functions $(\sin(\pi x), \cos(\pi xy), \sin(\pi y))$, then the resulting image is:

We use the extra factor of $\pi$ because without it the oscillation is just too slow, and the resulting picture is decidedly boring. Of course, the goal is to randomly generate such functions, so we should pick a few functions on $[-1,1]$ and nest them appropriately. The first which come to mind are $\sin(\pi \cdot -), \cos(\pi \cdot -),$ and simple multiplication. With these, we can create such convoluted functions like

$\sin(\pi x \cos(\pi xy \sin(\pi (\cos (\pi xy)))))$

We could randomly generate these functions two ways, but both require randomness, so let’s familiarize ourselves with the capabilities of Python’s random library.

## Random Numbers

Pseudorandom number generators are a fascinating topic in number theory, and one of these days we plan to cover it on this blog. Until then, we will simply note the basics. First, contemporary computers can not generate random numbers. Everything on a computer is deterministic, meaning that if one completely determines a situation in a computer, the following action will always be the same. With the complexity of modern operating systems (and the aggravating nuances of individual systems), some might facetiously disagree.

For an entire computer the “determined situation” can be as drastic as choosing every single bit in memory and the hard drive. In a pseudorandom number generator the “determined situation” is a single number called a seed. This initializes the random number generator, which then proceeds to compute a sequence of bits via some complicated arithmetic. The point is that one may choose the seed, and choosing the same seed twice will result in the same sequence of “randomly” generated numbers. The default seed (which is what one uses when one is not testing for correctness) is usually some sort of time-stamp which is guaranteed to never repeat. Flaws in random number generator design (hubris, off-by-one errors, and even using time-stamps!) has allowed humans to take advantage of people who try to rely on random number generators. The interested reader will find a detailed account of how a group of software engineers wrote a program to cheat at online poker, simply by reverse-engineering the random number generator used to shuffle the deck.

In any event, Python makes generating random numbers quite easy:

import random

random.seed()
print(random.random())
print(random.choice(["clubs", "hearts", "diamonds", "spades"]))

We import the random library, we seed it with the default seed, we print out a random number in $(0,1)$, and then we randomly pick one element from a list. For a full list of the functions in Python’s random library, see the documentation. As it turns out, we will only need the choice() function.

## Representing Mathematical Expressions

One neat way to represent a mathematical function is via…a function! In other words, just like Racket and Mathematica and a whole host of other languages, Python functions are first-class objects, meaning they can be passed around like variables. (Indeed, they are objects in another sense, but we will get to that later). Further, Python has support for anonymous functions, or lambda expressions, which work as follows:

>>> print((lambda x: x + 1)(4))
5

So one might conceivably randomly construct a mathematical expression by nesting lambdas:

import math

def makeExpr():
if random.random() < 0.5:
return lambda x: math.sin(math.pi * makeExpr()(x))
else:
return lambda x: x

Note that we need to import the math library, which has support for all of the necessary mathematical functions and constants. One could easily extend this to support two variables, cosines, etc., but there is one flaw with the approach: once we’ve constructed the function, we have no idea what it is. Here’s what happens:

>>> x = lambda y: y + 1
>>> str(x)
'<function <lambda> at 0xb782b144>'

There’s no way for Python to know the textual contents of a lambda expression at runtime!  In order to remedy this, we turn to classes.

The inquisitive reader may have noticed by now that lots of things in Python have “associated things,” which roughly correspond to what you can type after suffixing an expression with a dot. Lists have methods like “[1,2,3,4].append(5)”, dictionaries have associated lists of keys and values, and even numbers have some secretive methods:

>>> 45.7.is_integer()
False

In many languages like C, this would be rubbish. Many languages distinguish between primitive types and objects, and numbers usually fall into the former category. However, in Python everything is an object. This means the dot operator may be used after any type, and as we see above this includes literals.

A class, then, is just a more transparent way of creating an object with certain associated pieces of data (the fancy word is encapsulation). For instance, if I wanted to have a type that represents a dog, I might write the following Python program:

class Dog:
age = 0
name = ""

def bark(self):
print("Ruff ruff! (I'm %s)" % self.name)

Then to use the new Dog class, I could create it and set its attributes appropriately:

fido = Dog()
fido.age = 4
fido.name = "Fido"
fido.weight = 100
fido.bark()

The details of the class construction requires a bit of explanation. First, we note that the indented block of code is arbitrary, and one need not “initialize” the member variables. Indeed, they simply pop into existence once they are referenced, as in the creation of the weight attribute. To make it more clear, Python provides a special function called “__init__()” (with two underscores on each side of “init”; heaven knows why they decided it should be so ugly), which is called upon the creation of a new object, in this case the expression “Dog()”. For instance, one could by default name their dogs “Fido” as follows:

class Dog:
def __init__(self):
self.name = "Fido"

d = Dog()
d.name             # contains "Fido"

This brings up another point: all methods of a class that wish to access the attributes of the class require an additional argument. The first argument passed to any method is always the object which represents the owning instance of the object. In Java, this is usually hidden from view, but available by the keyword “this”. In Python, one must explicitly represent it, and it is standard to name the variable “self”.

If we wanted to give the user a choice when instantiating their dog, we could include an extra argument for the name like this:

class Dog:
def __init__(self, name = 'Fido'):
self.name = name

d = Dog()
d.name                   # contains "Fido"
e = Dog("Manfred")
e.name                   # contains "Manfred"

Here we made it so the “name” argument is not required, and if it is excluded we default to “Fido.”

To get back to representing mathematical functions, we might represent the identity function on $x$ by the following class:

class X:
def eval(self, x, y):
return x

expr = X()
expr.eval(3,4)           # returns 3

That’s simple enough. But we still have the problem of not being able to print anything sensibly. Trying gives the following output:

>>> str(X)
'__main__.X'

In other words, all it does is print the name of the class, which is not enough if we want to have complicated nested expressions. It turns out that the “str” function is quite special. When one calls “str()” of something, Python first checks to see if the object being called has a method called “__str__()”, and if so, calls that. The awkward “__main__.X” is a default behavior. So if we soup up our class by adding a definition for “__str__()”, we can define the behavior of string conversion. For the X class this is simple enough:

class X:
def eval(self, x, y):
return x

def __str__(self):
return "x"

For nested functions we could recursively convert the argument, as in the following definition for a SinPi class:

class SinPi:
def __str__(self):
return "sin(pi*" + str(self.arg) + ")"

def eval(self, x, y):
return math.sin(math.pi * self.arg.eval(x,y))

Of course, this requires we set the “arg” attribute before calling these functions, and since we will only use these classes for random generation, we could include that sort of logic in the “__init__()” function.

To randomly construct expressions, we create the function “buildExpr”, which randomly picks to terminate or continue nesting things:

def buildExpr(prob = 0.99):
if random.random() < prob:
return random.choice([SinPi, CosPi, Times])(prob)
else:
return random.choice([X, Y])()

Here we have classes for cosine, sine, and multiplication, and the two variables. The reason for the interesting syntax (picking the class name from a list and then instantiating it, noting that these classes are objects even before instantiation and may be passed around as well!), is so that we can do the following trick, and avoid unnecessary recursion:

class SinPi:
def __init__(self, prob):
self.arg = buildExpr(prob * prob)

...

In words, each time we nest further, we exponentially decrease the probability that we will continue nesting in the future, and all the nesting logic is contained in the initialization of the object. We’re building an expression tree, and then when we evaluate an expression we have to walk down the tree and recursively evaluate the branches appropriately. Implementing the remaining classes is a quick exercise, and we remind the reader that the entire source code is available from this blog’s Github page. Printing out such expressions results in some nice long trees, but also some short ones:

>>> str(buildExpr())
'cos(pi*y)*sin(pi*y)'
>>> str(buildExpr())
'cos(pi*cos(pi*y*y*x)*cos(pi*sin(pi*x))*cos(pi*sin(pi*sin(pi*x)))*sin(pi*x))'
>>> str(buildExpr())
'cos(pi*cos(pi*y))*sin(pi*sin(pi*x*x))*cos(pi*y*cos(pi*sin(pi*sin(pi*x))))*sin(pi*cos(pi*sin(pi*x*x*cos(pi*y)))*cos(pi*y))'
>>> str(buildExpr())
'cos(pi*cos(pi*sin(pi*cos(pi*y)))*cos(pi*cos(pi*x)*y)*sin(pi*sin(pi*x)))'
>>> str(buildExpr())
'sin(pi*cos(pi*sin(pi*cos(pi*cos(pi*y)*x))*sin(pi*y)))'
>>> str(buildExpr())
'cos(pi*sin(pi*cos(pi*x)))*y*cos(pi*cos(pi*y)*y)*cos(pi*x)*sin(pi*sin(pi*y*y*x)*y*cos(pi*x))*sin(pi*sin(pi*x*y))'

This should work well for our goals. The rest is constructing the images.

## Images in Python, and the Python Imaging Library

The Python imaging library is part of the standard Python installation, and so we can access the part we need by adding the following line to our header:

from PIL import Image

Now we can construct a new canvas, and start setting some pixels.

canvas = Image.new("L", (300,300))
canvas.putpixel((150,150), 255)
canvas.save("test.png", "PNG")

This gives us a nice black square with a single white pixel in the center. The “L” argument to Image.new() says we’re working in grayscale, so that each pixel is a single 0-255 integer representing intensity. We can do this for three images, and merge them into a single color image using the following:

finalImage = Image.merge("RGB",
(redCanvas, greenCanvas, blueCanvas))

Where we construct “redCanvas”, “greenCanvas”, and “blueCanvas” in the same way above, but with the appropriate intensities. The rest of the details in the Python code are left for the reader to explore, but we dare say it is just bookkeeping and converting between image coordinate representations. At the end of this post, we provide a gallery of the randomly generated images, and a text file containing the corresponding expression trees is packaged with the source code on this blog’s Github page.

## Extending the Program With New Functions!

There is decidedly little mathematics in this project, but there are some things we can discuss. First, we note that there are many many many functions on the interval $[-1,1]$ that we could include in our random trees. A few examples are: the average of two numbers in that range, the absolute value, certain exponentials, and reciprocals of interesting sequences of numbers. We leave it as an exercise to the reader to add new functions to our existing code, and to further describe which functions achieve coherent effects.

Indeed, the designs are all rather psychedelic, and the layers of color are completely unrelated. It would be an interesting venture to write a program which, given an image of something (pretend it’s a simple image containing some shapes), constructs expression trees that are consistent with the curves and lines in the image. This follows suit with our goal of constructing low-complexity pictures from a while back, and indeed, these pictures have rather low Kolmogorov complexity. This method is another framework in which to describe their complexity, in that smaller expression trees correspond to simpler pictures. We leave this for future work. Until then, enjoy these pictures!