# Persistent Homology Talk at UIC: Slides

Today I gave a twenty-minute talk at UI Chicago as part of the first annual Chicago Area Student SIAM Conference. My talk was titled “Recent Developments in Persistent Homology,” and it foreshadows the theoretical foundations and computational implementations we’ll be laying out on this blog in the coming months. Here’s the abstract:

Persistent homology is a recently developed technique for analyzing the topology of data sets. We will give a rough overview of the technique and sample successful applications to areas such as natural image analysis & texture classification, breast and liver cancer classification, molecular dynamical systems, and others.

The talk was received very well — mostly, I believe, because I waved my hands on the theoretical aspects and spent most of my time talking about the applications.

In any case, although the slides I used for my talk were largely unannotated (I spoke much more than is contained in text on the slides), I did list a number of references to papers that have shown successful applications. As such, some of the audience members asked me to post the slides on the web.

I personally hate it when people post slides because they’re often taken out of context (or just slide after slide of dense text), so this is just a warning to the reader. If you didn’t attend my talk, the chances are you won’t get much out of these slides. If you did, I hope the slides will be useful for the references and to jog your memory about my talk.

About these ads

# Introducing Categories

It is time for us to formally define what a category is, to see a wealth of examples. In our next post we’ll see how the definitions laid out here translate to programming constructs. As we’ve said in our soft motivational post on categories, the point of category theory is to organize mathematical structures across various disciplines into a unified language. As such, most of this post will be devote to laying down the definition of a category and the associated notation. We will be as clear as possible to avoid a notational barrier for newcomers, so if anything is unclear we will clarify it in the comments.

## Definition of a Category

Let’s recall some examples of categories we’ve seen on this blog that serve to motivate the abstract definition of a category. We expect the reader to be comfortable with sets, and to absorb or glaze over the other examples as comfort dictates. The reader who is uncomfortable with sets and functions on sets should stop here. Instead, visit our primers on proof techniques, which doubles as a primer on set theory (or our terser primer on set theory from a two years ago).

The go-to example of a category is that of sets: sets together with functions between sets form a category. We will state exactly what this means momentarily, but first some examples of categories of “sets with structure” and “structure-preserving maps.”

Groups together with group homomorphisms form a category, as do rings and fields with their respective kinds of homomorphisms. Topological spaces together with continuous functions form a category, and metric spaces with distance-nonincreasing maps (“short” functions) form a sub-category. Vector spaces and linear maps, smooth manifolds and smooth maps, and algebraic varieties with rational maps all form categories. We could continue but the essential idea is clear: a category is some way to specify a collection of objects and “structure-preserving” mappings between those objects. There are three main features common to all of these examples:

1. Composition of structure-preserving maps produces structure-preserving maps.
2. Composition is associative.
3. There is an identity map for each object.

The main abstraction is that forgetting what the objects and mappings are and only considering how they behave allows one to deviate from the examples above in useful ways. For instance, once we see the formal definition below, it will become clear that mathematical (say, first-order logical) statements, together with proofs of implication, form a category. Even though a “proof” isn’t strictly a structure-preserving map, it still fits with the roughly stated axioms above. One can compose proofs by laying the implications out one after another, this composition is trivially associative, and there is an identity proof. Thus, proofs provide a way to “transform” true statements into true statements, preserving the structure of boolean-valued truth.

Another example is the category of ML types and computable functions in ML. Computable functions can be quite wild in their behavior, but they are nevertheless composable, associative, and equipped with an identity.

And so the definition of a category seems to come as a natural consequence to think of all of these examples as special cases of one concept.

Before we state the definition we should note that, for abstruse technical reasons, we cannot phrase the definition of a category as a “set” of objects and mappings between objects. This is already impossible for the category of sets, because there is no “set of all sets.” Somehow (as illustrated by Russell’s paradox) there are “too many” sets to do this. Likewise, there is no “set” of all groups, topological spaces, vector spaces, etc.

This apparent difficulty requires some sidestepping. One possibility is to define a universe of non-paradoxical sets, and define categories by way of a universe. Another is to define a class, which bypasses set theory in another way. We won’t deliberate on the differences between these methods of avoiding Russell’s paradox. The reader need only know that it can be done. For our official definition, we will use the terminology of classes.

Definition: A category $\textbf{C}$ consists of the following data:

• A class of objects, denoted $\textup{Obj}(\mathbf{C})$.
• For each pair of objects $A,B$, a set $\textup{Hom}(A,B)$ of morphisms. Sometimes these are called hom-sets.

The morphisms satisfy the following conditions:

• For all objects $A,B,C$ and morphisms $f \in \textup{Hom}(A,B), g \in \textup{Hom}(B,C)$ there is a composition operation $\circ$ and $g \circ f \in \textup{Hom}(A,C)$ is a morphism. We will henceforth drop the $\circ$ and write $gf$.
• Composition is associative.
• For all objects $A$, there is an identity morphism $1_A \in \textup{Hom}(A,A)$. For all $A,B$, and $f \in \textup{Hom}(A,B)$, we have $f 1_A = f$ and $1_B f = f$.

Some additional notation and terminology: we denote a morphism $f \in \textup{Hom}(A,B)$ in three ways. Aside from “as an element of a set,” the most general way is as a diagram,

$\displaystyle A \xrightarrow{\hspace{.5cm} f \hspace{.5cm}} B$

Although we will often shorten a single morphism to the standard function notation $f: A \to B$. Given a morphism $f: A \to B$, we call $A$ the source of $f$, and $B$ the target of $f$.

We will often name our categories, and we will do so in bold. So the category of sets with set-functions is denoted $\mathbf{Set}$. When working with multiple categories, we will give subscripts on the hom-sets to avoid ambiguity, as in $\textup{Hom}_{\mathbf{Set}}(A,B)$. The set of morphisms from an object to itself is called an endomorphism, and the set of all endomorphisms of an object $A$ is denoted $\textup{End}_{\mathbf{C}}(A)$.

Note that in the definition above we require that morphisms form a set. This is important because hom-sets will have additional algebraic structure (e.g., for certain categories $\textup{End}_{\mathbf{C}}(A)$ will form a ring).

## Examples of Categories

Lets now formally define some of our simple examples of categories. Defining a category amounts to specifying the objects and morphisms, and verifying the conditions in the definition hold.

Sets

Define the category $\mathbf{Set}$ whose objects are all sets, and whose morphisms are functions on sets. By now it should hopefully be clear that sets form a category, but let us go through the motions explicitly. Every set $A$ has an identity function $1_A(x) = x$, and as we already know, functions on sets compose associatively. To verify this in complete detail would amount to writing down a general function as a relation, and using the definitions from elementary set theory. We have more pressing matters, but a reader who has not seen this before should consult our set theory primer. One can also define the category of finite sets $\mathbf{FinSet}$ whose objects are finite sets and whose morphisms are again set-functions. As every object and morphism of $\mathbf{FinSet}$ is one of $\mathbf{Set}$, we call the latter a subcategory of the former.

Finite categories

The most trivial possible categories are those with only finitely many objects. For instance, define the category $\mathbf{1}$ to have a single object $\ast$ and a single morphism $1_{\ast}$, which must of course be the identity. The composition $1_{\ast} 1_{\ast}$ is forced to be $1_{\ast}$, and so this is a (rather useless) category. One can also imagine a category $\mathbf{2}$ which has one non-identity morphism $A \to B$ as well as examples of categories with any number of finite objects. Nothing interesting is going on here; we just completely specify the structure of the category object by object and morphism by morphism. Sometimes they come in handy, though.

Poset Categories

Here is an elementary example of a category that is nonetheless fundamental to modern discussions in topology and algebraic geometry. It will show up again in our work with persistent homology. Let $X$ be any set, and define the category $\mathbf{X}_\subset$ as follows. The objects of $\mathbf{X}_\subset$ are all the subsets of $X$. If $A \subset B$ are subsets of $X$, then the set $\textup{Hom}(A,B)$ is defined to be the (unique, unnamed) singleton $A \to B$. Otherwise, $\textup{Hom}(A,B)$ is the empty set. Identities exist since every set is a subset of itself. The property of being a subset is transitive, so composition of morphisms makes sense and associativity is trivial since there is at most one morphism between any two objects. If the reader doesn’t believe this, we state what composition is rigorously: define each morphism as a pair $(A,B)$ and define the composition operation as a set-function $\textup{Hom}(A,B) \times \textup{Hom}(B,C) \to \textup{Hom}(A,C)$, which sends $((A,B), (B,C)) \mapsto (A,C)$. Because this operation is so trivial, it seems more appropriate to state it with a diagram:

We say this diagram commutes if all ways to compose morphisms (travel along arrows) have equal results. That is, in the diagram above, we assert that the morphism $A \to B$ composed with the morphism $B \to C$ is exactly the one morphism $A \to C$. Usually one must prove that a diagram commutes, but in this case we are defining the composition operation so that commutativity holds. The reader can now directly verify that composition is associative:

$(a,b)((b,c)(c,d)) = (a,b)(b,d) = (a,d) = (a,c)(c,d) = ((a,b)(b,c))(c,d)$

More generally, it is not hard to see how any transitive reflexive relation on a set (including partial orders) can be used to form a category: objects are elements of the set, and morphisms are unique arrows which exist when the objects are (asymmetrically) related. The subset category above is a special case where the set in question is the power set of $X$, and the relation is $\subset$. The familiar reader should note that the most prominent example of this in higher mathematics is to have $X$ be the topology of a topological space (the set of open subsets).

Groups

Next, define the category $\mathbf{Grp}$ whose objects are groups and whose morphisms are group homomorphisms. Recall briefly that a group is a set $G$ endowed with a sensible (associative) binary operation denoted by multiplication, which has an identity and with respect to which every element has an inverse. For the uninitiated reader, just replace any abstract “group” by the set of all nonzero rational numbers with usual multiplication. It will suffice for this example.

A group homomorphism is a set-function $f: A \to B$ which satisfies $f(xy) = f(x)f(y)$ (here the binary operations on the left and right side of the equal sign are the operations in the two respective groups). Being set-functions, group homomorphisms are composable as functions and associatively so. Given any homomorphism $g: B \to C$, we need to show that the composite is again a homomorphism:

$\displaystyle gf(xy) = g(f(xy)) = g(f(x)f(y)) = g(f(x))g(f(y)) = gf(x) gf(y)$

Note that there are three multiplication operations floating around in this equation. Groups (as all sets) have identity maps, and the identity map is a perfectly good group homomorphism. This verifies that $\mathbf{Grp}$ is indeed a category. While we could have stated all of these equalities via commutative diagrams, the pictures are quite large and messy so we will avoid them until later.

A similar derivation will prove that rings form a category, as do vector spaces, topological spaces, and fields. We are unlikely to use these categories in great detail in this series, so we refrain from giving them names for now.

One special example of a category of groups is the category $\mathbf{Ab}$ of abelian groups (for which the multiplication operation is commutative). This category shows up as the prototypical example of a so-called “abelian category,” which means it has enough structure to do homology.

Graphs

In a more discrete domain, define by $\mathbf{Graph}$ as follows. The objects in this category are triples $(V,E, \varphi)$ where $\varphi: E \to V \times V$ represents edge adjacency. We will usually supress $\varphi$ by saying vertices $v,w$ are adjacent instead of $\varphi(e) = (v,w)$. The morphisms in this category are graph homomorphisms. That is, if $G = (V,E,\varphi)$ and $G' = (V', E',\varphi')$, then $f \in \textup{Hom}_{\mathbf{Graph}}(G,G')$ is a pair of set functions on $f_V: V \to V', f_E: E \to E'$, satisfying the following commutative diagram. Here we denote by $(f,f)$ the map which sends $(u,v) \to (f(u), f(v))$.

This diagram is quite a mouthful, but in words it requires that whenever $v,w$ are adjacent in $G$, then $f(v), f(w)$ are adjacent in $G'$. Rewriting the diagram as an equation more explicitly, we are saying that if $e \in E$ is an edge with $\varphi(e) = (v,w)$, then it must be the case that $(f_V(v), f_V(w)) = \varphi'(f_E(e))$. This is how one “preserves” the structure of a graph.

To prove this is a category, we can observe that composition makes sense: given two pairs

$\displaystyle f_V:V \to V'$
$\displaystyle f_E: E \to E'$

and

$\displaystyle g_{V'}: V' \to V''$
$\displaystyle g_{E'}: E' \to E''$

we can compose each morphism individually, getting $gf_V: V \to V''$ and $gf_E: E \to E''$, and the following hefty-looking commutative diagram.

Let’s verify commutativity together: we already know the two squares on the left and right commute (by hypothesis, they are morphisms in this category). So we have two things left to check, that the three ways to get from $E$ to $V'' \times V''$ are all the same. This amounts to verifying the two equalities

$\displaystyle (g_{V'}, g_{V'}) (f_V, f_V) \varphi = (g_{V'}, g_{V'}) \varphi' f_E = \varphi'' g_{E'} f_E$

But indeed, going left to right in the above equation, each transition from one expression to another only swaps two morphisms within one of the commutative squares. In other words, the first equality is already enforced by the commutativity of the left-hand square, and the second by the right. We are literally only substituting what we already know to be equal!

If it feels like we didn’t actually do any work there (aside from unravelling exactly what the diagrams mean), then you’re already starting to see the benefits of category theory. It can often feel like a cycle: commutative diagrams make it easy to argue about other commutative diagrams, and one can easily get lost in the wilderness of arrows. But more often than not, devoting a relatively small amount of time up front to show one diagram commutes will make a wealth of facts and theorems follow with no extra work. This is part of the reason category theory is affectionately referred to as “abstract nonsense.”

Diagram Categories

Speaking of abstract nonsense: next up is a very abstract example, but thinking about it will reinforce the ideas we’ve put forth in this post while giving a sneak peek to our treatment of universal properties. Fix an object $A$ in an arbitrary category $\mathbf{C}$. Define the category $\mathbf{C}_A$ whose objects are morphisms with source $A$. That is, an object of $\mathbf{C}_A$ looks like

Where $B$ ranges over all objects of $\mathbf{C}$. The morphisms of $\mathbf{C}_A$ are commutative diagrams of the form

where as we said earlier the stipulation asserted by the diagram is that $f' = gf$. Let’s verify that the axioms for a category hold. Suppose we have two such commutative diagrams with matching target and source, say,

Note that the arrows $g: A \to C$ must match up in both diagrams, or else composition does not make sense! Then we can combine them into a single commutative diagram:

If it is not obvious that this diagram commutes, we need only write down the argument explicitly: by the first diagram $\beta f = g$ and $\gamma g = h$, and piecing these together we have $\gamma \beta f = \gamma g = h$. Associativity of this “piecing together” follows from the associativity of the morphisms in $\mathbf{C}$. The identity morphism is a diagram whose two “legs” are both $f: A \to B$ and whose connecting morphism is just $1_B$.

This kind of “diagram” category is immensely important, and we will revisit it and many variants of it in the future. A quick sneak peek: this category is closely related to the universal properties of polynomial rings, free objects, and limits.

As a slight generalization, you can define a category whose objects consist of pairs of morphisms with a shared source (or target), i.e.,

We challenge the reader to write down what a morphism of this category would look like, and prove the axioms for a category hold (it’s not hard, but a bit messy). We will revisit categories like this one in our post on universal properties; this particular one is intimately related to products.

## Next Time

Next time we’ll jump straight into some code, and realize the definition of a category as a type in ML. We’ll see how some of our examples of categories here can be implemented using the code, and inspect the pro’s and con’s of the computational version of our definition.

# Categories, What’s the Point?

Perhaps primarily due to the prominence of monads in the Haskell programming language, programmers are often curious about category theory. Proponents of Haskell and other functional languages can put category-theoretic concepts on a pedestal or in a mexican restaurant, and their benefits can seem as mysterious as they are magical. For instance, the most common use of a monad in Haskell is to simulate the mutation of immutable data. Others include suspending and backtracking computations, and even untying tangled rope.

Category theory is often mocked (or praised) as the be-all and end-all of mathematical abstraction, and as such (and for other reasons I’ve explored on this blog) many have found it difficult to digest and impossible to master. However, in truth category theory arose from a need for organizing mathematical ideas based on their shared structure. In this post, I want to give a brief overview of what purpose category theory serves within mathematics, and emphasize what direction we’ll be going with it on this blog.

We should also note (writing this after the fact), that this article is meant to be a motivation to our future series on category theory. It is very difficult to explain what category theory is without going into very specific details, but we can explain by analogy what category theory achieves for us. That is the goal of this post.

## Category Theory as a Modern Language

It would be a silly question to ask why we don’t program entirely in binary. It’s a slow, inefficient process prone to mistakes of all sorts, and so we generally avoid it (although a well-rounded programmer can readily work in binary when it is necessary). But once upon a time there was no choice. Eventually people found that certain programmatic constructions were ubiquitous, and the developers of the next generation of languages abstracted these ideas to make types, lists, loops, if statements, and functions. The cycle continued: we found that we needed to further allow programmers to define custom data types, polymorphic operations, protected data, and others. Another iteration and we have list comprehensions, Python decorators, Javascript promises, and a whole host of other programming paradigms that have turned into features [1].

So it is with programming as it is in mathematics. I would digress by detailing the history of numbers as symbols, or of the transition to set-based mathematics in the late 1800′s to early 1900′s. But within the last fifty years there has been a major revolution in the discourse of modern mathematics: the integration of category theory.

I have to contextualize this immediately, because I don’t want to enter a metamathematical discussion about what the proper foundation of all mathematics should be. This is a huge can of worms riddled with strong opinions and logical fallacies. I’m not saying that mathematicians have unanimously agreed that category theory is the correct basis for all logic and mathematics. I wouldn’t claim that just as I wouldn’t claim that all programmers agree that object-oriented programming is the best model for all programs, because that would be ridiculous and myopic.

I am saying, however, that enough mathematicians have agreed category theory is useful and convenient. As a result category theory is the contemporary baseline for serious discussion in many fields of pure mathematics. In short, anyone who wants to do serious mathematics in these fields must learn the language of category theory.

One goal of this blog’s category theory series is to gain fair fluency in this modern language.

## Category Theory as an Organizing Principle

As most readers will readily understand, people who study and develop programming languages think differently about language features than people who just use programming languages. For instance, a systems programmer friend of mine was pleased to discover that Python supports multiple variable assignments in a single statement (e.g. a,b = 7, False). Now a language designer might smile and think, “Well that’s just syntactic sugar.” But to a working programmer, this kind of feature is quite convenient and extensible (to one-line variable swapping, multiple return values, etc.). On the other hand, when a language designer claims to have invented a wonderful new paradigm, say, continuations, the working programmer would think, “What good is that for? It’s just an ugly way to write functions!” It’s not until someone uses the feature to do amazing things like eliminate the need for a stack or provide lightweight extensible stack inspection that the working programmer appreciates the abstraction.

Analogously, those who study category theory from a logical viewpoint see it very differently than those who use it to come up with new mathematics. Many would argue it was not until the work of Alexander Grothendieck (in the late 1950′s and 60′s) that non-category-theorists saw the true value of applying category theory to their fields. Grothendieck implemented a sweeping reform of a number of mathematical fields, his most notable stemming from the invention of étale cohomology. Now I’m not a number theorist or an algebraic geometer, but I know enough to understand category theory’s role there.

The main benefit to using category theory is as a way to organize and synthesize information. This is particularly true of the concept of a universal property. We will hear more about this in due time, but as it turns out most important mathematical structures can be phrased in terms of universal properties. Moreover, a universal property jumps right to the heart of why a construction is important. For example, one new to mathematics might wonder why polynomials are so ubiquitous and important. The answer is (vaguely) that they satisfy a universal property which makes them canonical extensions of certain computations.

I want to make this point very clear, because most newcomers to category theory are never told this. Category theory exists because it fills a need. Even if that need is a need for better organization and a refocusing of existing definitions. It was not just an attempt to build higher abstractions, but a successful adaptation of mathematics to a more complex world.

And so as category theory has spread through the mathematical world, more and more definitions are phrased in terms of various universal constructions in special categories. This is a good thing precisely because there are only two universal properties. As we’ll see, by stating that an object satisfies a universal property, one immediately understands how the proof will progress, and many properties like uniqueness and invariance will follow trivially. Familiar readers of this blog will remember our posts on groups (and will read future posts on rings and fields), in which we state and prove theorems about quotients and products and isomorphism theorems which are all essentially the same across the various fields. Viewing the problem abstractly in category theory allows one to prove all of these theorems simultaneously, and study the differences between the objects via the properties of the category as a whole.

To reiterate, category theory streamlines the process of making precise technical definitions and proving their well-definition. One hopes, then, that very general theorems proved within category theory can apply to a wide breadth of practical areas. [2]

## Category as a Tool to Gain More Knowledge

When someone invents a new programming tool, the primary goal is usually to allow a programmer to do something that he couldn’t do previously (or was difficult/inconvenient to do). For instance, programmers invented version control to allow for easy project collaboration and rollbacks. Before then, managing multiple versions of a file was a horrendous task.

In mathematics, we can ask the poignant question: what does category theory allow us to do that we couldn’t do before? This should be read as besides having a new way to think about mathematical structures and besides having a more efficient language for discourse. Of course, this is a highly philosophical question. Could it be that there is some (non-categorical) theorem that can’t be proved unless you resort to category-theoretical arguments? In my optimistic mind the answer must certainly be no. Moreover, it appears that most proofs that “rely” on category theory only really do so because they’re so deeply embedded in the abstraction that unraveling them to find non-category-theoretical proofs would be a tiresome and fruitless process.

In programming we can ask a related question: what insights does category theory give us about programming? Can we write programs better if we resort to organizing things in terms of categories? Would it be easier to prove correctness of our programs or to discover a good algorithm to solve a task?

I think it goes without saying that we certainly can’t do anything that we couldn’t have done before (this would violate the notion that our usual languages are Turing complete). But these other soft questions should have positive answers. While in the previous two sections I gave concrete reasons why one might want to learn category theory, here the reason is very vague. Supposedly, learning category theory makes one a better programmer by forcing one to make connections between structures and computation. Then when a new problem comes along, it becomes easy (almost natural!) to fit it into a categorical organization of the world, and the right solution just “falls into your lap.”

While I don’t necessarily espouse this line of thinking (I believe any mathematics makes you better at analyzing problems), this is essentially the argument for why functional programming is a good thing to learn.

## What We’ll Do With Categories

I don’t necessarily have any amazing applications of category theory in mind for this blog. Instead, I want to develop a fair fluency and categorical organization (the first to sections of this article) among my readers. Along the way, we will additionally implement the concepts of category theory in code. This will give us a chance to play with the ideas as we learn, and hopefully will make all of the abstract nonsense much more concrete.

So next time we’ll start with the definition of a category, and give a wealth of examples. Until then!

[1] I’m obviously oversimplifying the history of programming languages, but the spirit is true, and the same as for all technological developments: incremental improvements based on a need for convenience and removing repetitive tasks. (back)
[2] One place this is particularly convenient is actually in the theory of persistent homology. Though on this blog the plan is to avoid the theory side before we investigate the algorithm from a high-level, once we get to the theory we will see an effective use of category theory in action. (back)

# Probabilistic Bounds — A Primer

Probabilistic arguments are a key tool for the analysis of algorithms in machine learning theory and probability theory. They also assume a prominent role in the analysis of randomized and streaming algorithms, where one imposes a restriction on the amount of storage space an algorithm is allowed to use for its computations (usually sublinear in the size of the input).

While a whole host of probabilistic arguments are used, one theorem in particular (or family of theorems) is ubiquitous: the Chernoff bound. In its simplest form, the Chernoff bound gives an exponential bound on the deviation of sums of random variables from their expected value.

This is perhaps most important to algorithm analysis in the following mindset. Say we have a program whose output is a random variable $X$. Moreover suppose that the expected value of $X$ is the correct output of the algorithm. Then we can run the algorithm multiple times and take a median (or some sort of average) across all runs. The probability that the algorithm gives a wildly incorrect answer is the probability that more than half of the runs give values which are wildly far from their expected value. Chernoff’s bound ensures this will happen with small probability.

So this post is dedicated to presenting the main versions of the Chernoff bound that are used in learning theory and randomized algorithms. Unfortunately the proof of the Chernoff bound in its full glory is beyond the scope of this blog. However, we will give short proofs of weaker, simpler bounds as a straightforward application of this blog’s previous work laying down the theory.

If the reader has not yet intuited it, this post will rely heavily on the mathematical formalisms of probability theory. We will assume our reader is familiar with the material from our first probability theory primer, and it certainly wouldn’t hurt to have read our conditional probability theory primer, though we won’t use conditional probability directly. We will refrain from using measure-theoretic probability theory entirely (some day my colleagues in analysis will like me, but not today).

## Two Easy Bounds of Markov and Chebyshev

The first bound we’ll investigate is almost trivial in nature, but comes in handy. Suppose we have a random variable $X$ which is non-negative (as a function). Markov’s inequality is the statement that, for any constant $a > 0$,

$\displaystyle \textup{P}(X \geq a) \leq \frac{\textup{E}(X)}{a}$

In words, the probability that $X$ grows larger than some fixed constant is bounded by a quantity that is inversely proportional to the constant.

The proof is quite simple. Let $\chi_a$ be the indicator random variable for the event that $X \geq a$ ($\chi_a = 1$ when $X \geq a$ and zero otherwise). As with all indicator random variables, the expected value of $\chi_a$ is the probability that the event happens (if this is mysterious, use the definition of expected value). So $\textup{E}(\chi_a) = \textup{P}(X \geq a)$, and linearity of expectation allows us to include a factor of $a$:

$\textup{E}(a \chi_a) = a \textup{P}(X \geq a)$

The rest of the proof is simply the observation that $\textup{E}(a \chi_a) \leq \textup{E}(X)$. Indeed, as random variables we have the inequality $a \chi_a \leq X$. Whenever $a < X$, the former is equal to zero while the latter is nonnegative. And whenever $a \geq X$, the former is precisely $a$ while the latter is by assumption at least $a$. It follows that $\textup{E}(a \chi_a) \leq \textup{E}(X)$.

This last point is a simple property of expectation we omitted from our first primer. It usually goes by monotonicity of expectation, and we prove it here. First, if $X \geq 0$ then $\textup{E}(X) \geq 0$ (this is trivial). Second, if $0 \leq X \leq Y$, then define a new random variable $Z = Y-X$. Since $Z \geq 0$ and using linearity of expectation, it must be that $\textup{E}(Z) = \textup{E}(Y) - \textup{E}(X) \geq 0$. Hence $\textup{E}(X) \geq \textup{E}(Y)$. Note that we do require that $X$ has a finite expected value for this argument to work, but if this is not the case then Markov’s inequality is nonsensical anyway.

Markov’s inequality by itself is not particularly impressive or useful. For example, if $X$ is the number of heads in a hundred coin flips, Markov’s inequality ensures us that the probability of getting at least 99 heads is at most 50/99, which is about 1/2. Shocking. We know that the true probability is much closer to $2^{-100}$, so Markov’s inequality is a bust.

However, it does give us a more useful bound as a corollary. This bound is known as Chebyshev’s inequality, and its use is sometimes referred to as the second moment method because it gives a bound based on the variance of a random variable (instead of the expected value, the “first moment”).

The statement is as follows.

Chebyshev’s Inequality: Let $X$ be a random variable with finite expected value and positive variance. Then we can bound the probability that $X$ deviates from its expected value by a quantity that is proportional to the variance of $X$. In particular, for any $\lambda > 0$,

$\displaystyle \textup{P}(|X - \textup{E}(X)| \geq \lambda) \leq \frac{\textup{Var}(X)}{\lambda^2}$

And without any additional assumptions on $X$, this bound is sharp.

Proof. The proof is a simple application of Markov’s inequality. Let $Y = (X - \textup{E}(X))^2$, so that $\textup{E}(Y) = \textup{Var}(X)$. Then by Markov’s inequality

$\textup{P}(Y \geq \lambda^2) \leq \frac{\textup{E}(Y)}{\lambda^2}$

Since $Y$ is nonnegative $|X - \textup{E}(X)| = \sqrt(Y)$, and $\textup{P}(Y \geq \lambda^2) = \textup{P}(|X - \textup{E}(X)| \geq \lambda)$. The theorem is proved. $\square$

Chebyshev’s inequality shows up in so many different places (and usually in rather dry, technical bits), that it’s difficult to give a good example application.  Here is one that shows up somewhat often.

Say $X$ is a nonnegative integer-valued random variable, and we want to argue about when $X = 0$ versus when $X > 0$, given that we know $\textup{E}(X)$. No matter how large $\textup{E}(X)$ is, it can still be possible that $\textup{P}(X = 0)$ is arbitrarily close to 1. As a colorful example, let $X$ is the number of alien lifeforms discovered in the next ten years. We might debate that $\textup{E}(X)$ can arbitrarily large: if some unexpected scientific and technological breakthroughs occur tomorrow, we could discover an unbounded number of lifeforms. On the other hand, we are very likely not to discover any, and probability theory allows for such a random variable to exist.

If we know everything about $\textup{Var}(X)$, however, we can get more informed bounds.

Theorem: If $\textup{E}(X) \neq 0$, then $\displaystyle \textup{P}(X = 0) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}$.

Proof. Simply choose $\lambda = \textup{E}(X)$ and apply Chebyshev’s inequality.

$\displaystyle \textup{P}(X = 0) \leq \textup{P}(|X - \textup{E}(X)| \geq \textup{E}(X)) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}$

The first inequality follows from the fact that the only time $X$ can ever be zero is when $|X - \textup{E}(X)| = \textup{E}(X)$, and $X=0$ only accounts for one such possibility. $\square$

This theorem says more. If we know that $\textup{Var}(X)$ is significantly smaller than $\textup{E}(X)^2$, then $X > 0$ is more certain to occur. More precisely, and more computationally minded, suppose we have a sequence of random variables $X_n$ so that $\textup{E}(X_n) \to \infty$ as $n \to \infty$. Then the theorem says that if $\textup{Var}(X_n) = o(\textup{E}(X_n)^2)$, then $\textup{P}(X_n > 0) \to 1$. Remembering one of our very early primers on asymptotic notation, $f = o(g)$ means that $f$ grows asymptotically slower than $g$, and in terms of this fraction $\textup{Var}(X) / \textup{E}(X)^2$, this means that the denominator dominates the fraction so that the whole thing tends to zero.

## The Chernoff Bound

The Chernoff bound takes advantage of an additional hypothesis: our random variable is a sum of independent coin flips. We can use this to get exponential bounds on the deviation of the sum. More rigorously,

Theorem: Let $X_1 , \dots, X_n$ be independent random $\left \{ 0,1 \right \}$-valued variables, and let $X = \sum X_i$. Suppose that $\mu = \textup{E}(X)$. Then the probability that $X$ deviates from $\mu$ by more than a factor of $\lambda > 0$ is bounded from above:

$\displaystyle \textup{P}(X > (1+\lambda)\mu) \leq \frac{e^{\lambda \mu}}{(1+\lambda)^{(1+\lambda)\mu}}$

The proof is beyond the scope of this post, but we point the interested reader to these lecture notes.

We can apply the Chernoff bound in an easy example. Say all $X_i$ are fair coin flips, and we’re interested in the probability of getting more than 3/4 of the coins heads. Here $\mu = n/2$ and $\lambda = 1/2$, so the probability is bounded from above by

$\displaystyle \left ( \frac{e}{(3/2)^3} \right )^{n/4} \approx \frac{1}{5^n}$

So as the number of coin flips grows, the probability of seeing such an occurrence diminishes extremely quickly to zero. This is important because if we want to test to see if, say, the coins are biased toward flipping heads, we can simply run an experiment with $n$ sufficiently large. If we observe that more than 3/4 of the flips give heads, then we proclaim the coins are biased and we can be assured we are correct with high probability. Of course, after seeing 3/4 of more heads we’d be really confident that the coin is biased. A more realistic approach is to define some $\varepsilon$ that is small enough so as to say, “if some event occurs whose probability is smaller than $\varepsilon$, then I call shenanigans.” Then decide how many coins and what bound one would need to make the bad event have probability approximately $\varepsilon$. Finding this balance is one of the more difficult aspects of probabilistic algorithms, and as we’ll see later all of these quantities are left as variables and the correct values are discovered in the course of the proof.

## Chernoff-Hoeffding Inequality

The Hoeffding inequality (named after the Finnish statistician, Wassily Høffding) is a variant of the Chernoff bound, but often the bounds are collectively known as Chernoff-Hoeffding inequalities. The form that Hoeffding is known for can be thought of as a simplification and a slight generalization of Chernoff’s bound above.

Theorem: Let $X_1, \dots, X_n$ be independent random variables whose values are within some range $[a,b]$. Call $\mu_i = \textup{E}(X_i)$, $X = \sum_i X_i$, and $\mu \textup{E}(X) = \sum_i \mu_i$. Then for all $t > 0$,

$\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{(-2t^2)/(b-a)^2}$

For example, if we are interested in the sum of $n$ rolls of a fair six-sided die, then the probability that we deviate from $(7/2)n$ by more than $\sqrt{n}$ is bounded by $2e^{(-2n / 25)}$. Supposing we want to know how many rolls we need to guarantee with probability 0.01 that we don’t deviate too much, we just do the algebra:

$2e^{(-2n / 25)} < 0.01$
$-2n / 25 < \log (0.005)$
$n > \log(0.005^{(-25/2)}) \approx 66$

So with 67 rolls we can be confident that the sum of the rolls will lie between 222 and 240.

Another version of this theorem concerns the average of the $X_i$, and is only a minor modification of the above.

Theorem: If $X_1, \dots, X_n$ are as above, and $X = \frac{1}{n} \sum_i X_i$, with $\mu = \frac{1}{n}(\sum_i \mu_i)$, then for all $t > 0$, we get the following bound

$\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{(-2nt^2)/(b-a)^2}$

The only difference here is the extra factor of $n$ in the exponent. So the deviation is exponential both in the amount of deviation ($t^2$), and in the number of trials.

This theorem comes up very often in learning theory, in particular to prove Boosting works. Mathematicians will joke about how all theorems in learning theory are just applications of Chernoff-Hoeffding-type bounds. We’ll of course be seeing it again as we investigate boosting and the PAC-learning model in future posts, so we’ll see the theorems applied to their fullest extent then.

Until next time!

# Computing Homology

In our last post in this series on topology, we defined the homology group. Specifically, we built up a topological space as a simplicial complex (a mess of triangles glued together), we defined an algebraic way to represent collections of simplices called chains as vectors in a vector space, we defined the boundary homomorphism as a linear map on chains, and finally defined the homology groups as the quotient vector spaces

$\displaystyle H_k(X) = \frac{\textup{ker} \partial_k}{\textup{im} \partial_{k+1}}$.

The number of holes in $X$ was just the dimension of this quotient space.

In this post we will be quite a bit more explicit. Because the chain groups are vector spaces and the boundary mappings are linear maps, they can be represented as matrices whose dimensions depend on our simplicial complex structure. Better yet, if we have explicit representations of our chains by way of a basis, then we can use row-reduction techniques to write the matrix in a standard form.

Of course the problem arises when we want to work with two matrices simultaneously (to compute the kernel-mod-image quotient above). This is not computationally any more difficult, but it requires some theoretical fiddling. We will need to dip a bit deeper into our linear algebra toolboxes to see how it works, so the rusty reader should brush up on their linear algebra before continuing (or at least take some time to sort things out if or when confusion strikes).

Without further ado, let’s do an extended example and work our ways toward a general algorithm. As usual, all of the code used for this post is available on this blog’s Google Code page.

## Two Big Matrices

Recall our example simplicial complex from last time.

We will compute $H_1$ of this simplex (which we saw last time was $\mathbb{Q}$) in a more algorithmic way than we did last time.

Once again, we label the vertices 0-4 so that the extra “arm” has vertex 4 in the middle, and its two endpoints are 0 and 2. This gave us orientations on all of the simplices, and the following chain groups. Since the vertex labels (and ordering) are part of the data of a simplicial complex, we have made no choices in writing these down.

$\displaystyle C_0(X) = \textup{span} \left \{ 0,1,2,3,4 \right \}$

$\displaystyle C_1(X) = \textup{span} \left \{ [0,1], [0,2], [0,3], [0,4], [1,2], [1,3],[2,3],[2,4] \right \}$

$\displaystyle C_2(X) = \textup{span} \left \{ [0,1,2], [0,1,3], [0,2,3], [1,2,3] \right \}$

Now given our known definitions of $\partial_k$ as an alternating sum from last time, we can give a complete specification of the boundary map as a matrix. For $\partial_1$, this would be

$\displaystyle \partial_1 = \bordermatrix{ & [0,1] & [0,2] & [0,3] & [0,4] & [1,2] & [1,3] & [2,3] & [2,4] \cr 0 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0\cr 1 & 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\cr 2 & 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1 \cr 3 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \cr 4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 }$,

where the row labels are the basis for $C_0(X)$ and the column labels are the basis for $C_1(X)$. Similarly, $\partial_2$ is

$\displaystyle \partial_2 = \bordermatrix{ & [0,1,2] & [0,1,3] & [0,2,3] & [1,2,3] \cr [0,1] & 1 & 1 & 0 & 0\cr [0,2] & -1 & 0 & 1 & 0\cr [0,3] & 0 & -1 & -1 & 0\cr [0,4] & 0 & 0 & 0 & 0\cr [1,2] & 1 & 0 & 0 & 1\cr [1,3] & 0 & 1 & 0 & -1\cr [2,3] & 0 & 0 & 1 & 1\cr [2,4] & 0 & 0 & 0 & 0}$

The reader is encouraged to check that these matrices are written correctly by referring to the formula for $\partial$ as given last time.

Remember the crucial property of $\partial$, that $\partial^2 = \partial_k \partial_{k+1} = 0$. Indeed, the composition of the two boundary maps just corresponds to the matrix product of the two matrices, and one can verify by hand that the above two matrices multiply to the zero matrix.

We know from basic linear algebra how to compute the kernel of a linear map expressed as a matrix: column reduce and inspect the columns of zeros. Since the process of row reducing is really a change of basis, we can encapsulate the reduction inside a single invertible matrix $A$, which, when left-multiplied by $\partial$, gives us the reduced form of the latter. So write the reduced form of $\partial_1$ as $\partial_1 A$.

However, now we’re using two different sets of bases for the shared vector space involved in $\partial_1$ and $\partial_2$. In general, it will no longer be the case that $\partial_kA\partial_{k+1} = 0$. The way to alleviate this is to perform the “corresponding” change of basis in $\partial_{k+1}$. To make this idea more transparent, we return to the basics.

## Changing Two Matrices Simultaneously

Recall that a matrix $M$ represents a linear map between two vector spaces $f : V \to W$. The actual entries of $M$ depend crucially on the choice of a basis for the domain and codomain. Indeed, if $v_i$ form a basis for $V$ and $w_j$ for $W$, then the $k$-th column of the matrix representation $M$ is defined to be the coefficients of the representation of $f(v_k)$ in terms of the $w_j$. We hope to have nailed this concept down firmly in our first linear algebra primer.

Recall further that row operations correspond to changing a basis for the codomain, and column operations correspond to changing a basis for the domain. For example, the idea of swapping columns $i,j$ in $M$ gives a new matrix which is the representation of $f$ with respect to the (ordered) basis for $V$ which swaps the order of $v_i , v_j$. Similar things happen for all column operations (they all correspond to manipulations of the basis for $V$), while analogously row operations implicitly transform the basis for the codomain. Note, though, that the connection between row operations and transformations of the basis for the codomain are slightly more complicated than they are for the column operations. We will explicitly see how it works later in the post.

And so if we’re working with two maps $A: U \to V$ and $B: V \to W$, and we change a basis for $V$ in $B$ via column reductions, then in order to be consistent, we need to change the basis for $V$ in $A$ via “complementary” row reductions. That is, if we call the change of basis matrix $Q$, then we’re implicitly sticking $Q$ in between the composition $BA$ to get $(BQ)A$. This is not the same map as $BA$, but we can make it the same map by adding a $Q^{-1}$ in the right place:

$\displaystyle BA = B(QQ^{-1})A = (BQ)(Q^{-1}A)$

Indeed, whenever $Q$ is a change of basis matrix so is $Q^{-1}$ (trivially), and moreover the operations that $Q$ performs on the columns of $B$ are precisely the operations that $Q^{-1}$ performs on the rows of $A$ (this is because elementary row operations take different forms when multiplied on the left or right).

Coming back to our boundary operators, we want a canonical way to view the image of $\partial_{k+1}$ as sitting inside the kernel of $\partial_k$. If we go ahead and use column reductions to transform $\partial_k$ into a form where the kernel is easy to read off (as the columns consisting entirely of zeroes), then the corresponding row operations, when performed on $\partial_{k+1}$ will tell us exactly the image of $\partial_{k+1}$ inside the kernel of $\partial_k$.

This last point is true precisely because $\textup{im} \partial_{k+1} \subset \textup{ker} \partial_k$. This fact guarantees that the irrelevant rows of the reduced version of $\partial_{k+1}$ are all zero.

Let’s go ahead and see this in action on our two big matrices above. For $\partial_1$, the column reduction matrix is

$\displaystyle A = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & -1 & 0 & 1 & 1\\ 0 & 0 & 0 & 1 & 0 & -1 & -1 & 0\\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & -1\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

And the product $\partial_1 A$ is

$\displaystyle \partial_1 A = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \end{pmatrix}$

Now the inverse of $A$, which is the corresponding basis change for $\partial_2$, is

$\displaystyle A^{-1} = \begin{pmatrix} -1 & -1 & -1 & -1 & -0 & -0 & -0 & -0\\ 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1\\ 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

and the corresponding reduced form of $\partial_2$ is

$\displaystyle A^{-1} \partial_2 = \begin{pmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 1 & 0 & 0 & 1\\ 0 & 1 & 0 & -1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}$

As a side note, we got these matrices by slightly modifying the code from our original post on row reduction to output the change of basis matrix in addition to performing row reduction. It turns out one can implement column reduction as row reduction of the transpose, and the change of basis matrix you get from this process will be the transpose of the change of basis matrix you want (by $(AB)^\textup{T} = (B^\textup{T}A^\textup{T})$). Though the code is particularly ad-hoc, we include it with the rest of the code used in this post on this blog’s Google Code page.

Now let’s inspect the two matrices $\partial_1 A$ and $A^{-1} \partial_2$ more closely. The former has four “pivots” left over, and this corresponds to the rank of the matrix being 4. Moreover, the four basis vectors representing the columns with nonzero pivots, which we’ll call $v_1, v_2, v_3, v_4$ (we don’t care what their values are), span a complementary subspace to the kernel of $\partial_1$. Hence, the remaining four vectors (which we’ll call $v_5, v_6, v_7, v_8$) span the kernel. In particular, this says that the kernel has dimension 4.

On the other hand, we performed the same transformation of the basis of $C_1(X)$ for $\partial_2$. Looking at the matrix that resulted, we see that the first four rows and the last row (representing $v_1, v_2, v_3, v_4, v_8$) are entirely zeros and so the image of $\partial_2$ intersects their span trivially. and the remaining three rows (representing $v_5, v_6, v_7$) have nonzero pivots. This tells us exactly that the image of $\partial_2$ is spanned by $v_5, v_6, v_7$.

And now, the coup de grâce, the quotient to get homology is simply

$\displaystyle \frac{ \textup{span} \left \{ v_5, v_6, v_7, v_8 \right \}}{ \textup{span} \left \{ v_5, v_6, v_7 \right \}} = \textup{span} \left \{ v_8 \right \}$

And the dimension of the homology group is 1, as desired.

## The General Algorithm

It is no coincidence that things worked out at nicely as they did. The process we took of simultaneously rewriting two matrices with respect to a common basis is the bulk of the algorithm to compute homology. Since we’re really only interested in the dimensions of the homology groups, we just need to count pivots. If the number of pivots arising in $\partial_k$ is $y$ and the number of pivots arising in $\partial_{k+1}$ is $z$, and the dimension of $C_k(X)$ is $n$, then the dimension is exactly

$(n-y) - z = \textup{dim}(\textup{ker} \partial_k) - \textup{dim}(\textup{im}\partial_{k+1})$

And it is no coincidence that the pivots lined up so nicely to allow us to count dimensions this way. It is a minor exercise to prove it formally, but the fact that the composition $\partial_k \partial_{k+1} = 0$ implies that the reduced version of $\partial_{k+1}$ will have an almost reduced row-echelon form (the only difference being the rows of zeros interspersed above, below, and possibly between pivot rows).

As the reader may have guessed at this point, we don’t actually need to compute $A$ and $A^{-1}$. Instead of this, we can perform the column/row reductions simultaneously on the two matrices. The above analysis helped us prove the algorithm works, and with that guarantee we can throw out the analytical baggage and just compute the damn thing.

Indeed, assuming the input is already processed as two matrices representing the boundary operators with respect to the standard bases of the chain groups, computing homology is only slightly more difficult than row reducing in the first place. Putting our homology where our mouth is, we’ve implemented the algorithm in Python. As usual, the entire code used in this post is available on this blog’s Google Code page.

The first step is writing auxiliary functions to do elementary row and column operations on matrices. For this post, we will do everything in numpy (which makes the syntax shorter than standard Python syntax, but dependent on the numpy library).

import numpy

def rowSwap(A, i, j):
temp = numpy.copy(A[i, :])
A[i, :] = A[j, :]
A[j, :] = temp

def colSwap(A, i, j):
temp = numpy.copy(A[:, i])
A[:, i] = A[:, j]
A[:, j] = temp

def scaleCol(A, i, c):
A[:, i] *= c*numpy.ones(A.shape[0])

def scaleRow(A, i, c):
A[i, :] *= c*numpy.ones(A.shape[1])

def colCombine(A, addTo, scaleCol, scaleAmt):
A[:, addTo] += scaleAmt * A[:, scaleCol]

def rowCombine(A, addTo, scaleRow, scaleAmt):
A[addTo, :] += scaleAmt * A[scaleRow, :]


From here, the main meat of the algorithm is doing column reduction on one matrix, and applying the corresponding row operations on the other.

def simultaneousReduce(A, B):
if A.shape[1] != B.shape[0]:
raise Exception("Matrices have the wrong shape.")

numRows, numCols = A.shape # col reduce A

i,j = 0,0
while True:
if i >= numRows or j >= numCols:
break

if A[i][j] == 0:
nonzeroCol = j
while nonzeroCol < numCols and A[i,nonzeroCol] == 0:
nonzeroCol += 1

if nonzeroCol == numCols:
j += 1
continue

colSwap(A, j, nonzeroCol)
rowSwap(B, j, nonzeroCol)

pivot = A[i,j]
scaleCol(A, j, 1.0 / pivot)
scaleRow(B, j, 1.0 / pivot)

for otherCol in range(0, numCols):
if otherCol == j:
continue
if A[i, otherCol] != 0:
scaleAmt = -A[i, otherCol]
colCombine(A, otherCol, j, scaleAmt)
rowCombine(B, j, otherCol, -scaleAmt)

i += 1; j+= 1

return A,B


This more or less parallels the standard algorithm for row-reduction (with the caveat that all the indices are swapped to do column-reduction). The only somewhat confusing line is the call to rowCombine, which explicitly realizes the corresponding row operation as the inverse of the performed column operation. Note that for row operations, the correspondence between operations on the basis and operations on the rows is not as direct as it is for columns. What’s given above is the true correspondence. Writing down lots of examples will reveal why, and we leave that as an exercise to the reader.

Then the actual algorithm to compute homology is just a matter of counting pivots. Here are two pivot-counting functions in a typical numpy fashion:

def numPivotCols(A):
z = numpy.zeros(A.shape[0])
return [numpy.all(A[:, j] == z) for j in range(A.shape[1])].count(False)

def numPivotRows(A):
z = numpy.zeros(A.shape[1])
return [numpy.all(A[i, :] == z) for i in range(A.shape[0])].count(False)


And the final function is just:

def bettiNumber(d_k, d_kplus1):
A, B = numpy.copy(d_k), numpy.copy(d_kplus1)
simultaneousReduce(A, B)

dimKChains = A.shape[1]
kernelDim = dimKChains - numPivotCols(A)
imageDim = numPivotRows(B)

return kernelDim - imageDim


And there we have it! We’ve finally tackled the beast, and written a program to compute algebraic features of a topological space.

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex using by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis.

Until then!

# A Sample of Standard ML, the TreeSort Algorithm, and Monoids

In this post we will assume the reader has a passing familiarity with some of the basic concepts of functional programming (the map, fold, and filter functions). We introduce these topics in our Racket primer, but the average reader will understand the majority of this primer without expertise in functional programming.

## Preface: ML for Category Theory

A few of my readers have been asking for more posts about functional languages and algorithms written in functional languages. While I do have a personal appreciation for the Haskell programming language (and I plan to do a separate primer for it), I have wanted to explore category theory within the context of programming for quite a while now. From what I can tell, ML is a better choice than Haskell for this.

Part of the reason is that, while many Haskell enthusiasts claim it to be a direct implementation of category theory, Haskell actually tweaks category theoretic concepts in certain ways. I rest assured that the designers of Haskell (who are by assumption infinitely better at everything than I am) have very good reasons for doing this. But rather than sort through the details of the Haskell language specification to clarify the finer details, we would learn a lot more by implementing category theory by hand in a programming language that doesn’t have such concepts already.

And so we turn to ML.

ML, which stands for MetaLanguage, apparently has historical precedents for being a first in many things. One of these is an explicit recognition of parametric polymorphism, which is the idea that an operation can have the same functionality regardless of the types of the data involved; the types can, in effect, be considered variables. Another ground-breaking aspect of ML is an explicit type inference system. Similar to Haskell, an ML program will not run unless the compiler can directly prove that the program produces the correct types in every step of the computation.

Both of these features are benefits for the student of category theory. Most of our time in category theory will be spent working with very general assumptions on the capabilities of our data involved, and parametric polymorphism will be our main tool for describing what these assumptions are and for laying out function signatures.

As a side note, I’ve noticed through my ever-growing teaching experiences that one of the main things new programming students struggle with (specifically, after mastering the syntax and semantics of basic language constructs) is keeping their types straight. This is especially prominent in a language like Python (which is what I teach), where duck-typing is so convenient that it lulls the students into a false sense of security. Sooner as opposed to later they’ll add strings to numbers with the blind confidence that Python will simply get it. Around this time in their first semester of programming, I would estimate that type errors lie at the heart of 75% of the bugs my students face and fail to resolve before asking me for help. So one benefit of programming in ML for pedagogy is that it is literally impossible to make type errors. The second you try to run a program with bad types, the compiler points out what the expected type is and what the given (incorrect) type was. It takes a while to get used to type variables (and appeasing the type checker when you want to play fast and loose). But once you do you’ll find the only bugs that remain in your code are conceptual ones, which are of course much more rewarding and important bugs to fix.

So enough of this preamble. Let’s learn some ML!

## Warming Up: Basic Arithmetic and Conditionals

We’ll be working with the Standard ML of New Jersey compiler, which you can download for free at their website. The file extension for ML files is .sml.

As one would expect, ML has variables and arithmetic which work in much the same way as other languages. Each variable declaration is prefixed by the word “val,” as below

val x = 7;
val y = 2;

This statements modify the global environment (the list of which variable names are associated to which values). Semicolons are required to terminate variable declarations at the global level. We can declare multiple variables in a single line using the “and” keyword

val x = 7 and y = 2;

As a precaution, “and” is only used in ML for syntactic conjunctions of variable/function declarations, and is only necessary when the two defined variable names are mutually defined in terms of each other (this can happen naturally for recursive function definitions). We will see in a moment that the logical and operation is denoted “andalso.”

We can also use pattern matching to bind these two variables in one line, much the same way as it might work in Python:

val (x,y) = (7,2);

We note that while ML does not require us to specify the type of a variable, the type is known and ever present under the surface. If we run the above code through the sml compiler (which after running the contents of a file opens a REPL to further evaluate commands), we see the following output

[opening vars.sml]
val x = 7 : int
val n = 2 : int

The environment is printed out to the user, and it displays that the two types are “int.”

Arithmetic is defined for integers, and the standard ones we will use are the expected +, -, *, and (not a slash, but) div. Here are some examples, and here is a list of all basic operations on ints. A few things to note: the unary negation operator is a tilde (~), and the semicolons are only used terminate statements in the REPL, which tells the compiler we’re ready for it to evaluate our code. Semicolons can also be used to place multiple statements on a single line of code. The “it” variable is a REPL construct which saves the most recent unbound expression evaluation.

- 3 + 6;
val it = 9 : int
- 6 div 3;
val it = 2 : int
- 2 * 9;
val it = 18 : int
- 2 - 9;
val it = ~7 : int
- ~9;
val it = ~9 : int

ML also has floating point arithmetic (in ML this type is called “real”), but treats it in a very prudish manner. Specifically (and this is a taste of the type checker doing its job too well), ML does not coerce types for you. If you want to multiply a real number and an integer, you have to first convert the int to a real and then multiply. An error will occur if you do not:

- val x = 4.0;
val x = 4.0 : real
- val y = 7;
val y = 7 : int
- x * y;
stdIn:5.1-5.6 Error: operator and operand don't agree [tycon mismatch]
operator domain: real * real
operand:         real * int
in expression:
x * y
- x * Real.fromInt(y);
val it = 28.0 : real

Here is a list of all operations on reals. We don’t anticipate using reals much, but it’s good to know that ML fervently separates them.

It seems odd that we’re talking so much about statements, because often enough we will be either binding function names (and tests) to the global environment, or restricting ourselves to local variable declarations. The latter has a slightly more complicated syntax, simply surrounding your variable declarations and evaluated code in a “let … in … end” expression. This will be a much more common construction for us.

let
val x = 7
val y = 9
in
(x + 2*y) div 3
end

The “in” expression is run with the combined variables from the ambient environment (the variables declared outside of the let) and those defined inside the let. The variables defined in the let leave scope after the “in” expression is evaluated, and the entire let expression as a whole evaluates to the result of evaluating the “in” expression. Clearly and example shows what is going on much more directly than words.

The last major basic expression form are the boolean expressions and operations. The type for booleans in ML is called “bool,” and the two possible values are “true,” and “false.” They have the usual unary and binary operations, but the names are a bit weird. Binary conjunction is called “andalso,” while binary disjunction is called “orelse.”

val a = true and b = false;
val c = (a andalso b) orelse ((not a) andalso (not b));

But aside from that, boolean expressions work largely as one would expect. There are the six standard numerical comparison functions, where testing for equality is given by a single equals sign (in most languages, comparison for equality is ==), and inequality is given by the diamond operator <>. The others are, as usual, <, <=, >, >=.

- 6 = 7;
val it = false : bool
- 6 = 6;
val it = true : bool
- 6 < 7;
val it = true : bool
- 7 <= 6;
val it = false : bool
- 6 <> 7;
val it = true : bool

ML also has the standard if statement, which has the following syntax, which is more or less the same as most languages:

- val x = if 6 < 7 then ~1 else 4;
val x = ~1 : int

ML gives the programmer more or less complete freedom with whitespace, so any of these expressions can be spread out across multiple lines if the writer desires.

val x = if 6 < 7
then
~1
else
4

This can sometimes be helpful when defining things inside of let expressions inside of function definitions (inside of other function definitions, inside of …).

So far the basics are easy, if perhaps syntactically different from most other languages we’re familiar with. So let’s move on to the true heart of ML and all functional programming languages: functions.

## Functions and cases, recursion

Now that we have basic types and conditions, we can start to define some simple functions. In the global environment, functions are defined the same way as values, using the word “fun” in the place of “val.” For instance, here is a function that adds 3 to a number.

fun add3(x) = x+3

The left hand side is the function signature, and the right hand side is the body expression. As in Racket, and distinct from most imperative languages, a function evaluates to whatever the body expression evaluates to. Calling functions has two possible syntaxes:

add3(5)
add3 5
(add3 5)

In other words, if the function application is unambiguous, parentheses aren’t required. Otherwise, one can specify precedence using parentheses in either Racket (Lisp) style or in standard mathematical style.

The most significant difference between ML and most other programming languages, is that ML’s functions have case-checking. That is, we can specify what action is to be taken based on the argument, and these actions are completely disjoint in the function definition (no if statements are needed).

For instance, we could define an add3 function which nefariously does the wrong thing when the user inputs 7.

fun add3(7) = 2
| add3(x) = x+3

The vertical bar is read “or,” and the idea is that the possible cases for the function definition must be written in most-specific to least-specific order. For example, interchanging the orders of the add3 function cases gives the following error:

- fun add3(x) = x+3
=   | add3(7) = 2;
stdIn:13.5-14.16 Error: match redundant
x => ...
-->   7 => ...

Functions can call themselves recursively, and this is the main way to implement loops in ML. For instance (and this is quite an inefficient example), I could define a function to check whether a number is even as follows.

fun even(0) = true
| even(n) = not(even(n-1))

Don’t cringe too visibly; we will see recursion used in less horrifying ways in a moment.

Functions with multiple arguments are similarly easy, but there are two semantic possibilities for how to define the arguments. The first, and simplest is what we would expect from a typical language: put commas in between the arguments.

fun add(x,y) = x+y

When one calls the add function, one is forced to supply both arguments immediately. This is usually how programs are written, but often times it can be convenient to only supply one argument, and defer the second argument until later.

If this sounds like black magic, you can thank mathematicians for it. The technique is called currying, and the idea stems from the lambda calculus, in which we can model all computation using just functions (with a single argument) as objects, and function application. Numbers, arithmetic, lists, all of these things are modeled in terms of functions and function calls; the amazing thing is that everything can be done with just these two tools. If readers are interested, we could do a post or two on the lambda calculus to see exactly how these techniques work; the fun part would be that we can actually write programs to prove the theorems.

Function currying is built-in to Standard ML, and to get it requires a minor change in syntax. Here is the add function rewritten in a curried style.

fun add(x)(y) = x+y

Now we can, for instance, define the add3 function in terms of add as follows:

val add3 = add(3)

And we can curry the second argument by defining a new function which defers the first argument appropriately.

fun add6(x) = add(x)(6)

Of course, in this example addition is commutative so which argument you pick is useless.

We should also note that we can define anonymous functions as values (for instance, in a let expression) using this syntax:

val f = (fn x => x+3)

The “fn x => x+3″ is just like a “lambda x: x+3″ expression in Python, or a “(lambda (x) (+ x 3))” in Racket. Note that one can also define functions using the “fun” syntax in a let expression, so this is truly only for use-once function arguments.

## Tuples, Lists, and Types

As we’ve discovered already, ML figures out the types of our expressions for us. That is, if we define the function “add” as above (in the REPL),

- fun add(x,y) = x+y;
val add = fn : int * int -> int

then ML is smart enough to know that “add” is to accept a list of two ints (we’ll get to what the asterisk means in a moment) and returns an int.

The curried version is similarly intuited:

- fun add(x)(y) = x+y;
val add = fn : int -> int -> int

The parentheses are implied here: int -> (int -> int). So that this is a function which accepts as input an int, and produces as output another function which accepts an int and returns an int.

But, what if we’d like to use “add” to add real numbers? ML simply won’t allow us to; it will complain that we’re providing the wrong types. In order to make things work, we can tell ML that the arguments are reals, and it will deduce that “+” here means addition on reals and not addition on ints. This is one awkward thing about ML; while the compiler is usually able to determine the most general possible type for our functions, it has no general type for elements of a field, and instead defaults to int whenever it encounters arithmetic operations. In any case, this is how to force a function to have a type signature involving reals:

- fun add(x:real, y:real) = x+y;
val add = fn : real * real -> real

If we’re going to talk about types, we need to know all of ML’s syntax for its types. Of course there are the basics (int, real, bool). Then there are function types: int -> int is the type of a function which accepts one int and returns an int.

We’ll see two new types in this section, and the first is the tuple. In ML, tuples are heterogeneous, so we can mix types in them. For instance,

- val tup = (true, 7, 4.4);
val tup = (true,7,4.4) : bool * int * real

Here the asterisk denotes the tuple type, and one familiar with set theory can think of a tuple as an element of the product of sets, in this case

$\displaystyle \left \{ \textup{true}, \textup{false} \right \} \times \mathbb{Z} \times \mathbb{R}$

Indeed, there is a distinct type for each possible kind of tuple. A tuple of three ints (int * int * int) is a distint type from a tuple of three booleans (bool * bool * bool). When we define a function that has multiple arguments using the comma syntax, we are really defining a function which accepts as input a single argument which is a tuple. This parallels exactly how functions on multiple arguments work in classical mathematics.

The second kind of compound type we’ll use quite often is the list. Lists are distinct from tuples in ML in that lists must be homogenous. So a list of integers (which has type “int list”) is different from a list of booleans.

The operations on lists are almost identical as in Haskell. To construct explicit lists use square brackets with comma-delimited elements. To construct them one piece at a time, use the :: list constructor operation. For those readers who haven’t had much experience with lists in functional programming: all lists are linked lists, and the :: operation is the operation of appending a single value to the beginning of a given list. Here are some examples.

- val L = [1,2,3];
val L = [1,2,3] : int list

- val L = 1::(2::(3::nil));
val L = [1,2,3] : int list

The “nil” expression is the empty list, as is the empty-square-brackets expression “[]“.

There is a third kind of compound type called the record, and it is analogous to a C struct, where each field is named. We will mention this in the future once we have a need for it.

The most important thing about lists and tuples in ML is that functions which operate on them don’t always have an obvious type signature. For instance, here is a function which takes in a tuple of two elements and returns the first element.

fun first(x,y) = x

What should the type of this function be? We want it to be able to work with any tuple, no matter the type. As it turns out, ML was one of the first programming language to allow this sort of construction to work. The formal name is parametric polymorphism. It means that a function can operate on many different kinds of types (because the actual computation is the same regardless of the types involved) and the full type signature can be deduced for each function call based on the parameters for that particular call.

The other kind of polymorphism is called ad-hoc polymorphism. Essentially this means that multiple (very different) operations can have the same name. For instance, addition of integers and addition of floating point numbers require two very different sets of instructions, but they both go under the name of +.

What ML does to make its type system understand parametric polymorphism is introduce so-called type variables. A type variable is any string prefixed by a single quote, e.g. ‘a, and they can represent any type. When ML encounters a function with an ambiguous type signature, it decides what the most general possible type is (which usually involves a lot of type variables), and then uses that type.

So to complete our example, the first function has the type

'a * 'b -> 'a

As a side note, the “first” function is in a sense the “canonical” operation that has this type signature. If nothing is known about the types, then no other action can happen besides the projection. There are some more interesting things to be said about such canonical operations (for instance, could we get away with not having to even define them at all).

The analogous version for lists is as follows. Note that in order to decompose a list into its first element and the tail list, we need to use pattern matching.

fun listFirst([x]) = x
| listFirst(head::tail) = head

And this function has the type signature ‘a list -> ‘a. As a slightly more complicated example (where we need recursion), we can write a function to test for list membership.

fun member(x, nil) = false
| member(x, (head::tail)) = if x = head then true
else member(x, tail)

If you run this program and see some interesting warning messages, see this StackOverflow question for a clarification.

## Defining New Types

The simplest way to define a new type is to just enumerate all possibilities. For instance, here is an enumerated datatype with four possibilities.

datatype maths = algebra | analysis | logic | computation

Then we can define functions which operate on those types using pattern matching.

fun awesome(algebra) = true
| awesome(analysis) = false
| awesome(logic) = false
| awesome(computation) = true

And this function has type maths -> bool (don’t take it too seriously ). We can also define data types whose constructors require arguments.

datatype language = functional of string*bool
| imperative of string
| other

Here we define a language to be functional or imperative. The functional type consists of a name and a boolean representing whether it is purely functional, while the imperative type just consists of a name. We can then construct these types by treating the type constructors as if they were functions.

val haskell = functional("Haskell", true) and
java = imperative("Java") and
prolog = other;

Perhaps more useful than this is to define types using type variables. A running example we will use for the remainder of this post is a binary tree of homogeneous elements at each node. Defining such types is easy: all one needs to do is place the appropriate type (with parentheses to clarify when the description of a type starts or ends) after the “of” keyword.

datatype 'a Tree = empty
| leaf of 'a
| node of (('a Tree) * 'a * ('a Tree))

We can create instances of an integer tree as expected:

val t2 = node(node(leaf(2), 3, leaf(4)), 6, leaf(8))

## The TreeSort Algorithm

We can define a host of useful operations on trees (of any type). For instance, below we compute the breadth of a tree (the total number of leaves), the depth of a tree (the maximal length of a path from the root to a leaf), and the ability to flatten a tree (traverse the tree in order and place all of the values into a list). These first two are a nice warm-up.

fun breadth(empty) = 0
| breadth(leaf(_)) = 1
| breadth(node(left, _, right)) = breadth(left) + breadth(right)

Here the underscore indicates a pattern match with a variable we won’t actually use. This is more space efficient for the compiler; it can avoid adding extra values to the current environment in a potentially deep recursion.

fun depth(empty) = 0
| depth(leaf(_)) = 1
| depth(node(left, _, right)) =
let
val (lDepth, rDepth) = (1 + depth(left), 1 + depth(right))
in
if lDepth > rDepth then lDepth else rDepth
end

This function should be self explanatory.

fun flatten(empty) = []
| flatten(leaf(x)) = [x]
| flatten(node(left, x, right)) =
flatten(left) @ (x :: flatten(right))

Here the @ symbol is list concatenation. This is not quite the most efficient way to do this (we are going to write a forthcoming post about tail-call optimization, and there we will see why), but it is certainly the clearest. In the final recursive call, we traverse the left subtree first, flattening it into a list in order. Then we flatten the right hand side, and put the current node’s element in between the two flattened parts.

Note that if our tree is ordered, then flatten will produce a strictly increasing list of the elements in the tree. For those readers unfamiliar with ordered binary trees, for all intents and purposes this is an “int Tree” and we can compare the values at different nodes. Then an ordered binary tree is a tree which satisfies the following property for each node: all of the values in the left child’s subtree are strictly smaller than the current node’s value, and all of the values in the right child’s subtree are greater than or equal to the current node’s value.

Indeed, we can use the flatten function as part of a simple algorithm to sort lists of numbers. First we insert the numbers in the unsorted list into a tree (in a way that preserves the ordered property at each step), and then we flatten the tree into a sorted list. This algorithm is called TreeSort, and the insert function is simple as well.

fun insert(x, empty) = leaf(x)
| insert(y, leaf(x)) = if x <= y
then node(empty, x, leaf(y))
else node(leaf(y), x, empty)
| insert(y, node(left, x, right)) =
if x <= y
then node(left, x, insert(y, right))
else node(insert(y, left), x, right)

If we’re at a nonempty node, then we just recursively insert into the appropriate subtree (taking care to create new interior nodes if we’re at a leaf). Note that we do not do any additional computation to ensure that the tree is balanced (that each node has approximately as many children in its left subtree as in its right). Doing so would digress from the point of this primer, but rest assured that the problem of keeping trees balanced has long been solved.

Now the process of calling insert on every element of a list is just a simple application of fold. ML does have the standard map, foldl, foldr, and filter functions, although apparently filter is not available in the standard library (one needs to reference the List module via List.filter).

In any case, foldl is written in the currying style and building the tree is a simple application of it. As we said, the full sorting algorithm is just the result of flattening the resulting tree with our in-order traversal.

fun sort(L) = flatten(foldl(insert)(empty)(L))

So there you have it! One of the simplest (efficient) sorting algorithms I can think of in about twelve lines of code.

## Free Monoid Homomorphisms: A More Advanced Example

Just to get a quick taste of what our series on category theory will entail, let’s write a program with a slightly more complicated type signature. The idea hinges on the idea that lists form a what’s called a free monoid. In particular,

Definition: monoid is a set $X$ equipped with an associative binary operation $\cdot: X \times X \to X$ and an identity element $1 \in X$ for which $x1 = 1x = x$ for all $x \in X$.

Those readers who have been following our series on group theory will recognize a monoid as a group with less structure (there is no guarantee of inverses for the $\cdot$ operation). The salient fact for this example is that the set of ML values of type $\textup{'a list}$ forms a monoid. The operation is list concatenation, and the identity element is the empty list. Call the empty list nil and the append operation @, as it is in ML.

More than that, $\textup{'a list}$ forms a free monoid, and the idea of freeness has multiple ways of realization. One sort of elementary way to understand freeness is that the only way to use the binary operation to get to the identity is to have the two summands both be the identity element. In terms of lists, the only way to concatenate two lists to get the empty list is to have both pieces be empty to begin with.

Another, more mature (more category-theoretical) way to think about freeness is to say is satisfies the following “universal” property. Call $A$ the set of values of type ‘a, and $[A]$ the set of values of type $\textup{'a list}$, and suppose that $(B, \cdot_B, 1_B)$ is the datum of an arbitrary monoid. The universal property says that if we are given a function $f: A \to B$, and we take the canonical map $g: A \to [A]$ which maps an element $a \in A$ to the single-entry list $[a] \in [A]$, then there is a unique way to extend $f$ to a monoid homomorphism $f^* : [A] \to B$ on lists, so that $f^*g = f$. We have mentioned monoid homomorphisms on this blog before in the context of string metrics, but briefly a monoid homomorphism respects the monoid structure in the sense that (for this example) $f^*(a \textup{ @ } b) = f^*(a) \cdot_B f^*(b)$ no matter what $a, b$ are.

This was quite a mouthful, but the data is often written in terms of a so-called “commutative diagram,” whose general definition we will defer to a future post. The diagram for this example looks like:

The dashed line says we are asserting the existence of $f^*$, and the symbol $\exists !$ says this function exists and is uniquely determined by $f, g$. The diagram “commutes” in the sense that traveling from $A$ to $B$ along $f$ gives you the same computational result as traveling by $g$ and then $f^*$. The reason for the word “universal” will become clear in future posts, but vaguely it’s because the set $[A]$ is a unique “starting place” in a special category.

If this talk is too mysterious, we can just go ahead and prove that $f^*$ exists by writing a program that computes the function transforming $f \mapsto f^*$. We call the function “listMonoidLift” because it “lifts” the function $f$ from just operating on $A$ to the status of a monoid homomorphism. Very regal, indeed.

Part of the beauty of this function is that a number of different list operations (list length, list sum, member tests, map, etc.), when viewed under this lens, all become special cases of this theorem! By thinking about things in terms of monoids, we write less code, and more importantly we recognize that these functions all have the same structural signature. Perhaps one can think of it like parametric polymorphism on steroids.

fun listMonoidLift(f:('a->'b), (combine:(('b * 'b) -> 'b), id:'b)) =
let
fun f'(nil) = id
| f'(head::tail) = combine(f(head), f'(tail))
in
f'
end

Here we specified the types of the input arguments to be completely clear what’s what. The first argument is our function $f$ as in the above diagram, and the second two arguments together form the data of our monoid (the set $B$ is implicitly the collection of types $'b$ determined at the time of the function call). Now let’s see how the list summation and list length functions can be written in terms of the listMonoidLift function.

fun plus(x, y) = x + y
val sum = listMonoidLift((fn x => x), (plus, 0))
val length = listMonoidLift((fn x => 1), (plus, 0))

The plus function on integers with zero as the identity is the monoid $B$ in both cases (and also happens to be $A$ by coincidence), but in the summation case the function $f$ is the identity and for length it is the constant $1$ function.

As a more interesting example, see how list membership is a lift.

fun member(x) = listMonoidLift((fn y => y = x),
((fn (a,b) => a orelse b), false))

Here the member function is curried; it has type ‘a -> ‘a list -> bool (though it’s a bit convoluted since listMonoidLift is what’s returning the ‘a list -> bool part of the type signature). Here the $B$ monoid is the monoid of boolean values, where the operation is logical “or” and the identity is false. It is a coincidence and a simple exercise to prove that $B$ is a free monoid as well.

Now the mapping $f(y)$ is the test to see if $y$ is the same object as $x$. The lift to the list monoid will compute the logical “or” of all evaluations of $f$ on the values.

Indeed, (although this author hates bringing up too many buzzwords where they aren’t entirely welcome) the monoid lifting operation we’ve just described is closely related to the MapReduce framework (without all the distributed computing parts). Part of the benefit of MapReduce is that the programmer need only define the Map() and Reduce() functions (the heart of the computation) and MapReduce does the rest. What this example shows is that defining the Map() function can be even simpler: one only needs define the function $f$, and Map() is computed as $f^*$ automatically. The Reduce() part is simply the definition of the target monoid $B$.

Just to drive this point home, we give the reader a special exercise: write map as a special case of listMonoidLift. The result (the map function) should have one of the two type signatures:

map : ('a -> 'b) * ('a list) -> 'b list
map : 'a -> 'b -> 'a list -> b' list

As a hint, the target monoid should also be a list monoid.

Part of why this author is hesitant to bring up contemporary software in discussing these ideas is because the ideas themselves are far from contemporary. Burstall and Landin’s, 1969 text Programs and Their Proofs (which this author would love to find a copy of) details this exact reduction and other versions of this idea in a more specific kind of structure called “free algebras.” So MapReduce (minus the computer programs and distributed systems) was a well-thought-out idea long before the internet or massively distributed systems were commonplace.

In any case, we’ll start the next post in this series right off with the mathematical concept of a category. We’ll start slow and detail many examples of categories that show up both in (elementary) mathematics and computing, and then move on to universal properties, functors, natural transformations, and more, implementing the ideas all along the way.

Until then!

# Support Math ∩ Programming: T-Shirts, Mugs, Donations

Being a graduate student is more time consuming than most people realize. Between learning very difficult material and attending classes, teaching your own classes, grading, going to seminars and giving talks, negotiating the graduate employee contract, studying for qualifying exams, publishing research papers, travelling to conferences, and regularly finding that quiet four hours needed to really make progress in thinking about a problem, it can be a challenge to find time to do things like blog.

And of course, there’s always the issue of salary. Currently the University of Illinois at Chicago estimates the cost of living in Chicago and attending UIC is around $17,000 per year. Ironically (and we’ve brought this up in our contract negotiations many times), the standard wage for graduate students here is only about$15,000. The math department here scraps together enough to bump our salaries up a bit more (mostly to account for extra fees we pay), but I find myself continually taking on extra jobs on the side to supplement my income. Most prominently this takes the form of competitive summer research internships, but these are by no means guaranteed. During the school year I find additional web design work, e.g., for local high school math organizations, to supplement my income.

But all of this time spent looking for additional sources of income, and doing the jobs themselves, seriously limits my blogging time! Since my blog has grown over the past two years to gain a modest readership, I would absolutely love to devote more of my time to writing and producing increasingly high-quality content. Nothing makes me happier than spreading knowledge, doing math, and writing awesome programs. The support and feedback from my readers has shown that my blog has become a valuable international resource.

[And I'm so pleased that my readers are so diverse! I'd like to shout out to my South African readers (yes, I've noticed you; you're making quite a large dent in my statistics). I just got my first hit from Tajikistan recently. Sadly nothing from Iran or North Korea yet, but I can dream.]

Here are some examples of topics I’ve planned to write about for a while:

• Primers on the basic data structures; structured trees (BSTs, quadtrees, prefix trees), hash tables, priority queues, etc.
• More on Fourier transforms, specifically extending to higher dimensions and doing image processing algorithms as common in Gimp and Photoshop. Continue with wavelet transforms and other more modern advances in signal processing.
• A whole host of cutting-edge machine learning topics: support vector machines (implemented from scratch), boosting, multi-armed bandit problems, bayesian classification, deep learning and sparse autoencoders.
• Randomized and streaming algorithms. Graph property testing.
• Computational category theory.
• Social network analysis.
• Computational geometry and applications.
• More programming primers, specifically ML, Haskell, Java, C.
• Guest posts from my colleagues in topics like: elliptic curve cryptography/factorization, financial mathematics, and baking
• Video tutorials of mathematics in action! Sometimes hearing a voice explain a topic is much easier than reading it.

If any of these topics sound great, and you want to see them come sooner, or you just want to support this blog, then you can make that happen by either buying some of the cool merchandise I designed, or by donating directly via PayPal.

Additional reasons to support:

• I blog out of the goodness of his heart and his excitement for mathematics and programming.
• All revenue directly supports the blog.
• High quality content on technical subjects: each blog post ranges from 2,000 – 6,000 words in length, with loads of hand-crafted diagrams and source code explanations.
• All source code files and data sets are released for free on his Google Code page, so you have the entire working examples at your fingertips.
• With enough donations he can afford to spend more time blogging and upgrade his blog’s appearance and functionality (e.g. include live Javascript demos instead of linking to an external site).

## Shirts and Mugs!

Okay, so I’ll admit I am no graphic designer (if any readers who are graphic designers want to submit awesome designs, I’d love to see them). But I have to admit, sitting down on a quiet Sunday morning to write some proofs or crunch out some programs is much more enjoyable with my Math ∩ Programming mug.

The mug has this blog’s lambda keyboard key logo (a larger version of the favicon) on the other side, which you can see by going to the product page. This one is actually a draft I got to make sure the colors turned out right (they didn’t and I had to modify them). I wasn’t satisfied with the text layout either (my colleagues kept thinking it meant $\textup{Math} \subset \textup{Programming}$).

I’ve also designed a number of t-shirts, with various designs such as this one:

It’s supposed to look like a bunch of keyboard keys, with rows representing concepts and columns from left to right representing the progression of a concept from theory to practice. I had a list of about twelve ideas for “rows,” but for many of them I couldn’t figure out a sensible entry for one of the columns :-/

So far this is the only t-shirt I’ve put up for sale, as I’m not quite satisfied with the others yet. Designs are in the works for additional t-shirts and sweatshirts, and things like drawstring bags. Any suggestions from fans of this blog would be extremely welcome.

## Donate

If you don’t like any of these designs, but you still want to support Math ∩ Programming, I’ve set up a PayPal account for donations. You are welcome to donate as little or as much as you’d like, and all of the proceeds will go toward improving this blog.  I personally feel like donating is much more fun when you get some swag for it, but I imagine most of my readers will have a plethora of mugs and t-shirts, and may not want any more. And of course this option is much more efficient.

Thanks once again to all my wonderful readers. Your readership really fuels this blog and makes writing it that much more rewarding. I can’t wait to see the mathematical and programming treasures this blog leads us to next.

Until next time!