Rings — A Primer

Previously on this blog, we’ve covered two major kinds of algebraic objects: the vector space and the group. There are at least two more fundamental algebraic objects every mathematician should something know about. The first, and the focus of this primer, is the ring. The second, which we’ve mentioned briefly in passing on this blog, is the field. There are a few others important to the pure mathematician, such as the R-module (here R is a ring). These do have some nice computational properties, but in order to even begin to talk about them we need to know about rings.

A Very Special Kind of Group

Recall that an abelian group (G, +) is a set G paired with a commutative binary operation +, where G has a special identity element called 0 which acts as an identity for +. The archetypal example of an abelian group is, of course, the integers \mathbb{Z} under addition, with zero playing the role of the identity element.

The easiest way to think of a ring is as an abelian group with more structure. This structure comes in the form of a multiplication operation which is “compatible” with the addition coming from the group structure.

Definition:ring (R, +, \cdot) is a set R which forms an abelian group under + (with additive identity 0), and has an additional associative binary operation \cdot with an element 1 serving as a (two-sided) multiplicative identity. Furthermore, \cdot distributes over + in the sense that for all x,y,z \in R

x(y+z) = xy + xz and (y+z)x = yx + zx

The most important thing to note is that multiplication is not commutative both in general rings and for most rings in practice. If multiplication is commutative, then the ring is called commutative. Some easy examples of commutative rings include rings of numbers like \mathbb{Z}, \mathbb{Z}/n\mathbb{Z}, \mathbb{Q}, \mathbb{R}, which are just the abelian groups we know and love with multiplication added on.

If the reader takes anything away from this post, it should be the following:

Rings generalize arithmetic with integers.

Of course, this would imply that all rings are commutative, but this is not the case. More meaty and tempestuous examples of rings are very visibly noncommutative. One of the most important examples are rings of matrices. In particular, denote by M_n(\mathbb{R}) the set of all n \times n matrices with real valued entries. This forms a ring under addition and multiplication of matrices, and has as a multiplicative identity the n \times n identity matrix I_n.

Commutative rings are much more well-understood than noncommutative rings, and the study of the former is called commutative algebra. This is the main prerequisite for fields like algebraic geometry, which (in the simplest examples) associate commutative rings to geometric objects.

For us, all rings will have an identity, but many ring theorists will point out that one can just as easily define a ring to not have a multiplicative identity. We will call these non-unital rings, and will rarely, if ever, see them on this blog.

Another very important example of a concrete ring is the polynomial ring in n variables with coefficients in \mathbb{Q} or \mathbb{R}. This ring is denoted with square brackets denoting the variables, e.g. \mathbb{R}[x_1, x_2, \dots , x_n]. We rest assured that the reader is familiar with addition and multiplication of polynomials, and that this indeed forms a ring.

Kindergarten Math

Let’s start with some easy properties of rings. We will denote our generic ring by R.

First, the multiplicative identity of a ring is unique. The proof is exactly the same as it was for groups, but note that identities must be two-sided for this to work. If 1, 1' are two identities, then 1 = 1 \cdot 1' = 1'.

Next, we prove that 0a = a0 = 0 for all a \in R. Indeed, by the fact that multiplication distributes across addition, 0a = (0 + 0)a = 0a + 0a, and additively canceling 0a from both sides gives 0a = 0. An identical proof works for a0.

In fact, pretty much any “obvious property” from elementary arithmetic is satisfied for rings. For instance, -(-a) = a and (-a)b = a(-b) = -(ab) and (-1)^2 = 1 are all trivial to prove. Here is a list of these and more properties which we invite the reader to prove.

Zero Divisors, Integral Domains, and Units

One thing that is very much not automatically given in the general theory of rings is multiplicative cancellation. That is, if I have ac = bc then it is not guaranteed to be the case that a = b. It is quite easy to come up with examples in modular arithmetic on integers; if R = \mathbb{Z}/8\mathbb{Z} then 2*6 = 6*6 = 4 \mod 8, but 2 \neq 6.

The reason for this phenomenon is that many rings have elements that lack multiplicative inverses. In \mathbb{Z}/8\mathbb{Z}, for instance, 2 has no multiplicative inverse (and neither does 6). Indeed, one is often interested in determining which elements are invertible in a ring and which elements are not. In a seemingly unrelated issue, one is interested in determining whether one can multiply any given element x \in R by some y to get zero. It turns out that these two conditions are disjoint, and closely related to our further inspection of special classes of rings.

Definition: An element x of a ring R is said to be a left zero-divisor if there is some y \neq 0 such that xy = 0. Similarly, x is a right zero-divisor if there is a z for which zx = 0. If x is a left and right zero-divisor (e.g. if R is commutative), it is just called a zero-divisor.

Definition: Let x,y \in R. The element y is said to be a left inverse to x if yx = 1, and a right inverse if xy = 1. If there is some z \neq 0 for which xz = zx = 1, then x is said to be a two-sided inverse and z is called the inverse of x, and x is called a unit.

As a quick warmup, we prove that if x has a left  and a right inverse then it has a two-sided inverse. Indeed, if zx = 1 = xy, then z = z(xy) = (zx)y = y, so in fact the left and right inverses are the same.

The salient fact here is that having a (left- or right-) inverse allows one to do (left- or right-) cancellation, since obviously when ac = bc and c has a right inverse, we can multiply acc^{-1} = bcc^{-1} to get a=b. We will usually work with two-sided inverses and zero-divisors (since we will usually work in a commutative ring). But in non-commutative rings, like rings of matrices, one-sided phenomena do run rampant, and one must distinguish between them.

The right way to relate these two concepts is as follows. If c has a right inverse, then define the right-multiplication function (- \cdot c) : R \to R which takes x and spits out xc. In fact, this function is an injection. Indeed, we already proved that (because c has a right inverse) if xc = yc then x = y. In particular, there is a unique preimage of 0 under this map. Since 0c = 0 is always true, then it must be the case that the only way to left-multiply c times something to get zero is 0c. That is, c is not a right zero-divisor if right-multiplication by c is injective. On the other hand, if the map is not injective, then there are some x \neq y such that xc = yc, implying (x-y)c = 0, and this proves that c is a right zero-divisor. We can do exactly the same argument with left-multiplication.

But there is one minor complication: what if right-multiplication is injective, but c has no inverses? It’s not hard to come up with an example: 2 as an element of the ring of integers \mathbb{Z} is a perfectly good one. It’s neither a zero-divisor nor a unit.

This basic study of zero-divisors gives us some natural definitions:

Definition: A division ring is a ring in which every element has a two-sided inverse.

If we allow that R is commutative, we get something even better (more familiar: \mathbb{Q}, \mathbb{R}, \mathbb{C} are the standard examples of fields).

Definition: field is a nonzero commutative division ring.

The “nonzero” part here is just to avoid the case when the ring is the trivial ring (sometimes called the zero ring) with one element. i.e., the set \left \{ 0 \right \} is a ring in which zero satisfies both the additive and multiplicative identities. The zero ring is excluded from being a field for silly reasons: elegant theorems will hold for all fields except the zero ring, and it would be messy to require every theorem to add the condition that the field in question is nonzero.

We will have much more to say about fields later on this blog, but for now let’s just state one very non-obvious and interesting result in non-commutative algebra, known as Wedderburn’s Little Theorem.

Theorem: Every finite divison ring is a field.

That is, simply having finitely many elements in a division ring is enough to prove that multiplication is commutative. Pretty neat stuff. We will actually see a simpler version of this theorem in a moment.

Now as we saw units and zero-divisors are disjoint, but not quite opposites of each other. Since we have defined a division ring as a ring where all (non-zero) elements are units, it is natural to define a ring in which the only zero-divisor is zero. This is considered a natural generalization of our favorite ring \mathbb{Z}, hence the name “integral.”

Definition: An integral domain is a commutative ring in which zero is the only zero-divisor.

Note the requirement that the ring is commutative. Often we will simply call it a domain, although most authors allow domains to be noncommutative.

Already we can prove a very nice theorem:

Theorem: Every finite integral domain is a field.

Proof. Integral domains are commutative by definition, and so it suffices to show that every non-zero element has an inverse. Let R be our integral domain in question, and x \in R the element whose inverse we seek. By our discussion of above, right multiplication by x is an injective map R \to R, and since R is finite this map must be a bijection. Hence x must have some y \neq 0 so that yx = 1. And so y is the inverse of x.

\square

We could continue traveling down this road of studying special kinds of rings and their related properties, but we won’t often use these ideas on this blog. We do think the reader should be familiar with the names of these special classes of rings, and we will state the main theorems relating them.

Definition: A nonzero element p \in R is called prime if whenever p divides a product ab it either divides a or b (or both). A unique factorization domain (abbreviated UFD) is an integral domain in which every element can be written uniquely as a product of primes.

Definition:Euclidean domain is a ring in which the division algorithm can be performed. That is, there is a norm function | \cdot | : R \ \left \{ 0 \right \} \to \mathbb{N}, for which every pair a,b \neq 0 can be written as a = bq + r with r satisfying |r| < |b|.

Paolo Aluffi has a wonderful diagram showing the relations among the various special classes of integral domains. This image comes from his book, Algebra: Chapter 0, which is a must-have for the enterprising mathematics student interested in algebra.

rings

In terms of what we have already seen, this diagram says that every field is a Euclidean domain, and in turn every Euclidean domain is a unique factorization domain. These are standard, but non-trivial theorems. We will not prove them here.

The two big areas in this diagram we haven’t yet mentioned on this blog are PIDs and Noetherian domains. The reason for that is because they both require a theory of ideals in rings (perhaps most briefly described as a generalization of the even numbers). We will begin next time with a discussion of ideals, and their important properties in studying rings, but before we finish we want to focus on one main example that will show up later on this blog.

Polynomial Rings

Let us formally define the polynomial ring.

Definition: Let R be a commutative ring. Define the ring R[x], to be the set of all polynomials in x with coefficients in R, where addition and multiplication are the usual addition and multiplication of polynomials. We will often call R[x] the polynomial ring in one variable over R.

We will often replace x by some other letter representing an “indeterminate” variable, such as t, or y, or multiple indexed variables as in the following definition.

Definition: Let R be a commutative ring. The ring R[x_1, x_2, \dots, x_n] is the set of all polynomials in the n variables with the usual addition and multiplication of polynomials.

What can we say about the polynomial ring in one variable R[x]? It’s additive and multiplicative identities are clear: the constant 0 and 1 polynomials, respectively. Other than that, we can’t quite get much more. There are some very bizarre features of polynomial rings with bizarre coefficient rings, such as multiplication decreasing degree.

However, when we impose additional conditions on R, the situation becomes much nicer.

Theorem: If R is a unique factorization domain, then so is R[x].

Proof. As we have yet to discuss ideals, we refer the reader to this proof, and recommend the reader return to it after our next primer.

\square

On the other hand, we will most often be working with polynomial rings over a field. And here the situation is even better:

Theorem: If k is a field, then k[x] is a Euclidean domain.

Proof. The norm function here is precisely the degree of the polynomial (the highest power of a monomial in the polynomial). Then given f,g, the usual algorithm for polynomial division gives a quotient and a remainder q, r so that f = qg + r. In following the steps of the algorithm, one will note that all multiplication and division operations are performed in the field k, and the remainder always has a smaller degree than the quotient. Indeed, one can explicitly describe the algorithm and prove its correctness, and we will do so in full generality in the future of this blog when we discuss computational algebraic geometry.

\square

For multiple variables, things are a bit murkier. For instance, it is not even the case that k[x,y] is a euclidean domain. One of the strongest things we can say originates from this simple observation:

Lemma: R[x,y] is isomorphic to R[x][y].

We haven’t quite yet talked about isomorphisms of rings (we will next time), but the idea is clear: every polynomial in two variables x,y can be thought of as a polynomial in y where the coefficients are polynomials in x (gathered together by factoring out common factors of y^k). Similarly, R[x_1, \dots, x_n] is the same thing as R[x_1, \dots, x_{n-1}][x_n] by induction. This allows us to prove that any polynomial ring is a unique factorization domain:

Theorem: If R is a UFD, so is R[x_1, \dots, x_n].

Proof. R[x] is a UFD as described above. By the lemma, R[x_1, \dots, x_n] = R[x_1, \dots, x_{n-1}][x_n] so by induction R[x_1, \dots, x_{n-1}] is a UFD implies R[x_1, \dots, x_n] is as well.

\square

We’ll be very interested in exactly how to compute useful factorizations of polynomials into primes when we start our series on computational algebraic geometry. Some of the applications include robot motion planning, and automated theorem proving.

Next time we’ll visit the concept of an ideal, see quotient rings, and work toward proving Hilbert’s Nullstellensatz, a fundamental result in algebraic geometry.

Until then!

About these ads

Introducing Categories

It is time for us to formally define what a category is, to see a wealth of examples. In our next post we’ll see how the definitions laid out here translate to programming constructs. As we’ve said in our soft motivational post on categories, the point of category theory is to organize mathematical structures across various disciplines into a unified language. As such, most of this post will be devote to laying down the definition of a category and the associated notation. We will be as clear as possible to avoid a notational barrier for newcomers, so if anything is unclear we will clarify it in the comments.

Definition of a Category

Let’s recall some examples of categories we’ve seen on this blog that serve to motivate the abstract definition of a category. We expect the reader to be comfortable with sets, and to absorb or glaze over the other examples as comfort dictates. The reader who is uncomfortable with sets and functions on sets should stop here. Instead, visit our primers on proof techniques, which doubles as a primer on set theory (or our terser primer on set theory from a two years ago).

The go-to example of a category is that of sets: sets together with functions between sets form a category. We will state exactly what this means momentarily, but first some examples of categories of “sets with structure” and “structure-preserving maps.”

Groups together with group homomorphisms form a category, as do rings and fields with their respective kinds of homomorphisms. Topological spaces together with continuous functions form a category, and metric spaces with distance-nonincreasing maps (“short” functions) form a sub-category. Vector spaces and linear maps, smooth manifolds and smooth maps, and algebraic varieties with rational maps all form categories. We could continue but the essential idea is clear: a category is some way to specify a collection of objects and “structure-preserving” mappings between those objects. There are three main features common to all of these examples:

  1. Composition of structure-preserving maps produces structure-preserving maps.
  2. Composition is associative.
  3. There is an identity map for each object.

The main abstraction is that forgetting what the objects and mappings are and only considering how they behave allows one to deviate from the examples above in useful ways. For instance, once we see the formal definition below, it will become clear that mathematical (say, first-order logical) statements, together with proofs of implication, form a category. Even though a “proof” isn’t strictly a structure-preserving map, it still fits with the roughly stated axioms above. One can compose proofs by laying the implications out one after another, this composition is trivially associative, and there is an identity proof. Thus, proofs provide a way to “transform” true statements into true statements, preserving the structure of boolean-valued truth.

Another example is the category of ML types and computable functions in ML. Computable functions can be quite wild in their behavior, but they are nevertheless composable, associative, and equipped with an identity.

And so the definition of a category seems to come as a natural consequence to think of all of these examples as special cases of one concept.

Before we state the definition we should note that, for abstruse technical reasons, we cannot phrase the definition of a category as a “set” of objects and mappings between objects. This is already impossible for the category of sets, because there is no “set of all sets.” Somehow (as illustrated by Russell’s paradox) there are “too many” sets to do this. Likewise, there is no “set” of all groups, topological spaces, vector spaces, etc.

This apparent difficulty requires some sidestepping. One possibility is to define a universe of non-paradoxical sets, and define categories by way of a universe. Another is to define a class, which bypasses set theory in another way. We won’t deliberate on the differences between these methods of avoiding Russell’s paradox. The reader need only know that it can be done. For our official definition, we will use the terminology of classes.

Definition: A category \textbf{C} consists of the following data:

  • A class of objects, denoted \textup{Obj}(\mathbf{C}).
  • For each pair of objects A,B, a set \textup{Hom}(A,B) of morphisms. Sometimes these are called hom-sets.

The morphisms satisfy the following conditions:

  • For all objects A,B,C and morphisms f \in \textup{Hom}(A,B), g \in \textup{Hom}(B,C) there is a composition operation \circ and g \circ f \in \textup{Hom}(A,C) is a morphism. We will henceforth drop the \circ and write gf.
  • Composition is associative.
  • For all objects A, there is an identity morphism 1_A \in \textup{Hom}(A,A). For all A,B, and f \in \textup{Hom}(A,B), we have f 1_A = f and 1_B f = f.

Some additional notation and terminology: we denote a morphism f \in \textup{Hom}(A,B) in three ways. Aside from “as an element of a set,” the most general way is as a diagram,

\displaystyle A \xrightarrow{\hspace{.5cm} f \hspace{.5cm}} B

Although we will often shorten a single morphism to the standard function notation f: A \to B. Given a morphism f: A \to B, we call A the source of f, and B the target of f.

We will often name our categories, and we will do so in bold. So the category of sets with set-functions is denoted \mathbf{Set}. When working with multiple categories, we will give subscripts on the hom-sets to avoid ambiguity, as in \textup{Hom}_{\mathbf{Set}}(A,B). The set of morphisms from an object to itself is called an endomorphism, and the set of all endomorphisms of an object A is denoted \textup{End}_{\mathbf{C}}(A).

Note that in the definition above we require that morphisms form a set. This is important because hom-sets will have additional algebraic structure (e.g., for certain categories \textup{End}_{\mathbf{C}}(A) will form a ring).

Examples of Categories

Lets now formally define some of our simple examples of categories. Defining a category amounts to specifying the objects and morphisms, and verifying the conditions in the definition hold.

Sets

Define the category \mathbf{Set} whose objects are all sets, and whose morphisms are functions on sets. By now it should hopefully be clear that sets form a category, but let us go through the motions explicitly. Every set A has an identity function 1_A(x) = x, and as we already know, functions on sets compose associatively. To verify this in complete detail would amount to writing down a general function as a relation, and using the definitions from elementary set theory. We have more pressing matters, but a reader who has not seen this before should consult our set theory primer. One can also define the category of finite sets \mathbf{FinSet} whose objects are finite sets and whose morphisms are again set-functions. As every object and morphism of \mathbf{FinSet} is one of \mathbf{Set}, we call the latter a subcategory of the former.

Finite categories

The most trivial possible categories are those with only finitely many objects. For instance, define the category \mathbf{1} to have a single object \ast and a single morphism 1_{\ast}, which must of course be the identity. The composition 1_{\ast} 1_{\ast} is forced to be 1_{\ast}, and so this is a (rather useless) category. One can also imagine a category \mathbf{2} which has one non-identity morphism A \to B as well as examples of categories with any number of finite objects. Nothing interesting is going on here; we just completely specify the structure of the category object by object and morphism by morphism. Sometimes they come in handy, though.

Poset Categories

Here is an elementary example of a category that is nonetheless fundamental to modern discussions in topology and algebraic geometry. It will show up again in our work with persistent homology. Let X be any set, and define the category \mathbf{X}_\subset as follows. The objects of \mathbf{X}_\subset are all the subsets of X. If A \subset B are subsets of X, then the set \textup{Hom}(A,B) is defined to be the (unique, unnamed) singleton A \to B. Otherwise, \textup{Hom}(A,B) is the empty set. Identities exist since every set is a subset of itself. The property of being a subset is transitive, so composition of morphisms makes sense and associativity is trivial since there is at most one morphism between any two objects. If the reader doesn’t believe this, we state what composition is rigorously: define each morphism as a pair (A,B) and define the composition operation as a set-function \textup{Hom}(A,B) \times \textup{Hom}(B,C) \to \textup{Hom}(A,C), which sends ((A,B), (B,C)) \mapsto (A,C). Because this operation is so trivial, it seems more appropriate to state it with a diagram:

first-diagramWe say this diagram commutes if all ways to compose morphisms (travel along arrows) have equal results. That is, in the diagram above, we assert that the morphism A \to B composed with the morphism B \to C is exactly the one morphism A \to C. Usually one must prove that a diagram commutes, but in this case we are defining the composition operation so that commutativity holds. The reader can now directly verify that composition is associative:

(a,b)((b,c)(c,d)) = (a,b)(b,d) = (a,d) = (a,c)(c,d) = ((a,b)(b,c))(c,d)

More generally, it is not hard to see how any transitive reflexive relation on a set (including partial orders) can be used to form a category: objects are elements of the set, and morphisms are unique arrows which exist when the objects are (asymmetrically) related. The subset category above is a special case where the set in question is the power set of X, and the relation is \subset. The familiar reader should note that the most prominent example of this in higher mathematics is to have X be the topology of a topological space (the set of open subsets).

Groups

Next, define the category \mathbf{Grp} whose objects are groups and whose morphisms are group homomorphisms. Recall briefly that a group is a set G endowed with a sensible (associative) binary operation denoted by multiplication, which has an identity and with respect to which every element has an inverse. For the uninitiated reader, just replace any abstract “group” by the set of all nonzero rational numbers with usual multiplication. It will suffice for this example.

A group homomorphism is a set-function f: A \to B which satisfies f(xy) = f(x)f(y) (here the binary operations on the left and right side of the equal sign are the operations in the two respective groups). Being set-functions, group homomorphisms are composable as functions and associatively so. Given any homomorphism g: B \to C, we need to show that the composite is again a homomorphism:

\displaystyle gf(xy) = g(f(xy)) = g(f(x)f(y)) = g(f(x))g(f(y)) = gf(x) gf(y)

Note that there are three multiplication operations floating around in this equation. Groups (as all sets) have identity maps, and the identity map is a perfectly good group homomorphism. This verifies that \mathbf{Grp} is indeed a category. While we could have stated all of these equalities via commutative diagrams, the pictures are quite large and messy so we will avoid them until later.

A similar derivation will prove that rings form a category, as do vector spaces, topological spaces, and fields. We are unlikely to use these categories in great detail in this series, so we refrain from giving them names for now.

One special example of a category of groups is the category \mathbf{Ab} of abelian groups (for which the multiplication operation is commutative). This category shows up as the prototypical example of a so-called “abelian category,” which means it has enough structure to do homology.

Graphs

In a more discrete domain, define by \mathbf{Graph} as follows. The objects in this category are triples (V,E, \varphi) where \varphi: E \to V \times V represents edge adjacency. We will usually supress \varphi by saying vertices v,w are adjacent instead of \varphi(e) = (v,w). The morphisms in this category are graph homomorphisms. That is, if G = (V,E,\varphi) and G' = (V', E',\varphi'), then f \in \textup{Hom}_{\mathbf{Graph}}(G,G') is a pair of set functions on f_V: V \to V', f_E: E \to E', satisfying the following commutative diagram. Here we denote by (f,f) the map which sends (u,v) \to (f(u), f(v)).

graph-hom-diagram

This diagram is quite a mouthful, but in words it requires that whenever v,w are adjacent in G, then f(v), f(w) are adjacent in G'. Rewriting the diagram as an equation more explicitly, we are saying that if e \in E is an edge with \varphi(e) = (v,w), then it must be the case that (f_V(v), f_V(w)) = \varphi'(f_E(e)). This is how one “preserves” the structure of a graph.

To prove this is a category, we can observe that composition makes sense: given two pairs

\displaystyle f_V:V \to V'
\displaystyle f_E: E \to E'

and

\displaystyle g_{V'}: V' \to V''
\displaystyle g_{E'}: E' \to E''

we can compose each morphism individually, getting gf_V: V \to V'' and gf_E: E \to E'', and the following hefty-looking commutative diagram.

graph-commutative

Let’s verify commutativity together: we already know the two squares on the left and right commute (by hypothesis, they are morphisms in this category). So we have two things left to check, that the three ways to get from E to V'' \times V'' are all the same. This amounts to verifying the two equalities

\displaystyle (g_{V'}, g_{V'}) (f_V, f_V) \varphi = (g_{V'}, g_{V'}) \varphi' f_E = \varphi'' g_{E'} f_E

But indeed, going left to right in the above equation, each transition from one expression to another only swaps two morphisms within one of the commutative squares. In other words, the first equality is already enforced by the commutativity of the left-hand square, and the second by the right. We are literally only substituting what we already know to be equal!

If it feels like we didn’t actually do any work there (aside from unravelling exactly what the diagrams mean), then you’re already starting to see the benefits of category theory. It can often feel like a cycle: commutative diagrams make it easy to argue about other commutative diagrams, and one can easily get lost in the wilderness of arrows. But more often than not, devoting a relatively small amount of time up front to show one diagram commutes will make a wealth of facts and theorems follow with no extra work. This is part of the reason category theory is affectionately referred to as “abstract nonsense.”

Diagram Categories

Speaking of abstract nonsense: next up is a very abstract example, but thinking about it will reinforce the ideas we’ve put forth in this post while giving a sneak peek to our treatment of universal properties. Fix an object A in an arbitrary category \mathbf{C}. Define the category \mathbf{C}_A whose objects are morphisms with source A. That is, an object of \mathbf{C}_A looks like

vert-arrowWhere B ranges over all objects of \mathbf{C}. The morphisms of \mathbf{C}_A are commutative diagrams of the form

diagram-morphism

where as we said earlier the stipulation asserted by the diagram is that f' = gf. Let’s verify that the axioms for a category hold. Suppose we have two such commutative diagrams with matching target and source, say,

composing-diagram-morphisms

Note that the arrows g: A \to C must match up in both diagrams, or else composition does not make sense! Then we can combine them into a single commutative diagram:

composed-diagram-morphism

If it is not obvious that this diagram commutes, we need only write down the argument explicitly: by the first diagram \beta f = g and \gamma g = h, and piecing these together we have \gamma \beta f = \gamma g = h. Associativity of this “piecing together” follows from the associativity of the morphisms in \mathbf{C}. The identity morphism is a diagram whose two “legs” are both f: A \to B and whose connecting morphism is just 1_B.

This kind of “diagram” category is immensely important, and we will revisit it and many variants of it in the future. A quick sneak peek: this category is closely related to the universal properties of polynomial rings, free objects, and limits.

As a slight generalization, you can define a category whose objects consist of pairs of morphisms with a shared source (or target), i.e.,

pairs-category

We challenge the reader to write down what a morphism of this category would look like, and prove the axioms for a category hold (it’s not hard, but a bit messy). We will revisit categories like this one in our post on universal properties; this particular one is intimately related to products.

Next Time

Next time we’ll jump straight into some code, and realize the definition of a category as a type in ML. We’ll see how some of our examples of categories here can be implemented using the code, and inspect the pro’s and con’s of the computational version of our definition.

Probabilistic Bounds — A Primer

Probabilistic arguments are a key tool for the analysis of algorithms in machine learning theory and probability theory. They also assume a prominent role in the analysis of randomized and streaming algorithms, where one imposes a restriction on the amount of storage space an algorithm is allowed to use for its computations (usually sublinear in the size of the input).

While a whole host of probabilistic arguments are used, one theorem in particular (or family of theorems) is ubiquitous: the Chernoff bound. In its simplest form, the Chernoff bound gives an exponential bound on the deviation of sums of random variables from their expected value.

This is perhaps most important to algorithm analysis in the following mindset. Say we have a program whose output is a random variable X. Moreover suppose that the expected value of X is the correct output of the algorithm. Then we can run the algorithm multiple times and take a median (or some sort of average) across all runs. The probability that the algorithm gives a wildly incorrect answer is the probability that more than half of the runs give values which are wildly far from their expected value. Chernoff’s bound ensures this will happen with small probability.

So this post is dedicated to presenting the main versions of the Chernoff bound that are used in learning theory and randomized algorithms. Unfortunately the proof of the Chernoff bound in its full glory is beyond the scope of this blog. However, we will give short proofs of weaker, simpler bounds as a straightforward application of this blog’s previous work laying down the theory.

If the reader has not yet intuited it, this post will rely heavily on the mathematical formalisms of probability theory. We will assume our reader is familiar with the material from our first probability theory primer, and it certainly wouldn’t hurt to have read our conditional probability theory primer, though we won’t use conditional probability directly. We will refrain from using measure-theoretic probability theory entirely (some day my colleagues in analysis will like me, but not today).

Two Easy Bounds of Markov and Chebyshev

The first bound we’ll investigate is almost trivial in nature, but comes in handy. Suppose we have a random variable X which is non-negative (as a function). Markov’s inequality is the statement that, for any constant a > 0,

\displaystyle \textup{P}(X \geq a) \leq \frac{\textup{E}(X)}{a}

In words, the probability that X grows larger than some fixed constant is bounded by a quantity that is inversely proportional to the constant.

The proof is quite simple. Let \chi_a be the indicator random variable for the event that X \geq a (\chi_a = 1 when X \geq a and zero otherwise). As with all indicator random variables, the expected value of \chi_a is the probability that the event happens (if this is mysterious, use the definition of expected value). So \textup{E}(\chi_a) = \textup{P}(X \geq a), and linearity of expectation allows us to include a factor of a:

\textup{E}(a \chi_a) = a \textup{P}(X \geq a)

The rest of the proof is simply the observation that \textup{E}(a \chi_a) \leq \textup{E}(X). Indeed, as random variables we have the inequality a \chi_a \leq X. Whenever a < X, the former is equal to zero while the latter is nonnegative. And whenever a \geq X, the former is precisely a while the latter is by assumption at least a. It follows that \textup{E}(a \chi_a) \leq \textup{E}(X).

This last point is a simple property of expectation we omitted from our first primer. It usually goes by monotonicity of expectation, and we prove it here. First, if X \geq 0 then \textup{E}(X) \geq 0 (this is trivial). Second, if 0 \leq X \leq Y, then define a new random variable Z = Y-X. Since Z \geq 0 and using linearity of expectation, it must be that \textup{E}(Z) = \textup{E}(Y) - \textup{E}(X) \geq 0. Hence \textup{E}(X) \geq \textup{E}(Y). Note that we do require that X has a finite expected value for this argument to work, but if this is not the case then Markov’s inequality is nonsensical anyway.

Markov’s inequality by itself is not particularly impressive or useful. For example, if X is the number of heads in a hundred coin flips, Markov’s inequality ensures us that the probability of getting at least 99 heads is at most 50/99, which is about 1/2. Shocking. We know that the true probability is much closer to 2^{-100}, so Markov’s inequality is a bust.

However, it does give us a more useful bound as a corollary. This bound is known as Chebyshev’s inequality, and its use is sometimes referred to as the second moment method because it gives a bound based on the variance of a random variable (instead of the expected value, the “first moment”).

The statement is as follows.

Chebyshev’s Inequality: Let X be a random variable with finite expected value and positive variance. Then we can bound the probability that X deviates from its expected value by a quantity that is proportional to the variance of X. In particular, for any \lambda > 0,

\displaystyle \textup{P}(|X - \textup{E}(X)| \geq \lambda) \leq \frac{\textup{Var}(X)}{\lambda^2}

And without any additional assumptions on X, this bound is sharp.

Proof. The proof is a simple application of Markov’s inequality. Let Y = (X - \textup{E}(X))^2, so that \textup{E}(Y) = \textup{Var}(X). Then by Markov’s inequality

\textup{P}(Y \geq \lambda^2) \leq \frac{\textup{E}(Y)}{\lambda^2}

Since Y is nonnegative |X - \textup{E}(X)| = \sqrt(Y), and \textup{P}(Y \geq \lambda^2) = \textup{P}(|X - \textup{E}(X)| \geq \lambda). The theorem is proved. \square

Chebyshev’s inequality shows up in so many different places (and usually in rather dry, technical bits), that it’s difficult to give a good example application.  Here is one that shows up somewhat often.

Say X is a nonnegative integer-valued random variable, and we want to argue about when X = 0 versus when X > 0, given that we know \textup{E}(X). No matter how large \textup{E}(X) is, it can still be possible that \textup{P}(X = 0) is arbitrarily close to 1. As a colorful example, let X is the number of alien lifeforms discovered in the next ten years. We might debate that \textup{E}(X) can arbitrarily large: if some unexpected scientific and technological breakthroughs occur tomorrow, we could discover an unbounded number of lifeforms. On the other hand, we are very likely not to discover any, and probability theory allows for such a random variable to exist.

If we know everything about \textup{Var}(X), however, we can get more informed bounds.

Theorem: If \textup{E}(X) \neq 0, then \displaystyle \textup{P}(X = 0) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}.

Proof. Simply choose \lambda = \textup{E}(X) and apply Chebyshev’s inequality.

\displaystyle \textup{P}(X = 0) \leq \textup{P}(|X - \textup{E}(X)| \geq \textup{E}(X)) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}

The first inequality follows from the fact that the only time X can ever be zero is when |X - \textup{E}(X)| = \textup{E}(X), and X=0 only accounts for one such possibility. \square

This theorem says more. If we know that \textup{Var}(X) is significantly smaller than \textup{E}(X)^2, then X > 0 is more certain to occur. More precisely, and more computationally minded, suppose we have a sequence of random variables X_n so that \textup{E}(X_n) \to \infty as n \to \infty. Then the theorem says that if \textup{Var}(X_n) = o(\textup{E}(X_n)^2), then \textup{P}(X_n > 0) \to 1. Remembering one of our very early primers on asymptotic notation, f = o(g) means that f grows asymptotically slower than g, and in terms of this fraction \textup{Var}(X) / \textup{E}(X)^2, this means that the denominator dominates the fraction so that the whole thing tends to zero.

The Chernoff Bound

The Chernoff bound takes advantage of an additional hypothesis: our random variable is a sum of independent coin flips. We can use this to get exponential bounds on the deviation of the sum. More rigorously,

Theorem: Let X_1 , \dots, X_n be independent random \left \{ 0,1 \right \}-valued variables, and let X = \sum X_i. Suppose that \mu = \textup{E}(X). Then the probability that X deviates from \mu by more than a factor of \lambda > 0 is bounded from above:

\displaystyle \textup{P}(X > (1+\lambda)\mu) \leq \frac{e^{\lambda \mu}}{(1+\lambda)^{(1+\lambda)\mu}}

The proof is beyond the scope of this post, but we point the interested reader to these lecture notes.

We can apply the Chernoff bound in an easy example. Say all X_i are fair coin flips, and we’re interested in the probability of getting more than 3/4 of the coins heads. Here \mu = n/2 and \lambda = 1/2, so the probability is bounded from above by

\displaystyle \left ( \frac{e}{(3/2)^3} \right )^{n/4} \approx \frac{1}{5^n}

So as the number of coin flips grows, the probability of seeing such an occurrence diminishes extremely quickly to zero. This is important because if we want to test to see if, say, the coins are biased toward flipping heads, we can simply run an experiment with n sufficiently large. If we observe that more than 3/4 of the flips give heads, then we proclaim the coins are biased and we can be assured we are correct with high probability. Of course, after seeing 3/4 of more heads we’d be really confident that the coin is biased. A more realistic approach is to define some \varepsilon that is small enough so as to say, “if some event occurs whose probability is smaller than \varepsilon, then I call shenanigans.” Then decide how many coins and what bound one would need to make the bad event have probability approximately \varepsilon. Finding this balance is one of the more difficult aspects of probabilistic algorithms, and as we’ll see later all of these quantities are left as variables and the correct values are discovered in the course of the proof.

Chernoff-Hoeffding Inequality

The Hoeffding inequality (named after the Finnish statistician, Wassily Høffding) is a variant of the Chernoff bound, but often the bounds are collectively known as Chernoff-Hoeffding inequalities. The form that Hoeffding is known for can be thought of as a simplification and a slight generalization of Chernoff’s bound above.

Theorem: Let X_1, \dots, X_n be independent random variables whose values are within some range [a,b]. Call \mu_i = \textup{E}(X_i), X = \sum_i X_i, and \mu \textup{E}(X) = \sum_i \mu_i. Then for all t > 0,

\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{(-2t^2)/(b-a)^2}

For example, if we are interested in the sum of n rolls of a fair six-sided die, then the probability that we deviate from (7/2)n by more than \sqrt{n} is bounded by 2e^{(-2n / 25)}. Supposing we want to know how many rolls we need to guarantee with probability 0.01 that we don’t deviate too much, we just do the algebra:

2e^{(-2n / 25)} < 0.01
-2n / 25 < \log (0.005)
n > \log(0.005^{(-25/2)}) \approx 66

So with 67 rolls we can be confident that the sum of the rolls will lie between 222 and 240.

Another version of this theorem concerns the average of the X_i, and is only a minor modification of the above.

Theorem: If X_1, \dots, X_n are as above, and X = \frac{1}{n} \sum_i X_i, with \mu = \frac{1}{n}(\sum_i \mu_i), then for all t > 0, we get the following bound

\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{(-2nt^2)/(b-a)^2}

The only difference here is the extra factor of n in the exponent. So the deviation is exponential both in the amount of deviation (t^2), and in the number of trials.

This theorem comes up very often in learning theory, in particular to prove Boosting works. Mathematicians will joke about how all theorems in learning theory are just applications of Chernoff-Hoeffding-type bounds. We’ll of course be seeing it again as we investigate boosting and the PAC-learning model in future posts, so we’ll see the theorems applied to their fullest extent then.

Until next time!

Computing Homology

In our last post in this series on topology, we defined the homology group. Specifically, we built up a topological space as a simplicial complex (a mess of triangles glued together), we defined an algebraic way to represent collections of simplices called chains as vectors in a vector space, we defined the boundary homomorphism as a linear map on chains, and finally defined the homology groups as the quotient vector spaces

\displaystyle H_k(X) = \frac{\textup{ker} \partial_k}{\textup{im} \partial_{k+1}}.

The number of holes in X was just the dimension of this quotient space.

In this post we will be quite a bit more explicit. Because the chain groups are vector spaces and the boundary mappings are linear maps, they can be represented as matrices whose dimensions depend on our simplicial complex structure. Better yet, if we have explicit representations of our chains by way of a basis, then we can use row-reduction techniques to write the matrix in a standard form.

Of course the problem arises when we want to work with two matrices simultaneously (to compute the kernel-mod-image quotient above). This is not computationally any more difficult, but it requires some theoretical fiddling. We will need to dip a bit deeper into our linear algebra toolboxes to see how it works, so the rusty reader should brush up on their linear algebra before continuing (or at least take some time to sort things out if or when confusion strikes).

Without further ado, let’s do an extended example and work our ways toward a general algorithm. As usual, all of the code used for this post is available on this blog’s Google Code page.

Two Big Matrices

Recall our example simplicial complex from last time.

circle-wedge-sphere

We will compute H_1 of this simplex (which we saw last time was \mathbb{Q}) in a more algorithmic way than we did last time.

Once again, we label the vertices 0-4 so that the extra “arm” has vertex 4 in the middle, and its two endpoints are 0 and 2. This gave us orientations on all of the simplices, and the following chain groups. Since the vertex labels (and ordering) are part of the data of a simplicial complex, we have made no choices in writing these down.

\displaystyle C_0(X) = \textup{span} \left \{ 0,1,2,3,4 \right \}

\displaystyle C_1(X) = \textup{span} \left \{ [0,1], [0,2], [0,3], [0,4], [1,2], [1,3],[2,3],[2,4] \right \}

\displaystyle C_2(X) = \textup{span} \left \{ [0,1,2], [0,1,3], [0,2,3], [1,2,3] \right \}

Now given our known definitions of \partial_k as an alternating sum from last time, we can give a complete specification of the boundary map as a matrix. For \partial_1, this would be

\displaystyle \partial_1 = \bordermatrix{  & [0,1] & [0,2] & [0,3] & [0,4] & [1,2] & [1,3] & [2,3] & [2,4] \cr  0 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0\cr  1 & 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\cr  2 & 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1 \cr  3 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \cr  4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 },

where the row labels are the basis for C_0(X) and the column labels are the basis for C_1(X). Similarly, \partial_2 is

\displaystyle \partial_2 = \bordermatrix{  & [0,1,2] & [0,1,3] & [0,2,3] & [1,2,3] \cr  [0,1] & 1 & 1 & 0 & 0\cr  [0,2] & -1 & 0 & 1 & 0\cr  [0,3] & 0 & -1 & -1 & 0\cr  [0,4] & 0 & 0 & 0 & 0\cr  [1,2] & 1 & 0 & 0 & 1\cr  [1,3] & 0 & 1 & 0 & -1\cr  [2,3] & 0 & 0 & 1 & 1\cr  [2,4] & 0 & 0 & 0 & 0}

The reader is encouraged to check that these matrices are written correctly by referring to the formula for \partial as given last time.

Remember the crucial property of \partial, that \partial^2 = \partial_k \partial_{k+1} = 0. Indeed, the composition of the two boundary maps just corresponds to the matrix product of the two matrices, and one can verify by hand that the above two matrices multiply to the zero matrix.

We know from basic linear algebra how to compute the kernel of a linear map expressed as a matrix: column reduce and inspect the columns of zeros. Since the process of row reducing is really a change of basis, we can encapsulate the reduction inside a single invertible matrix A, which, when left-multiplied by \partial, gives us the reduced form of the latter. So write the reduced form of \partial_1 as \partial_1 A.

However, now we’re using two different sets of bases for the shared vector space involved in \partial_1 and \partial_2. In general, it will no longer be the case that \partial_kA\partial_{k+1} = 0. The way to alleviate this is to perform the “corresponding” change of basis in \partial_{k+1}. To make this idea more transparent, we return to the basics.

Changing Two Matrices Simultaneously

Recall that a matrix M represents a linear map between two vector spaces f : V \to W. The actual entries of M depend crucially on the choice of a basis for the domain and codomain. Indeed, if v_i form a basis for V and w_j for W, then the k-th column of the matrix representation M is defined to be the coefficients of the representation of f(v_k) in terms of the w_j. We hope to have nailed this concept down firmly in our first linear algebra primer.

Recall further that row operations correspond to changing a basis for the codomain, and column operations correspond to changing a basis for the domain. For example, the idea of swapping columns i,j in M gives a new matrix which is the representation of f with respect to the (ordered) basis for V which swaps the order of v_i , v_j. Similar things happen for all column operations (they all correspond to manipulations of the basis for V), while analogously row operations implicitly transform the basis for the codomain. Note, though, that the connection between row operations and transformations of the basis for the codomain are slightly more complicated than they are for the column operations. We will explicitly see how it works later in the post.

And so if we’re working with two maps A: U \to V and B: V \to W, and we change a basis for V in B via column reductions, then in order to be consistent, we need to change the basis for V in A via “complementary” row reductions. That is, if we call the change of basis matrix Q, then we’re implicitly sticking Q in between the composition BA to get (BQ)A. This is not the same map as BA, but we can make it the same map by adding a Q^{-1} in the right place:

\displaystyle BA = B(QQ^{-1})A = (BQ)(Q^{-1}A)

Indeed, whenever Q is a change of basis matrix so is Q^{-1} (trivially), and moreover the operations that Q performs on the columns of B are precisely the operations that Q^{-1} performs on the rows of A (this is because elementary row operations take different forms when multiplied on the left or right).

Coming back to our boundary operators, we want a canonical way to view the image of \partial_{k+1} as sitting inside the kernel of \partial_k. If we go ahead and use column reductions to transform \partial_k into a form where the kernel is easy to read off (as the columns consisting entirely of zeroes), then the corresponding row operations, when performed on \partial_{k+1} will tell us exactly the image of \partial_{k+1} inside the kernel of \partial_k.

This last point is true precisely because \textup{im} \partial_{k+1} \subset \textup{ker} \partial_k. This fact guarantees that the irrelevant rows of the reduced version of \partial_{k+1} are all zero.

Let’s go ahead and see this in action on our two big matrices above. For \partial_1, the column reduction matrix is

\displaystyle A =  \begin{pmatrix}  0 & 1 & 0 & 0 & 1 & 1 & 0 & 0\\  0 & 0 & 1 & 0 & -1 & 0 & 1 & 1\\  0 & 0 & 0 & 1 & 0 & -1 & -1 & 0\\  -1 & -1 & -1 & -1 & 0 & 0 & 0 & -1\\  0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\  0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\  0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1  \end{pmatrix}

And the product \partial_1 A is

\displaystyle \partial_1 A =  \begin{pmatrix}  1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\  0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\  0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\  0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\  -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0  \end{pmatrix}

Now the inverse of A, which is the corresponding basis change for \partial_2, is

\displaystyle A^{-1} =  \begin{pmatrix}  -1 & -1 & -1 & -1 & -0 & -0 & -0 & -0\\  1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\\  0 & 1 & 0 & 0 & 1 & 0 & -1 & -1\\  0 & 0 & 1 & 0 & 0 & 1 & 1 & 0\\  0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\  0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\  0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1  \end{pmatrix}

and the corresponding reduced form of \partial_2 is

\displaystyle A^{-1} \partial_2 =  \begin{pmatrix}  0 & 0 & 0 & 0\\  0 & 0 & 0 & 0\\  0 & 0 & 0 & 0\\  0 & 0 & 0 & 0\\  1 & 0 & 0 & 1\\  0 & 1 & 0 & -1\\  0 & 0 & 1 & 1\\  0 & 0 & 0 & 0  \end{pmatrix}

As a side note, we got these matrices by slightly modifying the code from our original post on row reduction to output the change of basis matrix in addition to performing row reduction. It turns out one can implement column reduction as row reduction of the transpose, and the change of basis matrix you get from this process will be the transpose of the change of basis matrix you want (by (AB)^\textup{T} = (B^\textup{T}A^\textup{T})). Though the code is particularly ad-hoc, we include it with the rest of the code used in this post on this blog’s Google Code page.

Now let’s inspect the two matrices \partial_1 A and A^{-1} \partial_2 more closely. The former has four “pivots” left over, and this corresponds to the rank of the matrix being 4. Moreover, the four basis vectors representing the columns with nonzero pivots, which we’ll call v_1, v_2, v_3, v_4 (we don’t care what their values are), span a complementary subspace to the kernel of \partial_1. Hence, the remaining four vectors (which we’ll call v_5, v_6, v_7, v_8) span the kernel. In particular, this says that the kernel has dimension 4.

On the other hand, we performed the same transformation of the basis of C_1(X) for \partial_2. Looking at the matrix that resulted, we see that the first four rows and the last row (representing v_1, v_2, v_3, v_4, v_8) are entirely zeros and so the image of \partial_2 intersects their span trivially. and the remaining three rows (representing v_5, v_6, v_7) have nonzero pivots. This tells us exactly that the image of \partial_2 is spanned by v_5, v_6, v_7.

And now, the coup de grâce, the quotient to get homology is simply

\displaystyle \frac{ \textup{span} \left \{ v_5, v_6, v_7, v_8 \right \}}{ \textup{span} \left \{ v_5, v_6, v_7 \right \}} = \textup{span} \left \{ v_8 \right \}

And the dimension of the homology group is 1, as desired.

The General Algorithm

It is no coincidence that things worked out at nicely as they did. The process we took of simultaneously rewriting two matrices with respect to a common basis is the bulk of the algorithm to compute homology. Since we’re really only interested in the dimensions of the homology groups, we just need to count pivots. If the number of pivots arising in \partial_k is y and the number of pivots arising in \partial_{k+1} is z, and the dimension of C_k(X) is n, then the dimension is exactly

(n-y) - z = \textup{dim}(\textup{ker} \partial_k) - \textup{dim}(\textup{im}\partial_{k+1})

And it is no coincidence that the pivots lined up so nicely to allow us to count dimensions this way. It is a minor exercise to prove it formally, but the fact that the composition \partial_k \partial_{k+1} = 0 implies that the reduced version of \partial_{k+1} will have an almost reduced row-echelon form (the only difference being the rows of zeros interspersed above, below, and possibly between pivot rows).

As the reader may have guessed at this point, we don’t actually need to compute A and A^{-1}. Instead of this, we can perform the column/row reductions simultaneously on the two matrices. The above analysis helped us prove the algorithm works, and with that guarantee we can throw out the analytical baggage and just compute the damn thing.

Indeed, assuming the input is already processed as two matrices representing the boundary operators with respect to the standard bases of the chain groups, computing homology is only slightly more difficult than row reducing in the first place. Putting our homology where our mouth is, we’ve implemented the algorithm in Python. As usual, the entire code used in this post is available on this blog’s Google Code page.

The first step is writing auxiliary functions to do elementary row and column operations on matrices. For this post, we will do everything in numpy (which makes the syntax shorter than standard Python syntax, but dependent on the numpy library).

import numpy

def rowSwap(A, i, j):
   temp = numpy.copy(A[i, :])
   A[i, :] = A[j, :]
   A[j, :] = temp

def colSwap(A, i, j):
   temp = numpy.copy(A[:, i])
   A[:, i] = A[:, j]
   A[:, j] = temp

def scaleCol(A, i, c):
   A[:, i] *= c*numpy.ones(A.shape[0])

def scaleRow(A, i, c):
   A[i, :] *= c*numpy.ones(A.shape[1])

def colCombine(A, addTo, scaleCol, scaleAmt):
   A[:, addTo] += scaleAmt * A[:, scaleCol]

def rowCombine(A, addTo, scaleRow, scaleAmt):
   A[addTo, :] += scaleAmt * A[scaleRow, :]

From here, the main meat of the algorithm is doing column reduction on one matrix, and applying the corresponding row operations on the other.

def simultaneousReduce(A, B):
   if A.shape[1] != B.shape[0]:
      raise Exception("Matrices have the wrong shape.")

   numRows, numCols = A.shape # col reduce A

   i,j = 0,0
   while True:
      if i >= numRows or j >= numCols:
         break

      if A[i][j] == 0:
         nonzeroCol = j
         while nonzeroCol < numCols and A[i,nonzeroCol] == 0:
            nonzeroCol += 1

         if nonzeroCol == numCols:
            j += 1
            continue

         colSwap(A, j, nonzeroCol)
         rowSwap(B, j, nonzeroCol)

      pivot = A[i,j]
      scaleCol(A, j, 1.0 / pivot)
      scaleRow(B, j, 1.0 / pivot)

      for otherCol in range(0, numCols):
         if otherCol == j:
            continue
         if A[i, otherCol] != 0:
            scaleAmt = -A[i, otherCol]
            colCombine(A, otherCol, j, scaleAmt)
            rowCombine(B, j, otherCol, -scaleAmt)

      i += 1; j+= 1

   return A,B

This more or less parallels the standard algorithm for row-reduction (with the caveat that all the indices are swapped to do column-reduction). The only somewhat confusing line is the call to rowCombine, which explicitly realizes the corresponding row operation as the inverse of the performed column operation. Note that for row operations, the correspondence between operations on the basis and operations on the rows is not as direct as it is for columns. What’s given above is the true correspondence. Writing down lots of examples will reveal why, and we leave that as an exercise to the reader.

Then the actual algorithm to compute homology is just a matter of counting pivots. Here are two pivot-counting functions in a typical numpy fashion:

def numPivotCols(A):
   z = numpy.zeros(A.shape[0])
   return [numpy.all(A[:, j] == z) for j in range(A.shape[1])].count(False)

def numPivotRows(A):
   z = numpy.zeros(A.shape[1])
   return [numpy.all(A[i, :] == z) for i in range(A.shape[0])].count(False)

And the final function is just:

def bettiNumber(d_k, d_kplus1):
   A, B = numpy.copy(d_k), numpy.copy(d_kplus1)
   simultaneousReduce(A, B)

   dimKChains = A.shape[1]
   kernelDim = dimKChains - numPivotCols(A)
   imageDim = numPivotRows(B)

   return kernelDim - imageDim

And there we have it! We’ve finally tackled the beast, and written a program to compute algebraic features of a topological space.

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex using by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis.

Until then!