# Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

## The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, $X, Y$. Here, if we want to be completely formal, $X,Y$ are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say $X$, to be the independent variable and the other, say $Y$, to be the dependent variable. All that means in is that we are assuming there is a relationship between $X$ and $Y$, and that we intend to use the value of $X$ to predict the value of $Y$. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, $X$ could be age, and $Y$ could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in $X$ corresponds to a proportional change in $Y$, and so one could hope there exist constants $a,b$ so that (as random variables) $Y = aX + b$.  If this were the case then we could just draw many pairs of sample values of $X$ and $Y$, and try to estimate the value of $a$ and $b$.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants $a, b$, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function $\hat{f}$. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point $(x,y)$ where $x$ is a value of $X$ and $y$ of $Y$, then the error of our predictor on this example is $|y - \hat{f}(x)|$. Geometrically this is the vertical distance from the actual $y$ value to our prediction for the same $x$, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, $|x| = \sqrt(x^2)$, and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as $(y - \hat{f}(x))^2$, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points $(x_i, y_i)_{i=1}^n$ and a potential prediction line $\hat{f}(x) = ax + b$, we define the error of $\hat{f}$ on the examples to be

$\displaystyle S(a,b) = \sum_{i=1}^n (y_i - \hat{f}(x_i))^2$

Which can also be written as

$\displaystyle S(a,b) = \sum_{i=1}^n (y_i - ax_i - b)^2$

Note that since we’re fixing our data sample, the function $S$ is purely a function of the variables $a,b$. Now we want to minimize this quantity with respect to $a,b$, so we can take a gradient,

$\displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i - ax_i - b) x_i$

$\displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i - b)$

and set them simultaneously equal to zero. In the first we solve for $b$:

$\displaystyle 0 = -2 \sum_{i=1}^n y_i - ax_i - b = -2 \left ( nb + \sum_{i=1}^n y_i - ax_i \right )$

$\displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i - ax_i$

If we denote by $x_{\textup{avg}} = \frac{1}{n} \sum_i x_i$ this is just $b = y_{\textup{avg}} - ax_{\textup{avg}}$. Substituting $b$ into the other equation we get

$\displaystyle -2 \sum_{i=1}^n (y_ix_i - ax_i^2 - y_{\textup{avg}}x_i - ax_{\textup{avg}}x_i ) = 0$

Which, by factoring out $a$, further simplifies to

$\displaystyle 0 = \sum_{i=1}^n y_ix_i - y_{\textup{avg}}x_i - a \sum_{i=1}^n (x_i^2 - x_{\textup{avg}}x_i)$

And so

$\displaystyle a = \frac{\sum_{i=1}^n (y_i - y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i - x_{\textup{avg}})x_i}$

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
xAvg, yAvg = map(avg, zip(*points))

aNum = 0
for (x,y) in points:
aNum += (y - yAvg) * x
aDenom += (x - xAvg) * x

a = float(aNum) / aDenom
b = yAvg - a * xAvg
return (a, b), lambda x: a*x + b


and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

## Many Variables and Matrix Form

If we take those two variables $x,y$ and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable $x_0$ which we force to be 1, and then we can write our prediction line in a vector form as $\mathbf{w} = (a,b)$. What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

$\displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b$

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line $\mathbf{w} = (w_0, w_1)$, and the coefficients of the input data $\mathbf{x} = (x_0, x_1)$. The output is still $y$. Here we understand implicitly that the indices line up: if $w_0$ is the constant term, then that makes $x_0 = 1$ our extra variable (often called a bias variable by statistically minded folks), and $x_1$ is the linear term with coefficient $w_1$. Now we can just write the prediction function as

$\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable $Y$ depending on the $n$ random variables $X_1, \dots, X_n$, then our data will come in tuples of the form $(\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y)$, where again the $x_0$ is fixed to 1. Then expanding our line $\mathbf{w} = (w_0 , \dots, w_n)$, our evaluation function is still $\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points $\mathbf{x}_j$ as rows of a matrix $X$ and the output values $y_j$ as entries of a vector $\mathbf{y}$. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

$y - Xw$

Indeed, each entry of the vector $Xw$ is a dot product of a row of $X$ (an input data point) with the coefficients of the line $w$. It’s just $\hat{f}$ applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

$\displaystyle S(w) = \| y - Xw \|^2$

This is just taking a dot product of $y-Xw$ with itself. This form is awkward to differentiate because the variable $w$ is nested in the norm. Luckily, we can get the same result by viewing $y - Xw$ as a 1-by-$n$ matrix, transposing it, and multiplying by $y-Xw$.

$\displaystyle S(w) = (y - Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw$

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of $S$ with respect to each of the $w_i$ variables in $w$ at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

$\displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw$

Indeed, it’s quite trivial to prove the latter formula, that for any vector $x$, the partial $\frac{\partial x^{\textup{T}}x}{\partial x} = 2x$. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for $w$ gives $w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y$, assuming the inverse of $X^{\textup{T}}X$ exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
# input points are n+1 tuples of n inputs and 1 output
X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
y = array([p[-1] for p in points])

Xt = transpose(X)
theInverse = inv(dot(Xt, X))
w = dot(dot(theInverse, Xt), y)
return w, lambda x: dot(w, x)


Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

## Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables $X,Y$. If we assume that $Y$ is perfectly determined by $X$ via some linear equation $Y = aX + b$, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new $x$ example, its corresponding $y$ value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of $y_i$ would be $y_i = ax_i + b + \eta_i$, and we say all the $\eta_i$ are drawn independently and uniformly at random from the set $\left \{ -1,1 \right \}$. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the $\eta_i$ depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call $H$ the random variable producing the noise (in Greek $H$ is the capital letter for $\eta$), and write $Y = aX + b + H$.

More realistically, the noise isn’t chosen uniformly from $\pm 1$, but is rather chosen to be Gaussian with mean $0$ and some variance $\sigma$. We’d denote this by $\eta_i \sim N(\mu, \sigma)$, and say the $\eta_i$ are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points $y$ are given by $y = Xw + H$ where $X$ is the observed data and $H$ is Gaussian noise with mean zero and some (unknown) standard deviation $\sigma$. Then if we call $\hat{w}$ our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

$\displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)$

Replacing $y$ by $Xw + H$,

$\displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}$

Notice that the first term is a fat matrix ($X^{\textup{T}}X$) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

$\textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)$

but $w$ is constant (so its expected value is just itself) and $\textup{E}(H | X) = 0$ by assumption that the noise is symmetric. So then the expected value of $\hat{w}$ is just $w$. Because this is true for all choices of data $X$, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of $\hat{w}$ actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the $w_i$ as variables depending on $X$, we apply the definition:

$\textup{Var}(\hat{w} | X) = \textup{E}(\| w - \textup{E}(w) \|^2 | X)$

And after some tedious expanding and rewriting and recalling that the covariance matrix of $H$ is just the diagonal matrix $\sigma^2 I_n$, we get that

$\textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}$

This means that if we get unlucky and draw some sample which makes some entries of $(X^{\textup{T}}X)^{-1}$ big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!

# Cauchy-Schwarz Inequality (and Amplification)

Problem: Prove that for vectors $v, w$ in an inner product space, the inequality

$\displaystyle |\left \langle v, w \right \rangle | \leq \| v \| \| w \|$

Solution: There is an elementary proof of the Cauchy-Schwarz inequality (see the Wikipedia article), and this proof is essentially the same. What makes this proof stand out is its insightful technique, which I first read about on Terry Tao’s blog. He calls it “textbook,” and maybe it is for an analyst, but it’s still very elegant.

We start by observing another inequality we know to be true, that $\| v - w \|^2 = \left \langle v - w, v - w \right \rangle \geq 0$, since norms are by definition nonnegative. By the properties of a complex inner product we can expand to get

$\displaystyle \| v \|^2 - 2 \textup{Re}(\left \langle v,w \right \rangle) + \| w \|^2 \geq 0$

or equivalently

$\displaystyle \textup{Re}(\left \langle v,w \right \rangle) \leq \frac{1}{2} \| v \|^2 + \frac{1}{2} \| w \|^2$

This inequality is close to the one we’re looking for, but ‘weaker’ because the inequality we seek squeezes inside the inequality we have. That is,

$\displaystyle \textup{Re}(\left \langle v,w \right \rangle) \leq |\left \langle v, w \right \rangle | \leq \| v \| \| w \| \leq \frac{1}{2} \| v \|^2 + \frac{1}{2} \| w \|^2$

The first inequality is trivial (a complex number is always greater than its real part), the second is the inequality we seek to prove, and the third is a consequence of the arithmetic-geometric mean inequality. And so we have an inequality we’d like to “tighten” to get the true theorem. We do this by tightening each side of the inequality separately, and we do each by exploiting symmetries in the expressions involved.

First, we observe that norms of vectors are preserved by (complex) rotations $v \mapsto e^{i \theta}v$, but the real part is not. Since this inequality is true no matter which vectors we choose, we can choose $\theta$ to our advantage. That is,

$\displaystyle \textup{Re}(\left \langle e^{i \theta}v, w \right \rangle) \leq \frac{1}{2} \| e^{i \theta}v \|^2 + \frac{1}{2} \| w \|^2$

And by properties of inner products and norms (pulling out scalars) and the fact that $|e^{i\theta}| = 1$, we can simplify to

$\displaystyle \textup{Re}(e^{i \theta}\left \langle v,w \right \rangle) \leq \frac{1}{2}\| v \|^2 + \frac{1}{2} \| w \|^2$

where $\theta$ is arbitrary. Since we want to maximize the left hand side as much as possible, we can choose $\theta$ to be whatever is required to make the number real. Then the real part is just the absolute value of the number itself, and we have

$\displaystyle \left |\langle v,w \right \rangle | \leq \frac{1}{2} \| v \|^2 + \frac{1}{2} \| w \|^2$

Now we tighten the right hand side by exploiting a symmetry in inner products: the transformation $(v,w) \mapsto (\lambda v, \frac{1}{\lambda} w)$ preserves the left hand side (since $|\lambda / \bar{\lambda}| = 1$) but not the right. And so by the same reasoning, we can transform the above inequality into

$\displaystyle \left |\langle v,w \right \rangle | \leq \frac{\lambda^2}{2} \| v \|^2 + \frac{1}{2 \lambda^2} \| w \|^2$

And by plugging in $\lambda = \sqrt{\| w \| / \| v \|}$ (indeed, this minimizes the expression for nonzero $v,w$) we get exactly the Cauchy-Schwarz inequality, as desired.

$\square$

This technique is termed “amplification” by Tao, and in his blog post he gives quite a few more advanced examples in harmonic and functional analysis (which are far beyond the scope of this blog).   The asymmetrical symmetry we took advantage of is a sort of “arbitrage” (again Terry’s clever choice of words) to take a weak fact and boost it to a stronger fact. And while the details of this proof are quite trivial, the technique of actively looking for one-sided symmetries is difficult to forget.

# Functoriality

Last time we worked through some basic examples of universal properties, specifically singling out quotients, products, and coproducts. There are many many more universal properties that we will mention as we encounter them, but there is one crucial topic in category theory that we have only hinted at: functoriality.

As we’ve repeatedly stressed, the meat of category theory is in the morphisms. One natural question one might ask is, what notion of morphism is there between categories themselves? Indeed, the most straightforward way to see category theoretic concepts in classical mathematics is in a clever choice of functor. For example (and this example isn’t necessary for the rest of the article) one can “associate” to each topological space a group, called the homology group, in such a way that continuous functions on topological spaces translate to group homomorphisms. Moreover, this translation is functorial in the following sense: the group homomorphism associated to a composition is the composition of the associated group homomorphisms. If we denote the association by a subscripted asterisk, then we get the following common formula.

$\displaystyle (fg)_* = f_* g_*$

This is the crucial property that maintains the structure of morphisms. Again, this should reinforce the idea that the crucial ingredient of every definition in category theory is its effect on morphisms.

## Functors: a Definition

In complete generality, a functor is a mapping between two categories which preserves the structure of morphisms. Formally,

Definition: Let $\mathbf{C}, \mathbf{D}$ be categories. A functor $\mathscr{F}$ consists of two parts:

• For each object $C \in \mathbf{C}$ an associated object $\mathscr{F}(C) \in \mathbf{D}$.
• For each morphism $f \in \textup{Hom}_{\mathbf{C}}(A,B)$ a corresponding morphism $\mathscr{F}(f) \in \textup{Hom}_{\mathbf{D}}(\mathscr{F}(A), \mathscr{F}(B))$. Specifically, for each $A,B$ we have a set-function $\textup{Hom}_{\mathbf{C}}(A,B) \to \textup{Hom}_{\mathbf{C}}(\mathscr{F}(A), \mathscr{F}(B))$.

There are two properties that a functor needs to satisfy to “preserve structure.” The first is that the identity morphisms are preserved for every object; that is, $\mathscr{F}(1_A) = 1_{\mathscr{F}(A)}$ for every object $A$. Second, composition must be preserved. That is, if $f \in \textup{Hom}_{\mathbf{C}}(A,B)$ and $g \in \textup{Hom}_{\mathbf{C}}(B,C)$, we have

$\displaystyle \mathscr{F}(gf) = \mathscr{F}(g) \mathscr{F}(f)$

We often denote a functor as we would a function $\mathscr{F}: \mathbf{C} \to \mathbf{D}$, and use the function application notation as if everything were with sets.

Let’s look at a few simple examples.

Let $\mathbf{FiniteSet}$ be the poset category of finite sets with subsets as morphisms, and let $\mathbf{Int}$ be the category whose objects are integers where there is a unique morphism from $x \to y$ if $x \leq y$. Then the size function is a functor $\mathbf{Set} \to \mathbf{Int}$. Continuing with $\mathbf{Int}$, remember that $\mathbf{Int}$ forms a group under addition (also known as $\mathbb{Z}$). And so by its very definition any group homomorphism $\mathbb{Z} \to \mathbb{Z}$ is a functor from $\mathbf{Int} \to \mathbf{Int}$. A functor from a category to itself is often called an endofunctor.

There are many examples of functors from the category $\mathbf{Top}$ of topological spaces to the category $\mathbf{Grp}$ of groups. These include some examples we’ve seen on this blog, such as the fundamental group and homology groups.

One trivial example of a functor is called the forgetful functor. Let $\mathbf{C}$ be a category whose objects are sets and whose morphisms are set-maps with additional structure, for example the category of groups. Define a functor $\mathbf{C} \to \mathbf{Set}$ which acts as the identity on both sets and functions. This functor simply “forgets” the structure in $\mathbf{C}$. In the realm of programs and types (this author is thinking Java), one can imagine this as a ‘type-cast’ from String to Object.  In the same vein, one could define an “identity” endofunctor $\mathbf{C} \to \mathbf{C}$ which does absolutely nothing.

One interesting way to think of a functor is as a “realization” of one category inside another. In particular, because the composition structure of $\mathbf{C}$ is preserved by a functor $\mathbf{C} \to \mathbf{D}$, it must be the case that all commutative diagrams are “sent” to commutative diagrams. In addition, isomorphisms are sent to isomorphisms because if $f, g$ are inverses of each other, $1_{\mathscr{F}(A)} = \mathscr{F}(1_A) = \mathscr{F}(gf) = \mathscr{F}(g)\mathscr{F}(f)$ and likewise for the reverse composition. And so if we have a functor $\mathscr{F}$ from a poset category (say, the real numbers with the usual inequality) to some category $\mathbf{C}$, then we can realize the structure of the poset sitting inside of $\mathbf{C}$ (perhaps involving only some of the objects of $\mathbf{C}$). This view comes in handy in a few places we’ll see later in our series on computational topology.

## The Hom Functor

There is a very important and nontrivial example called the “hom functor” which is motivated by the category of vector spaces. We’ll stick to the concrete example of vector spaces, but the generalization to arbitrary categories is straightforward. If the reader knows absolutely nothing about vector spaces, replace “vector space” with “object” and “linear map” with “morphism.” It won’t quite be correct, but it will get the idea across.

To each vector space $V$ one can define a dual vector space of functions $V \to \mathbb{C}$ (or whatever the field of scalars for $V$ is). Following the lead of hom sets, the dual vector space is denoted $\textup{Hom}(V,\mathbb{C})$. Here the morphisms in the set are those from the category of vector spaces (that is, linear maps $V \to \mathbb{C}$). Indeed, this is a vector space: one can add two functions pointwise ($(f+g)(v) = f(v) + g(v)$) and scale them ($(\lambda f)(v) = \lambda f(v)$), and the properties for a vector spaces are trivial to check.

Now the mapping $\textup{Hom}(-, \mathbb{C})$ which takes $V$ and produces $\textup{Hom}(V, \mathbb{C})$ is a functor called the hom functor. But let’s inspect this one more closely. The source category is obviously the category of vector spaces, but what is the target category? The objects are clear: the hom sets $\textup{Hom}(V,\mathbb{C})$ where $V$ is a vector space. The morphisms of the category are particularly awkward. Officially, they are written as

$\displaystyle \textup{Hom}(\textup{Hom}(V, \mathbb{C}), \textup{Hom}(W, \mathbb{C}))$

so a morphism in this category takes as input a linear map $V \to \mathbb{C}$ and produces as output one $W \to \mathbb{C}$. But what are the morphisms in words we can understand? And how can we compose them? Before reading on, think about what a morphism of morphisms should look like.

The morphisms in this category can be thought of as linear maps $W \to V$. More specifically, given a morphism $\varphi: V \to \mathbb{C}$ and a linear map $g: W \to V$, we can construct a linear map $W \to \mathbb{C}$ by composing $\varphi g$.

And so if we apply the $\textup{Hom}$ functor to a morphism $f: W \to V$, we get a morphism in $\textup{Hom}(\textup{Hom}(V, \mathbb{C}), \textup{Hom}(W, \mathbb{C}))$. Let’s denote the application of the hom functor using an asterisk so that $f \mapsto f_*$.

But wait a minute! The mapping here is going in the wrong direction: we took a map in one category going from the $V$ side to the $W$ side, and after applying the functor we got a map going from the $W$ side ($\textup{Hom}(W, \mathbb{C})$) to the $V$ side ($\textup{Hom}(V, \mathbb{C})$). It seems there is no reasonable way to take a map $V \to \mathbb{C}$ and get a map in $W \to \mathbb{C}$ using just $f$, but the other way is obvious. The hom functor “goes backward” in a sense. In other words, the composition property for our “functor” makes the composite $(g f)_*$ to the map taking $\varphi$ to $\varphi g f$. On the other hand, there is no way to compose $g_* f_*$, as they operate on the wrong domains! It must be the other way around:

$\displaystyle (gf)_* = f_* g_*$

We advise the reader to write down the commutative diagram and trace out the compositions to make sure everything works out. But this is a problem, because it makes the hom functor fail the most important requirement. In order to fix this reversal “problem,” we make the following definition:

Definition: A functor $\mathscr{F} : \mathbf{C} \to \mathbf{D}$ is called covariant if it preserves the order of morphism composition, so that $\mathscr{F}(gf) = \mathscr{F}(g) \mathscr{F}(f)$. If it reverses the order, we call it contravariant.

And so the hom functor on vector spaces is a contravariant functor, while all of the other functors we’ve defined in this post are covariant.

There is another way to describe a contravariant functor as a covariant functor which is often used. It involves the idea of an “opposite” category. For any category $\mathbf{C}$ we can define the opposite category $\mathbf{C}^{\textup{op}}$ to be a category with the same objects as $\mathbf{C}$, but with all morphisms reversed. That is, we define

$\displaystyle \textup{Hom}_{\mathbf{C}^{\textup{op}}}(A,B) = \textup{Hom}_{\mathbf{C}}(B,A)$

We leave it to the reader to verify that this is indeed a category. It is also not hard to see that $(\mathbf{C}^{\textup{op}})^{\textup{op}} = \mathbf{C}$. Opposite categories give us a nice recharacterization of a contrvariant functor. Indeed, because composition in opposite categories is reversed, a contravariant functor $\mathbf{C} \to \mathbf{D}$ is just a covariant functor on the opposite category $\mathbf{C}^{\textup{op}} \to \mathbf{D}$. Or equivalently, one $\mathbf{C} \to \mathbf{D}^{\textup{op}}$. More than anything, opposite categories are syntactical sugar. Composition is only reversed artificially to make domains and codomains line up, but the actual composition is the same as in the original category.

## Functors as Types

Before we move on to some code, let’s take a step back and look at the big picture (we’ve certainly plowed through enough details up to this point). The main thesis is that functoriality is a valuable property for an operation to have, but it’s not entirely clear why. Even the brightest of readers can only assume such properties are useful for mathematical analysis. It seems that the question we started this series out with, “what does category theory allow us to do that we couldn’t do before?” still has the answer, “nothing.” More relevantly, the question of what functoriality allows us to do is unclear. Indeed, once again the answer is “nothing.” Rather, functoriality in a computation allows one to analyze the behavior of a program. It gives the programmer a common abstraction in which to frame operations, and ease in proving the correctness of one’s algorithms.

In this light, the best we can do in implementing functors in programs is to give a type definition and examples. And in this author’s opinion this series is quickly becoming boring (all of the computational examples are relatively lame), so we will skip the examples in favor of the next post which will analyze more meaty programming constructs from a categorical viewpoint.

So recall the ML type definition of a category, a tuple of operations for source, target, identity, and composition:

datatype ('object, 'arrow)Category =
category of ('arrow -> 'object) *
('arrow -> 'object) *
('object -> 'arrow) *
('arrow * 'arrow -> 'arrow)

And so a functor consists of the two categories involved (as types), and the mapping on objects, and the mapping on morphisms.

datatype ('cObject, 'cArrow, 'dObject, 'dArrow)Functor =
aFunctor of ('cObject, 'cArrow)Category *
('cObject -> 'dObject) *
('cArrow -> 'dArrow) *
('dObject -> 'dArrow)Category

We encourage the reader who is uncomfortable with these type definitions to experiment with them by implementing some of our simpler examples (say, the size functor from sets to integers). Insofar as the basic definitions go, functors are not all that interesting. They become much more interesting when additional structure is imposed on them, and in the distant future we will see a glimpse of this in the form of adjointness. We hope to get around to analyzing statements like “syntax and semantics are adjoint functors.” For the next post in this series, we will take the three beloved functions of functional programming (map, foldl(r), and filter), and see what their categorical properties are.

Until then!

# Computing Homology

In our last post in this series on topology, we defined the homology group. Specifically, we built up a topological space as a simplicial complex (a mess of triangles glued together), we defined an algebraic way to represent collections of simplices called chains as vectors in a vector space, we defined the boundary homomorphism as a linear map on chains, and finally defined the homology groups as the quotient vector spaces

$\displaystyle H_k(X) = \frac{\textup{ker} \partial_k}{\textup{im} \partial_{k+1}}$.

The number of holes in $X$ was just the dimension of this quotient space.

In this post we will be quite a bit more explicit. Because the chain groups are vector spaces and the boundary mappings are linear maps, they can be represented as matrices whose dimensions depend on our simplicial complex structure. Better yet, if we have explicit representations of our chains by way of a basis, then we can use row-reduction techniques to write the matrix in a standard form.

Of course the problem arises when we want to work with two matrices simultaneously (to compute the kernel-mod-image quotient above). This is not computationally any more difficult, but it requires some theoretical fiddling. We will need to dip a bit deeper into our linear algebra toolboxes to see how it works, so the rusty reader should brush up on their linear algebra before continuing (or at least take some time to sort things out if or when confusion strikes).

Without further ado, let’s do an extended example and work our ways toward a general algorithm. As usual, all of the code used for this post is available on this blog’s Github page.

## Two Big Matrices

Recall our example simplicial complex from last time.

We will compute $H_1$ of this simplex (which we saw last time was $\mathbb{Q}$) in a more algorithmic way than we did last time.

Once again, we label the vertices 0-4 so that the extra “arm” has vertex 4 in the middle, and its two endpoints are 0 and 2. This gave us orientations on all of the simplices, and the following chain groups. Since the vertex labels (and ordering) are part of the data of a simplicial complex, we have made no choices in writing these down.

$\displaystyle C_0(X) = \textup{span} \left \{ 0,1,2,3,4 \right \}$

$\displaystyle C_1(X) = \textup{span} \left \{ [0,1], [0,2], [0,3], [0,4], [1,2], [1,3],[2,3],[2,4] \right \}$

$\displaystyle C_2(X) = \textup{span} \left \{ [0,1,2], [0,1,3], [0,2,3], [1,2,3] \right \}$

Now given our known definitions of $\partial_k$ as an alternating sum from last time, we can give a complete specification of the boundary map as a matrix. For $\partial_1$, this would be

$\displaystyle \partial_1 = \bordermatrix{ & [0,1] & [0,2] & [0,3] & [0,4] & [1,2] & [1,3] & [2,3] & [2,4] \cr 0 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0\cr 1 & 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\cr 2 & 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1 \cr 3 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \cr 4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 }$,

where the row labels are the basis for $C_0(X)$ and the column labels are the basis for $C_1(X)$. Similarly, $\partial_2$ is

$\displaystyle \partial_2 = \bordermatrix{ & [0,1,2] & [0,1,3] & [0,2,3] & [1,2,3] \cr [0,1] & 1 & 1 & 0 & 0\cr [0,2] & -1 & 0 & 1 & 0\cr [0,3] & 0 & -1 & -1 & 0\cr [0,4] & 0 & 0 & 0 & 0\cr [1,2] & 1 & 0 & 0 & 1\cr [1,3] & 0 & 1 & 0 & -1\cr [2,3] & 0 & 0 & 1 & 1\cr [2,4] & 0 & 0 & 0 & 0}$

The reader is encouraged to check that these matrices are written correctly by referring to the formula for $\partial$ as given last time.

Remember the crucial property of $\partial$, that $\partial^2 = \partial_k \partial_{k+1} = 0$. Indeed, the composition of the two boundary maps just corresponds to the matrix product of the two matrices, and one can verify by hand that the above two matrices multiply to the zero matrix.

We know from basic linear algebra how to compute the kernel of a linear map expressed as a matrix: column reduce and inspect the columns of zeros. Since the process of row reducing is really a change of basis, we can encapsulate the reduction inside a single invertible matrix $A$, which, when left-multiplied by $\partial$, gives us the reduced form of the latter. So write the reduced form of $\partial_1$ as $\partial_1 A$.

However, now we’re using two different sets of bases for the shared vector space involved in $\partial_1$ and $\partial_2$. In general, it will no longer be the case that $\partial_kA\partial_{k+1} = 0$. The way to alleviate this is to perform the “corresponding” change of basis in $\partial_{k+1}$. To make this idea more transparent, we return to the basics.

## Changing Two Matrices Simultaneously

Recall that a matrix $M$ represents a linear map between two vector spaces $f : V \to W$. The actual entries of $M$ depend crucially on the choice of a basis for the domain and codomain. Indeed, if $v_i$ form a basis for $V$ and $w_j$ for $W$, then the $k$-th column of the matrix representation $M$ is defined to be the coefficients of the representation of $f(v_k)$ in terms of the $w_j$. We hope to have nailed this concept down firmly in our first linear algebra primer.

Recall further that row operations correspond to changing a basis for the codomain, and column operations correspond to changing a basis for the domain. For example, the idea of swapping columns $i,j$ in $M$ gives a new matrix which is the representation of $f$ with respect to the (ordered) basis for $V$ which swaps the order of $v_i , v_j$. Similar things happen for all column operations (they all correspond to manipulations of the basis for $V$), while analogously row operations implicitly transform the basis for the codomain. Note, though, that the connection between row operations and transformations of the basis for the codomain are slightly more complicated than they are for the column operations. We will explicitly see how it works later in the post.

And so if we’re working with two maps $A: U \to V$ and $B: V \to W$, and we change a basis for $V$ in $B$ via column reductions, then in order to be consistent, we need to change the basis for $V$ in $A$ via “complementary” row reductions. That is, if we call the change of basis matrix $Q$, then we’re implicitly sticking $Q$ in between the composition $BA$ to get $(BQ)A$. This is not the same map as $BA$, but we can make it the same map by adding a $Q^{-1}$ in the right place:

$\displaystyle BA = B(QQ^{-1})A = (BQ)(Q^{-1}A)$

Indeed, whenever $Q$ is a change of basis matrix so is $Q^{-1}$ (trivially), and moreover the operations that $Q$ performs on the columns of $B$ are precisely the operations that $Q^{-1}$ performs on the rows of $A$ (this is because elementary row operations take different forms when multiplied on the left or right).

Coming back to our boundary operators, we want a canonical way to view the image of $\partial_{k+1}$ as sitting inside the kernel of $\partial_k$. If we go ahead and use column reductions to transform $\partial_k$ into a form where the kernel is easy to read off (as the columns consisting entirely of zeroes), then the corresponding row operations, when performed on $\partial_{k+1}$ will tell us exactly the image of $\partial_{k+1}$ inside the kernel of $\partial_k$.

This last point is true precisely because $\textup{im} \partial_{k+1} \subset \textup{ker} \partial_k$. This fact guarantees that the irrelevant rows of the reduced version of $\partial_{k+1}$ are all zero.

Let’s go ahead and see this in action on our two big matrices above. For $\partial_1$, the column reduction matrix is

$\displaystyle A = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & -1 & 0 & 1 & 1\\ 0 & 0 & 0 & 1 & 0 & -1 & -1 & 0\\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & -1\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

And the product $\partial_1 A$ is

$\displaystyle \partial_1 A = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \end{pmatrix}$

Now the inverse of $A$, which is the corresponding basis change for $\partial_2$, is

$\displaystyle A^{-1} = \begin{pmatrix} -1 & -1 & -1 & -1 & -0 & -0 & -0 & -0\\ 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1\\ 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

and the corresponding reduced form of $\partial_2$ is

$\displaystyle A^{-1} \partial_2 = \begin{pmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 1 & 0 & 0 & 1\\ 0 & 1 & 0 & -1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}$

As a side note, we got these matrices by slightly modifying the code from our original post on row reduction to output the change of basis matrix in addition to performing row reduction. It turns out one can implement column reduction as row reduction of the transpose, and the change of basis matrix you get from this process will be the transpose of the change of basis matrix you want (by $(AB)^\textup{T} = (B^\textup{T}A^\textup{T})$). Though the code is particularly ad-hoc, we include it with the rest of the code used in this post on this blog’s Github page.

Now let’s inspect the two matrices $\partial_1 A$ and $A^{-1} \partial_2$ more closely. The former has four “pivots” left over, and this corresponds to the rank of the matrix being 4. Moreover, the four basis vectors representing the columns with nonzero pivots, which we’ll call $v_1, v_2, v_3, v_4$ (we don’t care what their values are), span a complementary subspace to the kernel of $\partial_1$. Hence, the remaining four vectors (which we’ll call $v_5, v_6, v_7, v_8$) span the kernel. In particular, this says that the kernel has dimension 4.

On the other hand, we performed the same transformation of the basis of $C_1(X)$ for $\partial_2$. Looking at the matrix that resulted, we see that the first four rows and the last row (representing $v_1, v_2, v_3, v_4, v_8$) are entirely zeros and so the image of $\partial_2$ intersects their span trivially. and the remaining three rows (representing $v_5, v_6, v_7$) have nonzero pivots. This tells us exactly that the image of $\partial_2$ is spanned by $v_5, v_6, v_7$.

And now, the coup de grâce, the quotient to get homology is simply

$\displaystyle \frac{ \textup{span} \left \{ v_5, v_6, v_7, v_8 \right \}}{ \textup{span} \left \{ v_5, v_6, v_7 \right \}} = \textup{span} \left \{ v_8 \right \}$

And the dimension of the homology group is 1, as desired.

## The General Algorithm

It is no coincidence that things worked out at nicely as they did. The process we took of simultaneously rewriting two matrices with respect to a common basis is the bulk of the algorithm to compute homology. Since we’re really only interested in the dimensions of the homology groups, we just need to count pivots. If the number of pivots arising in $\partial_k$ is $y$ and the number of pivots arising in $\partial_{k+1}$ is $z$, and the dimension of $C_k(X)$ is $n$, then the dimension is exactly

$(n-y) - z = \textup{dim}(\textup{ker} \partial_k) - \textup{dim}(\textup{im}\partial_{k+1})$

And it is no coincidence that the pivots lined up so nicely to allow us to count dimensions this way. It is a minor exercise to prove it formally, but the fact that the composition $\partial_k \partial_{k+1} = 0$ implies that the reduced version of $\partial_{k+1}$ will have an almost reduced row-echelon form (the only difference being the rows of zeros interspersed above, below, and possibly between pivot rows).

As the reader may have guessed at this point, we don’t actually need to compute $A$ and $A^{-1}$. Instead of this, we can perform the column/row reductions simultaneously on the two matrices. The above analysis helped us prove the algorithm works, and with that guarantee we can throw out the analytical baggage and just compute the damn thing.

Indeed, assuming the input is already processed as two matrices representing the boundary operators with respect to the standard bases of the chain groups, computing homology is only slightly more difficult than row reducing in the first place. Putting our homology where our mouth is, we’ve implemented the algorithm in Python. As usual, the entire code used in this post is available on this blog’s Github page.

The first step is writing auxiliary functions to do elementary row and column operations on matrices. For this post, we will do everything in numpy (which makes the syntax shorter than standard Python syntax, but dependent on the numpy library).

import numpy

def rowSwap(A, i, j):
temp = numpy.copy(A[i, :])
A[i, :] = A[j, :]
A[j, :] = temp

def colSwap(A, i, j):
temp = numpy.copy(A[:, i])
A[:, i] = A[:, j]
A[:, j] = temp

def scaleCol(A, i, c):
A[:, i] *= c*numpy.ones(A.shape[0])

def scaleRow(A, i, c):
A[i, :] *= c*numpy.ones(A.shape[1])

def colCombine(A, addTo, scaleCol, scaleAmt):
A[:, addTo] += scaleAmt * A[:, scaleCol]

def rowCombine(A, addTo, scaleRow, scaleAmt):
A[addTo, :] += scaleAmt * A[scaleRow, :]


From here, the main meat of the algorithm is doing column reduction on one matrix, and applying the corresponding row operations on the other.

def simultaneousReduce(A, B):
if A.shape[1] != B.shape[0]:
raise Exception("Matrices have the wrong shape.")

numRows, numCols = A.shape # col reduce A

i,j = 0,0
while True:
if i >= numRows or j >= numCols:
break

if A[i][j] == 0:
nonzeroCol = j
while nonzeroCol < numCols and A[i,nonzeroCol] == 0:
nonzeroCol += 1

if nonzeroCol == numCols:
j += 1
continue

colSwap(A, j, nonzeroCol)
rowSwap(B, j, nonzeroCol)

pivot = A[i,j]
scaleCol(A, j, 1.0 / pivot)
scaleRow(B, j, 1.0 / pivot)

for otherCol in range(0, numCols):
if otherCol == j:
continue
if A[i, otherCol] != 0:
scaleAmt = -A[i, otherCol]
colCombine(A, otherCol, j, scaleAmt)
rowCombine(B, j, otherCol, -scaleAmt)

i += 1; j+= 1

return A,B


This more or less parallels the standard algorithm for row-reduction (with the caveat that all the indices are swapped to do column-reduction). The only somewhat confusing line is the call to rowCombine, which explicitly realizes the corresponding row operation as the inverse of the performed column operation. Note that for row operations, the correspondence between operations on the basis and operations on the rows is not as direct as it is for columns. What’s given above is the true correspondence. Writing down lots of examples will reveal why, and we leave that as an exercise to the reader.

Then the actual algorithm to compute homology is just a matter of counting pivots. Here are two pivot-counting functions in a typical numpy fashion:

def numPivotCols(A):
z = numpy.zeros(A.shape[0])
return [numpy.all(A[:, j] == z) for j in range(A.shape[1])].count(False)

def numPivotRows(A):
z = numpy.zeros(A.shape[1])
return [numpy.all(A[i, :] == z) for i in range(A.shape[0])].count(False)


And the final function is just:

def bettiNumber(d_k, d_kplus1):
A, B = numpy.copy(d_k), numpy.copy(d_kplus1)
simultaneousReduce(A, B)

dimKChains = A.shape[1]
kernelDim = dimKChains - numPivotCols(A)
imageDim = numPivotRows(B)

return kernelDim - imageDim


And there we have it! We’ve finally tackled the beast, and written a program to compute algebraic features of a topological space.

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex using by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis.

Until then!