# Computing Homology

In our last post in this series on topology, we defined the homology group. Specifically, we built up a topological space as a simplicial complex (a mess of triangles glued together), we defined an algebraic way to represent collections of simplices called chains as vectors in a vector space, we defined the boundary homomorphism as a linear map on chains, and finally defined the homology groups as the quotient vector spaces

$\displaystyle H_k(X) = \frac{\textup{ker} \partial_k}{\textup{im} \partial_{k+1}}$.

The number of holes in $X$ was just the dimension of this quotient space.

In this post we will be quite a bit more explicit. Because the chain groups are vector spaces and the boundary mappings are linear maps, they can be represented as matrices whose dimensions depend on our simplicial complex structure. Better yet, if we have explicit representations of our chains by way of a basis, then we can use row-reduction techniques to write the matrix in a standard form.

Of course the problem arises when we want to work with two matrices simultaneously (to compute the kernel-mod-image quotient above). This is not computationally any more difficult, but it requires some theoretical fiddling. We will need to dip a bit deeper into our linear algebra toolboxes to see how it works, so the rusty reader should brush up on their linear algebra before continuing (or at least take some time to sort things out if or when confusion strikes).

Without further ado, let’s do an extended example and work our ways toward a general algorithm. As usual, all of the code used for this post is available on this blog’s Google Code page.

## Two Big Matrices

Recall our example simplicial complex from last time.

We will compute $H_1$ of this simplex (which we saw last time was $\mathbb{Q}$) in a more algorithmic way than we did last time.

Once again, we label the vertices 0-4 so that the extra “arm” has vertex 4 in the middle, and its two endpoints are 0 and 2. This gave us orientations on all of the simplices, and the following chain groups. Since the vertex labels (and ordering) are part of the data of a simplicial complex, we have made no choices in writing these down.

$\displaystyle C_0(X) = \textup{span} \left \{ 0,1,2,3,4 \right \}$

$\displaystyle C_1(X) = \textup{span} \left \{ [0,1], [0,2], [0,3], [0,4], [1,2], [1,3],[2,3],[2,4] \right \}$

$\displaystyle C_2(X) = \textup{span} \left \{ [0,1,2], [0,1,3], [0,2,3], [1,2,3] \right \}$

Now given our known definitions of $\partial_k$ as an alternating sum from last time, we can give a complete specification of the boundary map as a matrix. For $\partial_1$, this would be

$\displaystyle \partial_1 = \bordermatrix{ & [0,1] & [0,2] & [0,3] & [0,4] & [1,2] & [1,3] & [2,3] & [2,4] \cr 0 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0\cr 1 & 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\cr 2 & 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1 \cr 3 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \cr 4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 }$,

where the row labels are the basis for $C_0(X)$ and the column labels are the basis for $C_1(X)$. Similarly, $\partial_2$ is

$\displaystyle \partial_2 = \bordermatrix{ & [0,1,2] & [0,1,3] & [0,2,3] & [1,2,3] \cr [0,1] & 1 & 1 & 0 & 0\cr [0,2] & -1 & 0 & 1 & 0\cr [0,3] & 0 & -1 & -1 & 0\cr [0,4] & 0 & 0 & 0 & 0\cr [1,2] & 1 & 0 & 0 & 1\cr [1,3] & 0 & 1 & 0 & -1\cr [2,3] & 0 & 0 & 1 & 1\cr [2,4] & 0 & 0 & 0 & 0}$

The reader is encouraged to check that these matrices are written correctly by referring to the formula for $\partial$ as given last time.

Remember the crucial property of $\partial$, that $\partial^2 = \partial_k \partial_{k+1} = 0$. Indeed, the composition of the two boundary maps just corresponds to the matrix product of the two matrices, and one can verify by hand that the above two matrices multiply to the zero matrix.

We know from basic linear algebra how to compute the kernel of a linear map expressed as a matrix: column reduce and inspect the columns of zeros. Since the process of row reducing is really a change of basis, we can encapsulate the reduction inside a single invertible matrix $A$, which, when left-multiplied by $\partial$, gives us the reduced form of the latter. So write the reduced form of $\partial_1$ as $\partial_1 A$.

However, now we’re using two different sets of bases for the shared vector space involved in $\partial_1$ and $\partial_2$. In general, it will no longer be the case that $\partial_kA\partial_{k+1} = 0$. The way to alleviate this is to perform the “corresponding” change of basis in $\partial_{k+1}$. To make this idea more transparent, we return to the basics.

## Changing Two Matrices Simultaneously

Recall that a matrix $M$ represents a linear map between two vector spaces $f : V \to W$. The actual entries of $M$ depend crucially on the choice of a basis for the domain and codomain. Indeed, if $v_i$ form a basis for $V$ and $w_j$ for $W$, then the $k$-th column of the matrix representation $M$ is defined to be the coefficients of the representation of $f(v_k)$ in terms of the $w_j$. We hope to have nailed this concept down firmly in our first linear algebra primer.

Recall further that row operations correspond to changing a basis for the codomain, and column operations correspond to changing a basis for the domain. For example, the idea of swapping columns $i,j$ in $M$ gives a new matrix which is the representation of $f$ with respect to the (ordered) basis for $V$ which swaps the order of $v_i , v_j$. Similar things happen for all column operations (they all correspond to manipulations of the basis for $V$), while analogously row operations implicitly transform the basis for the codomain. Note, though, that the connection between row operations and transformations of the basis for the codomain are slightly more complicated than they are for the column operations. We will explicitly see how it works later in the post.

And so if we’re working with two maps $A: U \to V$ and $B: V \to W$, and we change a basis for $V$ in $B$ via column reductions, then in order to be consistent, we need to change the basis for $V$ in $A$ via “complementary” row reductions. That is, if we call the change of basis matrix $Q$, then we’re implicitly sticking $Q$ in between the composition $BA$ to get $(BQ)A$. This is not the same map as $BA$, but we can make it the same map by adding a $Q^{-1}$ in the right place:

$\displaystyle BA = B(QQ^{-1})A = (BQ)(Q^{-1}A)$

Indeed, whenever $Q$ is a change of basis matrix so is $Q^{-1}$ (trivially), and moreover the operations that $Q$ performs on the columns of $B$ are precisely the operations that $Q^{-1}$ performs on the rows of $A$ (this is because elementary row operations take different forms when multiplied on the left or right).

Coming back to our boundary operators, we want a canonical way to view the image of $\partial_{k+1}$ as sitting inside the kernel of $\partial_k$. If we go ahead and use column reductions to transform $\partial_k$ into a form where the kernel is easy to read off (as the columns consisting entirely of zeroes), then the corresponding row operations, when performed on $\partial_{k+1}$ will tell us exactly the image of $\partial_{k+1}$ inside the kernel of $\partial_k$.

This last point is true precisely because $\textup{im} \partial_{k+1} \subset \textup{ker} \partial_k$. This fact guarantees that the irrelevant rows of the reduced version of $\partial_{k+1}$ are all zero.

Let’s go ahead and see this in action on our two big matrices above. For $\partial_1$, the column reduction matrix is

$\displaystyle A = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & -1 & 0 & 1 & 1\\ 0 & 0 & 0 & 1 & 0 & -1 & -1 & 0\\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & -1\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

And the product $\partial_1 A$ is

$\displaystyle \partial_1 A = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \end{pmatrix}$

Now the inverse of $A$, which is the corresponding basis change for $\partial_2$, is

$\displaystyle A^{-1} = \begin{pmatrix} -1 & -1 & -1 & -1 & -0 & -0 & -0 & -0\\ 1 & 0 & 0 & 0 & -1 & -1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 & -1 & -1\\ 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

and the corresponding reduced form of $\partial_2$ is

$\displaystyle A^{-1} \partial_2 = \begin{pmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 1 & 0 & 0 & 1\\ 0 & 1 & 0 & -1\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0 \end{pmatrix}$

As a side note, we got these matrices by slightly modifying the code from our original post on row reduction to output the change of basis matrix in addition to performing row reduction. It turns out one can implement column reduction as row reduction of the transpose, and the change of basis matrix you get from this process will be the transpose of the change of basis matrix you want (by $(AB)^\textup{T} = (B^\textup{T}A^\textup{T})$). Though the code is particularly ad-hoc, we include it with the rest of the code used in this post on this blog’s Google Code page.

Now let’s inspect the two matrices $\partial_1 A$ and $A^{-1} \partial_2$ more closely. The former has four “pivots” left over, and this corresponds to the rank of the matrix being 4. Moreover, the four basis vectors representing the columns with nonzero pivots, which we’ll call $v_1, v_2, v_3, v_4$ (we don’t care what their values are), span a complementary subspace to the kernel of $\partial_1$. Hence, the remaining four vectors (which we’ll call $v_5, v_6, v_7, v_8$) span the kernel. In particular, this says that the kernel has dimension 4.

On the other hand, we performed the same transformation of the basis of $C_1(X)$ for $\partial_2$. Looking at the matrix that resulted, we see that the first four rows and the last row (representing $v_1, v_2, v_3, v_4, v_8$) are entirely zeros and so the image of $\partial_2$ intersects their span trivially. and the remaining three rows (representing $v_5, v_6, v_7$) have nonzero pivots. This tells us exactly that the image of $\partial_2$ is spanned by $v_5, v_6, v_7$.

And now, the coup de grâce, the quotient to get homology is simply

$\displaystyle \frac{ \textup{span} \left \{ v_5, v_6, v_7, v_8 \right \}}{ \textup{span} \left \{ v_5, v_6, v_7 \right \}} = \textup{span} \left \{ v_8 \right \}$

And the dimension of the homology group is 1, as desired.

## The General Algorithm

It is no coincidence that things worked out at nicely as they did. The process we took of simultaneously rewriting two matrices with respect to a common basis is the bulk of the algorithm to compute homology. Since we’re really only interested in the dimensions of the homology groups, we just need to count pivots. If the number of pivots arising in $\partial_k$ is $y$ and the number of pivots arising in $\partial_{k+1}$ is $z$, and the dimension of $C_k(X)$ is $n$, then the dimension is exactly

$(n-y) - z = \textup{dim}(\textup{ker} \partial_k) - \textup{dim}(\textup{im}\partial_{k+1})$

And it is no coincidence that the pivots lined up so nicely to allow us to count dimensions this way. It is a minor exercise to prove it formally, but the fact that the composition $\partial_k \partial_{k+1} = 0$ implies that the reduced version of $\partial_{k+1}$ will have an almost reduced row-echelon form (the only difference being the rows of zeros interspersed above, below, and possibly between pivot rows).

As the reader may have guessed at this point, we don’t actually need to compute $A$ and $A^{-1}$. Instead of this, we can perform the column/row reductions simultaneously on the two matrices. The above analysis helped us prove the algorithm works, and with that guarantee we can throw out the analytical baggage and just compute the damn thing.

Indeed, assuming the input is already processed as two matrices representing the boundary operators with respect to the standard bases of the chain groups, computing homology is only slightly more difficult than row reducing in the first place. Putting our homology where our mouth is, we’ve implemented the algorithm in Python. As usual, the entire code used in this post is available on this blog’s Google Code page.

The first step is writing auxiliary functions to do elementary row and column operations on matrices. For this post, we will do everything in numpy (which makes the syntax shorter than standard Python syntax, but dependent on the numpy library).

import numpy

def rowSwap(A, i, j):
temp = numpy.copy(A[i, :])
A[i, :] = A[j, :]
A[j, :] = temp

def colSwap(A, i, j):
temp = numpy.copy(A[:, i])
A[:, i] = A[:, j]
A[:, j] = temp

def scaleCol(A, i, c):
A[:, i] *= c*numpy.ones(A.shape[0])

def scaleRow(A, i, c):
A[i, :] *= c*numpy.ones(A.shape[1])

A[:, addTo] += scaleAmt * A[:, scaleCol]

A[addTo, :] += scaleAmt * A[scaleRow, :]


From here, the main meat of the algorithm is doing column reduction on one matrix, and applying the corresponding row operations on the other.

def simultaneousReduce(A, B):
if A.shape[1] != B.shape[0]:
raise Exception("Matrices have the wrong shape.")

numRows, numCols = A.shape # col reduce A

i,j = 0,0
while True:
if i >= numRows or j >= numCols:
break

if A[i][j] == 0:
nonzeroCol = j
while nonzeroCol < numCols and A[i,nonzeroCol] == 0:
nonzeroCol += 1

if nonzeroCol == numCols:
j += 1
continue

colSwap(A, j, nonzeroCol)
rowSwap(B, j, nonzeroCol)

pivot = A[i,j]
scaleCol(A, j, 1.0 / pivot)
scaleRow(B, j, 1.0 / pivot)

for otherCol in range(0, numCols):
if otherCol == j:
continue
if A[i, otherCol] != 0:
scaleAmt = -A[i, otherCol]
colCombine(A, otherCol, j, scaleAmt)
rowCombine(B, j, otherCol, -scaleAmt)

i += 1; j+= 1

return A,B


This more or less parallels the standard algorithm for row-reduction (with the caveat that all the indices are swapped to do column-reduction). The only somewhat confusing line is the call to rowCombine, which explicitly realizes the corresponding row operation as the inverse of the performed column operation. Note that for row operations, the correspondence between operations on the basis and operations on the rows is not as direct as it is for columns. What’s given above is the true correspondence. Writing down lots of examples will reveal why, and we leave that as an exercise to the reader.

Then the actual algorithm to compute homology is just a matter of counting pivots. Here are two pivot-counting functions in a typical numpy fashion:

def numPivotCols(A):
z = numpy.zeros(A.shape[0])
return [numpy.all(A[:, j] == z) for j in range(A.shape[1])].count(False)

def numPivotRows(A):
z = numpy.zeros(A.shape[1])
return [numpy.all(A[i, :] == z) for i in range(A.shape[0])].count(False)


And the final function is just:

def bettiNumber(d_k, d_kplus1):
A, B = numpy.copy(d_k), numpy.copy(d_kplus1)
simultaneousReduce(A, B)

dimKChains = A.shape[1]
kernelDim = dimKChains - numPivotCols(A)
imageDim = numPivotRows(B)

return kernelDim - imageDim


And there we have it! We’ve finally tackled the beast, and written a program to compute algebraic features of a topological space.

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex using by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis.

Until then!

# Probability Theory — A Primer

It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to  future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

## Finite Probability Spaces

We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each event happening. For example, a set $\left \{ 0,1 \right \}$ might represent heads and tails outcomes of a coin flip, while the function assigns a probability of one half (or some other numbers) to the outcomes. As usual, this is just intuition and not rigorous mathematics. And so the following definition will lay out the necessary condition for this probability to make sense.

Definition: A finite set $\Omega$ equipped with a function $f: \Omega \to [0,1]$ is a probability space if the function $f$ satisfies the property

$\displaystyle \sum_{\omega \in \Omega} f(\omega) = 1$

That is, the sum of all the values of $f$ must be 1.

Sometimes the set $\Omega$ is called the sample space, and the act of choosing an element of $\Omega$ according to the probabilities given by $f$ is called drawing an example. The function $f$ is usually called the probability mass function. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows. Because we don’t really care about the probability of a single outcome as much as we do the probability of an event.

Definition: An event $E \subset \Omega$ is a subset of a sample space.

For instance, suppose our probability space is $\Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset $\left \{ 2, 4, 6 \right \}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Of course, we want to extend the probability mass function $f$ (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a probability measure $\textup{P}: 2^\Omega \to \mathbb{R}$, where $2^\Omega$ denotes the set of all subsets of $\Omega$. The example of a die roll guides our intuition: the probability of any event should be the sum of the probabilities of the outcomes contained in it. i.e. we define

$\displaystyle \textup{P}(E) = \sum_{e \in E} f(e)$

where by convention the empty sum has value zero. Note that the function $\textup{P}$ is often denoted $\textup{Pr}$.

So for example, the coin flip experiment can’t have zero probability for both of the two outcomes 0 and 1; the sum of the probabilities of all outcomes must sum to 1. More coherently: $\textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1$ by the defining property of a probability space. And so if there are only two outcomes of the experiment, then they must have probabilities $p$ and $1-p$ for some $p$. Such a probability space is often called a Bernoulli trial.

Now that the function $\textup{P}$ is defined on all events, we can simplify our notation considerably. Because the probability mass function $f$ uniquely determines $\textup{P}$ and because $\textup{P}$ contains all information about $f$ in it ($\textup{P}(\left \{ \omega \right \}) = f(\omega)$), we may speak of $\textup{P}$ as the probability measure of $\Omega$, and leave $f$ out of the picture. Of course, when we define a probability measure, we will allow ourselves to just define the probability mass function and the definition of $\textup{P}$ is understood as above.

There are some other quick properties we can state or prove about probability measures: $\textup{P}(\left \{ \right \}) = 0$ by convention, if $E, F$ are disjoint then $\textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F)$, and if $E \subset F \subset \Omega$ then $\textup{P}(E) \leq \textup{P}(F)$. The proofs of these facts are trivial, but a good exercise for the uncomfortable reader to work out.

## Random Variables

The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

$\displaystyle \Omega = \left \{ (1,1), (1,2), (1,3), \dots, (6,4), (6,5), (6,6) \right \}$

Where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the sum of the two dice than the actual numbers on the dice. In fact, we are really more interested in the payoff determined by our roll.

Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of $\Omega$ because all of the randomness inherent in the game comes from the generation of an output in $\Omega$ (otherwise we would define a different probability space to begin with).

And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.

Definition: random variable $X$ is a real-valued function on the sample space $\Omega \to \mathbb{R}$.

So for example the random variable for the sum of the two dice would be $X(a,b) = a+b$. We will slowly phase out the function notation as we go, reverting to it when we need to avoid ambiguity.

We can further define the set of all random variables $\textup{RV}(\Omega)$. It is important to note that this forms a vector space. For those readers unfamiliar with linear algebra, the salient fact is that we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if $X, Y$ are two random variables, so is $aX + bY$ for real numbers $a, b$. This function operates linearly, in the sense that its value is $(aX + bY)(\omega) = aX(\omega) + bY(\omega)$. We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.

Of course, there are plenty of other things one can do to functions. For example, $XY$ is the product of two random variables (defined by $XY(\omega) = X(\omega)Y(\omega)$) and one can imagine such awkward constructions as $X/Y$ or $X^Y$. We will see in a bit why it these last two aren’t often used (it is difficult to say anything about them).

The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event $E$, we can define a random variable which is 0 or 1 depending on whether the input is a member of $E$. That is,

Definition: An indicator random variable $1_E$ is defined by setting $1_E(\omega) = 1$ when $\omega \in E$ and 0 otherwise. A common abuse of notation for singleton sets is to denote $1_{\left \{ \omega \right \} }$ by $1_\omega$.

This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can either get a six, a five, or a four on the first die, and then the second die must match it to add to ten.

The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, probabilistic computations of linear combinations of random variables can be computed by finding the values of the simpler pieces. We can’t yet make that rigorous though, because we don’t yet know what it means to speak of the probability of a random variable’s outcome.

Definition: Denote by $\left \{ X = k \right \}$ the set of outcomes $\omega \in \Omega$ for which $X(\omega) = k$. With the function notation, $\left \{ X = k \right \} = X^{-1}(k)$.

This definition extends to constructing ranges of outcomes of a random variable. i.e., we can define $\left \{ X < 5 \right \}$ or $\left \{ X \textup{ is even} \right \}$ just as we would naively construct sets. It works in general for any subset of $S \subset \mathbb{R}$. The notation is $\left \{ X \in S \right \} = X^{-1}(S)$, and we will also call these sets events. The notation becomes useful and elegant when we combine it with the probability measure $\textup{P}$. That is, we want to write things like $\textup{P}(X \textup{ is even})$ and read it in our head “the probability that $X$ is even”.

This is made rigorous by simply setting

$\displaystyle \textup{P}(X \in S) = \sum_{\omega \in X^{-1}(S)} \textup{P}(\omega)$

In words, it is just the sum of the probabilities that individual outcomes will have a value under $X$ that lands in $S$. We will also use for $\textup{P}(\left \{ X \in S \right \} \cap \left \{ Y \in T \right \})$ the shorthand notation $\textup{P}(X \in S, Y \in T)$ or $\textup{P}(X \in S \textup{ and } Y \in T)$.

Often times $\left \{ X \in S \right \}$ will be smaller than $\Omega$ itself, even if $S$ is large. For instance, let the probability space be the set of possible lottery numbers for one week’s draw of the lottery (with uniform probabilities), let $X$ be the profit function. Then $\textup{P}(X > 0)$ is very small indeed.

We should also note that because our probability spaces are finite, the image of the random variable $\textup{im}(X)$ is a finite subset of real numbers. In other words, the set of all events of the form $\left \{ X = x_i \right \}$ where $x_i \in \textup{im}(X)$ form a partition of $\Omega$. As such, we get the following immediate identity:

$\displaystyle 1 = \sum_{x_i \in \textup{im} (X)} P(X = x_i)$

The set of such events is called the probability distribution of the random variable $X$.

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events $E,F \subset \Omega$ are independent if the probability of both $E, F$ occurring is the product of the probabilities of each event occurring. That is, $\textup{P}(E \cap F) = \textup{P}(E)\textup{P}(F)$. There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way. One should note that this is distinct from $E,F$ being disjoint as sets, because there may be a zero-probability outcome in both sets.

The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, $X,Y$ are independent random variables if

$\displaystyle \textup{P}(X = x, Y = y) = \textup{P}(X=x)\textup{P}(Y=y)$

for all $x,y \in \mathbb{R}$.

## Expectation

We now turn to notions of expected value and variation, which form the cornerstone of the applications of probability theory.

Definition: Let $X$ be a random variable on a finite probability space $\Omega$. The expected value of $X$, denoted $\textup{E}(X)$, is the quantity

$\displaystyle \textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) \textup{P}(\omega)$

Note that if we label the image of $X$ by $x_1, \dots, x_n$ then this is equivalent to

$\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)$

The most important fact about expectation is that it is a linear functional on random variables. That is,

Theorem: If $X,Y$ are random variables on a finite probability space and $a,b \in \mathbb{R}$, then

$\displaystyle \textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)$

Proof. The only real step in the proof is to note that for each possible pair of values $x, y$ in the images of $X,Y$ resp., the events $E_{x,y} = \left \{ X = x, Y=y \right \}$ form a partition of the sample space $\Omega$. That is, because $aX + bY$ has a constant value on $E_{x,y}$, the second definition of expected value gives

$\displaystyle \textup{E}(aX + bY) = \sum_{x \in \textup{im} (X)} \sum_{y \in \textup{im} (Y)} (ax + by) \textup{P}(X = x, Y = y)$

and a little bit of algebraic elbow grease reduces this expression to $a\textup{E}(X) + b\textup{E}(Y)$. We leave this as an exercise to the reader, with the additional note that the sum $\sum_{y \in \textup{im}(Y)} \textup{P}(X = x, Y = y)$ is identical to $\textup{P}(X = x)$. $\square$

If we additionally know that $X,Y$ are independent random variables, then the same technique used above allows one to say something about the expectation of the product $\textup{E}(XY)$ (again by definition, $XY(\omega) = X(\omega)Y(\omega)$). In this case $\textup{E}(XY) = \textup{E}(X)\textup{E}(Y)$. We leave the proof as an exercise to the reader.

Now intuitively the expected value of a random variable is the “center” of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that $\textup{E}(X) \in \textup{im}(X)$. For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

$\displaystyle 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}$

But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).

Let’s see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable $X$ into simpler (usually indicator) random variables, the computation of $\textup{E}(X)$ becomes trivial.

tournament  $T$ is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a Hamiltonian path, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of numbers $(v_1, \dots, v_n)$, where we visit vertex $v_i$ at stage $i$ of the traversal. The condition for this to be a valid Hamiltonian path is that $(v_i, v_{i+1})$ is an edge in $T$ for all $i$.

Now if we construct a tournament on $n$ vertices by choosing the direction of each edges independently with equal probability 1/2, then we have a very nice probability space and we can ask what is the expected number of Hamiltonian paths. That is, $X$ is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in $\textup{E}(X)$.

To compute this, simply note that we can break $X = \sum_p X_p$, where $p$ ranges over all possible lists of the vertices. Then $\textup{E}(X) = \sum_p \textup{E}(X_p)$, and it suffices to compute the number of possible paths and the expected value of any given path. It isn’t hard to see the number of paths is $n!$ as this is the number of possible lists of $n$ items. Because each edge direction is chosen with probability 1/2 and they are all chosen independently of one another, the probability that any given path forms a Hamiltonian path depends on whether each edge was chosen with the correct orientation. That’s just

$\textup{P}(\textup{first edge and second edge and } \dots \textup{ and last edge})$

which by independence is

$\displaystyle \prod_{i = 1}^n \textup{P}(i^\textup{th} \textup{ edge is chosen}) = \frac{1}{2^{n-1}}$

That is, the expected number of Hamiltonian paths is $n!2^{-(n-1)}$.

## Variance and Covariance

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable $X$ are throughout the real line.

Definition: The variance of a random variable $X$ is the quantity $\textup{E}((X - \textup{E}(X))^2)$.

That is, $\textup{E}(X)$ is a number, and so $X - \textup{E}(X)$ is the random variable defined by $(X - \textup{E}(X))(\omega) = X(\omega) - \textup{E}(X)$. It is the expectation of the square of the deviation of $X$ from its expected value.

One often denotes the variance by $\textup{Var}(X)$ or $\sigma^2$. The square is for silly reasons: the standard deviation, denoted $\sigma$ and equivalent to $\sqrt{\textup{Var}(X)}$ has the same “units” as the outcomes of the experiment and so it’s preferred as the “base” frame of reference by some. We won’t bother with such physical nonsense here, but we will have to deal with the notation.

The variance operator has a few properties that make it quite different from expectation, but nonetheless fall our directly from the definition. We encourage the reader to prove a few:

• $\textup{Var}(X) = \textup{E}(X^2) - \textup{E}(X)^2$.
• $\textup{Var}(aX) = a^2\textup{Var}(X)$.
• When $X,Y$ are independent then variance is additive: $\textup{Var}(X+Y) = \textup{Var}(X) + \textup{Var}(Y)$.
• Variance is invariant under constant additives: $\textup{Var}(X+c) = \textup{Var}(X)$.

In addition, the quantity $\textup{Var}(aX + bY)$ is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables. The formal name for this is covariance.

Definition: Let $X,Y$ be random variables. The covariance of $X$ and $Y$, denoted $\textup{Cov}(X,Y)$, is the quantity $\textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y)))$.

Note the similarities between the variance definition and this one: if $X=Y$ then the two quantities coincide. That is, $\textup{Cov}(X,X) = \textup{Var}(X)$.

There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.

Theorem: Let $X,Y$ be random variables with variances $\sigma_X^2, \sigma_Y^2$. Then their covariance is at most the product of the standard deviations in magnitude:

$|\textup{Cov}(X,Y)| \leq \sigma_X \sigma_Y$

Proof. Take any two non-constant random variables $X$ and $Y$ (we will replace these later with $X - \textup{E}(X), Y - \textup{E}(Y)$). Construct a new random variable $(tX + Y)^2$ where $t$ is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. That is, $\textup{E}((tX + Y)^2)$. Expanding this and using linearity gives

$\displaystyle f(t) = t^2 \textup{E}(X^2) + 2t \textup{E}(XY) + \textup{E}(Y^2) \geq 0$

This is a quadratic function of a single variable $t$ which is nonnegative. From elementary algebra this means the discriminant is at most zero. i.e.

$\displaystyle 4 \textup{E}(XY)^2 - 4 \textup{E}(X^2) \textup{E}(Y^2) \leq 0$

and so dividing by 4 and replacing $X,Y$ with $X - \textup{E}(X), Y - \textup{E}(Y)$, resp., gives

$\textup{Cov}(X,Y)^2 \leq \sigma_X^2 \sigma_Y^2$

and the result follows. $\square$

Note that equality holds in the discriminant formula precisely when $Y = -tX$ (the discriminant is zero), and after the replacement this translates to $Y - \textup{E}(Y) = -t(X - \textup{E}(X))$ for some fixed value of $t$. In other words, for some real numbers $a,b$ we have $Y = aX + b$.

This has important consequences even in English: the covariance is maximized when $Y$ is a linear function of $X$, and otherwise is bounded from above and below. By dividing both sides of the inequality by $\sigma_X \sigma_Y$ we get the following definition:

Definition: The Pearson correlation coefficient of two random variables $X,Y$ is defined by

$\displaystyle r= \frac{\textup{Cov}(X,Y)}{\sigma_X \sigma_Y}$

If $r$ is close to 1, we call $X$ and $Y$ positively correlated. If $r$ is close to -1 we call them negatively correlated, and if $r$ is close to zero we call them uncorrelated.

The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

The  horizontal axis plots a sample of values of the random variable $X$ and the vertical plots a sample of $Y$. The linear correspondence is clear. Of course, all of this must be taken with a grain of salt: this correlation coefficient is only appropriate for analyzing random variables which have a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them.

Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:

Various examples of the Pearson correlation coefficient, credit Wikipedia.

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer that the correlation coefficient is to 1, the better a linear predictor will perform in “guessing” values of $Y$ given values of $X$ (same goes for -1, but the predictor has negative slope).

But this strays a bit far from our original point: we still want to find a formula for $\textup{Var}(aX + bY)$. Expanding the definition, it is not hard to see that this amounts to the following proposition:

Proposition: The variance operator satisfies

$\displaystyle \textup{Var}(aX+bY) = a^2\textup{Var}(X) + b^2\textup{Var}(Y) + 2ab \textup{Cov}(X,Y)$

And using induction we get a general formula:

$\displaystyle \textup{Var} \left ( \sum_{i=1}^n a_i X_i \right ) = \sum_{i=1}^n \sum_{j = 1}^n a_i a_j \textup{Cov}(X_i,X_j)$

Note that in the general sum, we get a bunch of terms $\textup{Cov}(X_i,X_i) = \textup{Var}(X_i)$.

Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.

Definition: The covariance matrix of a collection of random variables $X_1, \dots, X_n$ is the matrix whose $(i,j)$ entry is $\textup{Cov}(X_i,X_j)$.

As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and nonnegative, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it. In more direct words: we can form a new collection of random variables $Y_j$ (which are linear combinations of the original variables $X_i$) such that the covariance of distinct pairs $Y_j, Y_k$ are all zero. In one sense, this is the “best perspective” with which to analyze the random variables. We gave a general algorithm to do this in our program gallery, and the technique is called principal component analysis.

## Next Up

So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.

Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.

Until then!

# Neural Networks and the Backpropagation Algorithm

## Neurons, as an Extension of the Perceptron Model

In a previous post in this series we investigated the Perceptron model for determining whether some data was linearly separable. That is, given a data set where the points are labelled in one of two classes, we were interested in finding a hyperplane that separates the classes. In the case of points in the plane, this just reduced to finding lines which separated the points like this:

A hyperplane (the slanted line) separating the blue data points (class -1) from the red data points (class +1)

As we saw last time, the Perceptron model is particularly bad at learning data. More accurately, the Perceptron model is very good at learning linearly separable data, but most kinds of data just happen to more complicated. Even with those disappointing results, there are two interesting generalizations of the Perceptron model that have exploded into huge fields of research. The two generalizations can roughly be described as

• Use a number of Perceptron models in some sort of conjunction.
• Use the Perceptron model on some non-linear transformation of the data.

The point of both of these is to introduce some sort of non-linearity into the decision boundary. The first generalization leads to the neural network, and the second leads to the support vector machine. Obviously this post will focus entirely on the first idea, but we plan to cover support vector machines in the near future. Recall further that the separating hyperplane was itself defined by a single vector (a normal vector to the plane) $\mathbf{w} = (w_1, \dots, w_n)$. To “decide” what class the new point $\mathbf{x}$ is in, we check the sign of an inner product with an added constant shifting term:

$\displaystyle f(\mathbf{x}) = \textup{sign}(\left \langle \mathbf{w}, \mathbf{x} \right \rangle + b) = \textup{sign} \left ( b + \sum_{i=1}^n w_i x_i \right )$

The class of a point is just the value of this function, and as we saw with the Perceptron this corresponds geometrically to which side of the hyperplane the point lies on. Now we can design a “neuron” based on this same formula. We consider a point $\mathbf{x} = (x_1, \dots, x_n)$ to be an input to the neuron, and the output will be the sign of the above sum for some coefficients $w_i$. In picture form it would look like this:

It is quite useful to literally think of this picture as a directed graph (see this blog’s gentle introduction to graph theory if you don’t know what a graph is). The edges corresponding to the coordinates of the input vector $\mathbf{x}$ have weights $w_i$, and the output edge corresponds to the sign of the linear combination. If we further enforce the inputs to be binary (that is, $\mathbf{x} \in \left \{ 0, 1 \right \}^n$), then we get a very nice biological interpretation of the system. If we think of the unit as a neuron, then the input edges correspond to nerve impulses, which can either be on or off (identically to an electrical circuit: there is high current or low current). The weights $w_i$ correspond to the strength of the neuronal connection. The neuron transmits or does not transmit a pulse as output depending on whether the inputs are strong enough.

We’re not quite done, though, because in this interpretation the output of the neuron will either fire or not fire. However, neurons in real life are somewhat more complicated. Specifically, neurons do not fire signals according to a discontinuous function. In addition, we want to use the usual tools from classical calculus to analyze our neuron, but we cannot do that unless the activation function is differentiable, and a prerequisite for that is to be continuous. In plain words, we need to allow our neurons to be able to “partially fire.” We need a small range at which the neuron ramps up quickly from not firing to firing, so that the activation function as a whole is differentiable.

This raises the obvious question: what function should we pick? It turns out that there are a number of possible functions we could use, ranging from polynomial to exponential in nature. But before we pick one in particular, let’s outline the qualities we want such a function to have.

Definition: A function $\sigma: \mathbb{R} \to [0,1]$ is an activation function if it satisfies the following properties:

• It has a first derivative $\sigma'$.
• $\sigma$ is non-decreasing, that is $\sigma'(x) \geq 0$ for all $x$
• $\sigma$ has horizontal asymptotes at both 0 and 1 (and as a consequence, $\lim_{x \to \infty} \sigma(x) = 1$, and $\lim_{x \to -\infty} \sigma(x) = 0$).
• $\sigma$ and $\sigma'$ are both computable functions.

With appropriate shifting and normalizing, there are a few reasonable (and time-tested) activation functions. The two main ones are the hyperbolic tangent $\tanh(x)$ and the sigmoid curve $1/ (1+e^{-t})$. They both look (more or less) like this:

A sigmoid function (source: Wikipedia)

And it is easy to see visually that this is what we want.

Withholding any discussion of why one would pick one specific activation over another, there is one more small detail. In the Perceptron model we allowed a “bias” $b$ which translated the separating hyperplane so that it need not pass through the origin, hence allowing a the set of all pairs $(\mathbf{w}, b)$ to represent every possible hyperplane. Perhaps the simplest way to incorporate the bias into this model is to add another input $x_0$ which is fixed to 1. Then we add a weight $w_0$, and it is easy to see that the constant $b$ can just be replaced with the weight $w_0$. In other words, the inner product $b \cdot 1 + \left \langle \mathbf{w}, \mathbf{x} \right \rangle$ is the same as the inner product of two new vectors $\mathbf{w}', \mathbf{x}'$ where we set $w'_0 = b$ and $x'_0 = 1$ and $w'_i = w_i, x'_i = x_i$ for all other $i$.

The updated picture is now:

Now the specification of a single neuron is complete:

Definition: neuron $N$ is a pair $(W, \sigma)$, where $W$ is a list of weights $(w_0, w_1, \dots, w_k)$, and $\sigma$ is an activation function. The impulse function of a neuron $N$, which we will denote $f_N: \left \{ 0,1 \right \}^{k+1} \to [0,1]$, is defined as

$\displaystyle f_N(\mathbf{x}) = \sigma(\left \langle \mathbf{w}, \mathbf{x} \right \rangle) = \sigma \left (\sum_{i=0}^k w_i x_i \right )$

We call $w_0$ the bias weight, and by convention the first input coordinate $x_0$ is fixed to 1 for all inputs $\mathbf{x}$.

(Since we always fix the first input to 1, $f_N$ is technically a function $\left \{ 0,1 \right \}^k \to \left \{ 0,1 \right \}$, but the reader will forgive us for blurring these details.)

## Combining Neurons into a Network

The question of how to “train” a single neuron is just a reformulation of the Perceptron problem. If we have a data set $\mathbf{x}_i$ with class labels $y_i = \pm 1$, we want to update the weights of a neuron so that the outputs agree with their class labels; that is, $f_N(\mathbf{x}_i) = y_i$ for all $i$. And we saw in the Perceptron how to do this: it’s fast and efficient, given that the data are linearly separable. And in fact training a neuron in this model (accounting for the new activation function) will give us identical decision functions as in the Perceptron model. All we have done so far is change our perspective from geometry to biology. But as we mentioned originally, we want to form a mathematical army of neurons, all working together to form a more powerful decision function.

The question is what form should this army take? Since we already thought of a single neuron as a graph, let’s generalize this. Instead of having a bunch of “input” vertices, a single “output” vertex, and one neuron doing the computation, we now have the same set of input vertices, the same output vertex, but now a number of intermediate neurons connected arbitrarily to each other. That is, the edges that are outputs of some neurons are connected to the inputs of other neurons, and the very last neuron’s output is the final output. We call such a construction a neural network.

For example, the following graph gives a neural network with 5 neurons

To compute the output of any neuron $N$, we need to compute the values of the impulse functions for each neuron whose output feeds into $N$. This in turn requires computing the values of the impulse functions for each of the inputs to those neurons, and so on. If we imagine electric current flowing through such a structure, we can view it as a kind of network flow problem, which is where the name “neural networks” comes from. This structure is also called a dependency graph, and (in the parlance of graph theory) a directed acyclic graph. Though nothing technical about these structures will show up in this particular post, we plan in the future to provide primers on their basic theories.

We remark that we view the above picture as a directed graph with the directed edges going upwards. And as in the picture, the incidence structure (which pairs of neurons are connected or not connected) of the graph is totally arbitrary, as long as it has no cycles. Note that this is in contrast to the classical idea of a neural network as “layered” with one or more intermediate layers, such that all neurons in neighboring layers are completely connected to one another.  Hence we will take a slightly more general approach in this post.

Now the question of training a network of interconnected neurons is significantly more complicated than that of training a single neuron. The algorithm to do so is called backpropagation, because we will check to see if the final output is an error, and if it is we will propagate the error backward through the network, updating weights as we go. But before we get there, let’s explore some motivation for the algorithm.

## The Backpropagation Algorithm – Single Neuron

Let us return to the case of a single neuron $N$ with weights $\mathbf{w} = (w_0, \dots , w_k)$ and an input $\mathbf{x} = (x_0, \dots, x_k)$. And momentarily, let us remove the activation function $\sigma$ from the picture (so that $f_N$ just computes the summation part). In this simplified world it is easy to take a given set of training inputs $\mathbf{x}_j$ with labels $y_j \in \left \{ 0,1 \right \}$ and compute the error of our neuron $N$ on the entire training set. A standard mathematical way to compute error is by sum of the squared deviations of our neuron’s output from the actual label.

$\displaystyle E(\mathbf{w}) = \sum_j (y_j - f_N(\mathbf{x}_j))^2$

The important part is that $E$ is a function just of the weights $w_i$. In other words, the set of weights completely specifies the behavior of a single neuron.

Enter calculus. Any time we have a multivariate function (here, each of the weights $w_i$ is a variable), then we can speak of its minima and maxima. In our case we strive to find a global minimum of the error function $E$, for then we would have learned our target classification function as well as possible. Indeed, to improve upon our current set of weights, we can use the standard gradient-descent algorithm. We have discussed versions of the gradient-descent algorithm on this blog before, as in our posts on decrypting substitution ciphers with n-grams and finding optimal stackings in Texas Hold ‘Em. We didn’t work with calculus there because the spaces involved were all discrete. But here we will eventually extend this error function to allow the inputs $x_i$ to be real-valued instead of binary, and so we need the full power of calculus. Luckily for the uninformed reader, the concept of gradient descent is the same in both cases. Since $E$ gives us a real number for each possible neuron (each choice of weights), we can take our current neuron and make it better it by changing the weights slightly, and ensuring our change gives us a smaller value under $E$. If we cannot ensure this, then we have reached a minimum.

Here are the details. For convenience we add a factor of $1/2$ to $E$ and drop the subscript $N$ from $f_N$. Since minimizing $E$ is the same as minimizing $\frac{1}{2}E$, this changes nothing about the minima of the function. That is, we will henceforth work with

$\displaystyle E(\mathbf{w}) = \frac{1}{2} \sum_j (y_j - f(\mathbf{x}_j))^2$

Then we compute the gradient $\nabla E$ of $E$. For fixed values of the variables $w_i$ (our current set of weights) this is a vector in $\mathbb{R}^n$, and as we know from calculus it points in the direction of steepest ascent of the function $E$. That is, if we subtract some sufficiently small multiple of this vector from our current weight vector, we will be closer to a minimum of the error function than we were before. If we were to add, we’d go toward a maximum.

Note that $E$ is never negative, and so it will have a global minimum value at or near 0 (if it is possible for the neuron to represent the target function perfectly, it will be zero). That is, our update rule should be

$\displaystyle \mathbf{w}_\textup{current} = \mathbf{w}_\textup{current} - \eta \nabla E(\mathbf{w}_\textup{current})$

where $\eta$ is some fixed parameter between 0 and 1 that represent the “learning rate.” We will not mention $\eta$ too much except to say that as long as it is sufficiently small and we allow ourselves enough time to learn, we are guaranteed to get a good approximation of some local minimum (though it might not be a global one).

With this update rule it suffices to compute $\nabla E$ explicitly.

$\displaystyle \nabla E = \left ( \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_n} \right )$

In each partial $\partial E / \partial w_i$ we consider each other variable beside $w_i$ to be constant, and combining this with the chain rule gives

$\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - f(\mathbf{x}_j)) \frac{\partial f}{\partial w_i}$

Since in the summation formula for $f$ the $w_i$ variable only shows up in the product $w_i x_{j,i}$ (where $x_{j,i}$ is the $i$-th term of the vector $\mathbf{x}_j$), the last part expands as $x_{j,i}$. i.e. we have

$\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - f(\mathbf{x}_j)) x_{j,i}$

Noting the negatives cancelling, this makes our update rule just

$\displaystyle w_i = w_i + \eta \sum_j (y_j - f(\mathbf{x}_j))x_{j,i}$

There is an alternative form of an update rule that allows one to update the weights after each individual input is tested (as opposed to checking the outputs of the entire training set). This is called the stochastic update rule, and is given identically as above but without summing over all $j$:

$\mathbf{w} = \mathbf{w} + \eta (y_j - f(\mathbf{x}_j)) \mathbf{x}_j$

For our purposes in this post, the stochastic and non-stochastic update rules will give identical results.

Adding in the activation function $\sigma$ is not hard, but we will choose our $\sigma$ so that it has particularly nice computability properties. Specifically, we will pick $\sigma = 1/(1+e^{-x})$ the sigmoid function, because it satisfies the identity

$\sigma'(x) = \sigma(x)(1-\sigma(x))$

So instead of $\partial f$ in the formula above we need $\partial (\sigma \circ f) / \partial w_i$, and this requires the chain rule once again:

$\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - \sigma(f(\mathbf{x}_j))) \sigma'(f(\mathbf{x}_j))x_{j,i}$

And using the identity for $\sigma'$ gives us

$\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - \sigma(f(\mathbf{x}_j))) \sigma(f(\mathbf{x}_j))(1-\sigma(f(\mathbf{x}_j)))x_{j,i}$

And a similar update rule as before. If we denote by $o_j$ the output value $\sigma(f(\mathbf{x}_j))$, then the stochastic version of this update rule is

$\mathbf{w} = \mathbf{w} + \eta o_j (1-o_j)(y_j - o_j)\mathbf{x}_j$

Now that we have motivated an update rule for a single neuron, let’s see how to apply this to an entire network of neurons.

## The Backpropagation Algorithm – Entire Network

There is a glaring problem in training a neural network using the update rule above. We don’t know what the “expected” output of any of the internal edges in the graph are. In order to compute the error we need to know what the correct output should be, but we don’t immediately have this information.

We don’t know the error value for a non-output node in the network.

In the picture above, we know the expected value of the edge leaving the node $N_2$, but not that of $N_1$. In order to compute the error for $N_1$, we need to derive some kind of error value for nodes in the middle of the network.

It seems reasonable that the error for $N_1$ should depend on the errors of the nodes for which $N_1$ provides an input. That is, in the following picture the error should come from all of the neurons $M_1, \dots M_n$.

In particular, one possible error value for a particular input to the entire network $\mathbf{x}$ would be a weighted sum over the errors of $M_i$, where the weights are the weights of the edges from $N_1$ to $M_i$. In other words, if $N_1$ has little effect on the output of one particular $M_i$, it shouldn’t assume too much responsibility for that error. That is, using the above picture, the error for $N_1$ (in terms of the input weights $v_i$) is

$\displaystyle \sum_i w_i E_{M_i}$

where $E_{M_i}$ is the error computed for the node $M_i$.

It turns out that there is a nice theoretical justification for using this quantity as well. In particular, if we think of the entire network as a single function, we can imagine the error $E(\mathbf{w})$ as being a very convoluted function of all the weights in the network. But no matter how confusing the function may be to write down, we know that it only involves addition, multiplication, and composition of differentiable functions. So if we want to know how to update the error with respect to a weight that is hidden very far down in the network, in theory it just requires enough applications of the chain rule to find it.

To see this, let’s say we have a nodes $N_{1,k}$ connected forward to nodes $N_{2,j}$ connected forward to nodes $N_{3,i}$, such that the weights $w_{k,j}$ represent weights going from $N_{1,k} \to N_{2,j}$, and weights $w_{j,i}$ are $N_{2,j} \to N_{3,i}$.

If we want to know the partial derivative of $E$ with respect to the deeply nested weight $w_{k,j}$, then we can just compute it:

$\displaystyle \frac{\partial E}{\partial w_{k,j}} = -\sum_i (y_i - o_i)\frac{\partial o_i}{\partial w_{k,j}}$

where $o_i = f_{N_{1,j}}(\dots) = \sigma(f(\dots))$ represents the value of the impulse function at each of the output neurons, in terms of a bunch of crazy summations we omit for clarity.

But after applying the chain rule, the partial of the inner summation $f = \sum_j w_{j,i}x_k$ only depends on $w_{k,j}$ via the coefficient $w_{j,i}$. i.e., the weight $w_{k,j}$ only affects node $N_{3,i}$ by the output of $N_{2,j}$ passing through the edge labeled $w_{j,i}$. So we get a sum

$\displaystyle \frac{\partial E}{\partial w_{k,j}} = -\sum_i (y_i - o_i)\sigma'(f(\dots)) w_{j,i}$

That is, it’s simply a weighted sum of the final errors $y_i - o_i$ by the right weights. The stuff inside the $f(\dots)$ is simply the output of that node, which is again a sum over its inputs. In stochastic form, this makes our update rule (for the weights of $N_j$) just

$\displaystyle \mathbf{w} = \mathbf{w} + \eta o_j(1-o_j) \left ( \sum_i w_{j,i} (y_i - o_i) \right ) \mathbf{z}$

where by $\mathbf{z}$ we denote the vector of inputs to the neuron in question (these may be the original input $\mathbf{x}$ if this neuron is the first in the network and all of the inputs are connected to it, or it may be the outputs of other neurons feeding into it).

The argument we gave only really holds for a network where there are only two edges from the input to the output.  But the reader who has mastered the art of juggling notation may easily generalize this via induction to prove it in general. This really is a sensible weight update for any neuron in the network.

And now that we have established our update rule, the backpropagation algorithm for training a neural network becomes relatively straightforward. Start by initializing the weights in the network at random. Evaluate an input $\mathbf{x}$ by feeding it forward through the network and recording at each internal node the output value $o_j$, and call the final output $o$. Then compute the error for that output value, propagate the error back to each of the nodes feeding into the output node, and update the weights for the output node using our update rule. Repeat this error propagation followed by a weight update for each of the nodes feeding into the output node in the same way, compute the updates for the nodes feeding into those nodes, and so on until the weights of the entire network are updated. Then repeat with a new input $\mathbf{x}$.

One minor issue is when to stop. Specifically, it won’t be the case that we only need to evaluate each input $\mathbf{x}$ exactly once. Depending on how the learning parameter $\eta$ is set, we may need to evaluate the entire training set many times! Indeed, we should only stop when the gradient for all of our examples is small, or we have run it for long enough to exhaust our patience. For simplicity we will ignore checking for a small gradient, and we will simply fix a number of iterations. We leave the gradient check as an exercise to the reader.

Then the result is a trained network, which we can further use to evaluate the labels for unknown inputs.

## Python Implementation

We now turn to implementing a neural network. As usual, all of the source code used in this post (and then some) is available on this blog’s Google code page.

The first thing we need to implement all of this is a data structure for a network. That is, we need to represent nodes and edges connecting nodes. Moreover each edge needs to have an associated value, and each node needs to store multiple values (the error that has been propagated back, and the output produced at that node). So we will represent this via two classes:

class Node:
def __init__(self):
self.lastOutput = None
self.lastInput = None
self.error = None
self.outgoingEdges = []
self.incomingEdges = []

class Edge:
def __init__(self, source, target):
self.weight = random.uniform(0,1)
self.source = source
self.target = target

# attach the edges to its nodes
source.outgoingEdges.append(self)
target.incomingEdges.append(self)


Then a neural network is represented by the set of input and output nodes.

class Network:
def __init__(self):
self.inputNodes = []
self.outputNode = None


In particular, each Node needs to know about its most recent output, input, and error in order to update its weights. So any time we evaluate some input, we need to store these values in the Node. We will progressively fill in these classes with the methods needed to evaluate and train the network on data. But before we can do anything, we need to be able to distinguish between an input node and a node internal to the network. For this we create a subclass of Node called InputNode:

class InputNode(Node):
def __init__(self, index):
Node.__init__(self)
self.index = index; # the index of the input vector corresponding to this node


And now here is a function which evaluates a given input when provided a neural network:

class Network:
...

def evaluate(self, inputVector):
return self.outputNode.evaluate(inputVector)

class Node:
...

def evaluate(self, inputVector):
self.lastInput = []
weightedSum = 0

for e in self.incomingEdges:
theInput = e.source.evaluate(inputVector)
self.lastInput.append(theInput)
weightedSum += e.weight * theInput

self.lastOutput = activationFunction(weightedSum)
return self.lastOutput

class InputNode(Node):
...

def evaluate(self, inputVector):
self.lastOutput = inputVector[self.index]
return self.lastOutput


A network calls the evaluate function on its output node, and each node recursively calls evaluate on the sources of each of its incoming edges. An InputNode simply returns the corresponding entry in the inputVector (which requires us to pass the input vector along through the recursive calls). Since our graph structure is arbitrary, we note that some nodes may be “evaluated” more than once per evaluation. As such, we need to store the node’s output for the duration of the evaluation. We also need to store this value for use in training, and so before a call to evaluate we must clear this value. We omit the details here for brevity.

We will use the evaluate() function for training as well as for evaluating unknown inputs. As usual, examples of using these classes (and tests) are available in the full source code on this blog’s Google code page.

In addition, we need to automatically add bias nodes and corresponding edges to the non-input nodes. This results in a new subclass of Node which has a default evaluate() value of 1. Because of the way we organized things, the existence of this class changes nothing about the training algorithm.

class BiasNode(InputNode):
def __init__(self):
Node.__init__(self)

def evaluate(self, inputVector):
return 1.0


We simply add a function to the Node class (which is overridden in the InputNode class) which adds a bias node and edge to every non-input node. The details are trivial; the reader may see them in the full source code.

The training algorithm will come in a loop consisting of three steps: first, evaluate an input example. Second, go through the network updating the error values of each node using backpropagation. Third, go through the network again to update the weights of the edges appropriately. This can be written as a very short function on a Network, which then requires a number of longer functions on the Node classes:

class Network:
...

def propagateError(self, label):
for node in self.inputNodes:
node.getError(label)

def updateWeights(self, learningRate):
for node in self.inputNodes:
node.updateWeights(learningRate)

def train(self, labeledExamples, learningRate=0.9, maxIterations=10000):
while maxIterations > 0:
for example, label in labeledExamples:
output = self.evaluate(example)
self.propagateError(label)
self.updateWeights(learningRate)

maxIterations -= 1


That is, the network class just makes the first call for each of these recursive processes. We then have the following functions on Nodes to implement getError and updateWeights:

class InputNode(Node):
...

def updateWeights(self, learningRate):
for edge in self.outgoingEdges:
edge.target.updateWeights(learningRate)

def getError(self, label):
for edge in self.outgoingEdges:
edge.target.getError(label)

class Node:
...

def getError(self, label):
if self.error is not None:
return self.error

if self.outgoingEdges == []: # this is an output node
self.error = label - self.lastOutput
else:
self.error = sum([edge.weight * edge.target.getError(label)
for edge in self.outgoingEdges])

return self.error

def updateWeights(self, learningRate):
if (self.error is not None and self.lastOutput is not None
and self.lastInput is not None):

for i, edge in enumerate(self.incomingEdges):
edge.weight += (learningRate * self.lastOutput * (1 - self.lastOutput) *
self.error * self.lastInput[i])

for edge in self.outgoingEdges:
edge.target.updateWeights(learningRate)

self.error = None
self.lastInput = None
self.lastOutput = None


These are simply the formulas we derived in the previous sections translated into code. The propagated error is computed as a weighted sum in getError(), and the previous input and output values were saved from the call to evaluate().

## A Sine Curve Example, and Issues

One simple example we can use to illustrate this is actually not a decision problem, per se, but a function estimation problem. In the course of all of this calculus, we implicitly allowed our neural network to output any values between 0 and 1 (indeed, the activation function did this for us). And so we can use a neural network to approximate any function which has values in $[0,1]$. In particular we will try this on

$\displaystyle f(x) = \frac{1}{2}(\sin(x) + 1)$

on the domain $[0,4 \pi ]$.

Our network is simple: we have a single layer of twenty neurons, each of which is connected to a single input neuron and a single output neuron. The learning rate is set to 0.25, the number of iterations is set to a hundred thousand, and the training set is randomly sampled from the domain.

After training (which takes around fifteen seconds), the average error (when tested against a new random sample) is between 0.03 and 0.06. Here is an example of one such output:

An example of a 20-node neural network approximating two periods of a sine function.

This picture hints at an important shortcoming of our algorithm. Note how the neural network’s approximation of the sine function does particularly poorly close to 0 and 1. This is not a coincidence, but rather a side effect of our activation function $\sigma$. In particular, because the sigmoid function achieves the values 0 and 1 only in the limit. That is, they never actually achieve 0 and 1, and in order to get close we require prohibitively large weights (which in turn correspond to rather large values to be fed to the activation function). One potential solution is to modify our sine function slightly more, by scaling it and translating it so that its values lie in $[0.1, 0.9]$, say. We leave this as an exercise to the reader.

As one might expect, the neural network also does better when we test it on a single period instead of two (since the sine function is less “complicated” on a single period). We also constructed a data set of binary numbers whose labels were 1 if the number was even and 0 if the number was odd. A similar layout to the sine example with three internal nodes again gave good results.

The issues arise on larger datasets. One big problem with training a neural network is that it’s near impossible to determine the “correct” structure of the network ahead of time. The success of our sine function example, for instance, depended much more than we anticipated on the number of nodes used. Of course, this also depends on the choice of learning rate and the number of iterations allowed, but the point is the same: the neural network is fraught with arbitrary choices. What’s worse is that it’s just as impossible to tell if your choices are justified. All you have is an empirical number to determine how well your network does on one training set, and inspecting the values of the various weights will tell you nothing in all but the most trivial of examples.

There are a number of researchers who have attempted to alleviate this problem in some way. One prominent example is the Cascade Correlation algorithm, which dynamically builds the network structure depending on the data. Other avenues include dynamically updating the learning rate and using a variety of other activation and error functions, based on information theory and game theory (adding penalties for various undesirable properties). Still other methods involve alternative weight updates based on more advanced optimization techniques (such as the conjugate gradient method). Part of the benefit of the backpropagation algorithm is that the choice of error function is irrelevant, as long as it is differentiable. This gives us a lot of flexibility to customize the neural network for our own application domain.

These sorts of questions are what have caused neural networks to become such a huge field of research in machine learning. As such, this blog post has only given the reader a small taste of what is out there. This is the bread and butter in a world of fine cuisine: it’s proven to be a solid choice, but it leaves a cornucopia of flavors unrealized and untapped.

The future of this machine learning series, however, will deviate from the neural network path. We will instead investigate the other extension of the Perceptron model, the Support Vector Machine. We will also lay down some formal theories of learning, because as of yet we have simply been exploring algorithms without the ability to give guarantees on their performance. In order to do that, we must needs formalize the notion of an algorithm “learning” a task. This is no small feat, and nobody has quite agreed on the best formalization. We will nevertheless explore these frameworks, and see what kinds of theorems we can prove in them.

Until then!

# K-Nearest-Neighbors and Handwritten Digit Classification

## The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

The Axiom of Neighborliness: Let $(X, d)$ be a metric space and let $S \subset X$ be a finite set whose elements are classified by some function $f : S \to \left \{ 1, 2, \dots, m \right \}$. We say that $S$ satisfies the axiom of neighborliness if for every point $x \in S$, if $y$ is the closest point to $x$, then $f(x) = f(y)$. That is, $y$ shares the same class as $x$ if $y$ is the nearest neighbor to $x$.

For a more in-depth discussion of metrics, the reader should refer to this blog’s primer on the topic. For the purpose of this post and all foreseeable posts, $X$ will always be $\mathbb{R}^n$ for some $n$, while the metric $d$ will vary.

This axiom is actually a very strong assumption which is certainly not true of every data set. In particular, it highly depends on the problem setup. Having the wrong kinds or the wrong number of features, doing an improper conversion, or using the wrong metric can all invalidate the assumption even if the problem inherently has the needed structure. Luckily, for real-world applications we only need the data to adhere to the axiom of neighborliness in approximation (indeed, in practice the axiom is only verifiable in approximation). Of course, what we mean by “approximation” also depends on the problem and the user’s tolerance for error. Such is the nature of applied mathematics.

Once we understand the axiom, the machine learning “algorithm” is essentially obvious. For training, store a number of data points whose classes are known and fix a metric. To determine the class of an unknown data point, simply use the most common class of its nearest neighbors. As one may vary (as a global parameter) the number of neighbors one considers, this method is intuitively called k-nearest-neighbors.

## The Most Basic Way to Learn: Copy Your Neighbors

Let’s iron out the details with a program and test it on some dummy data. Let’s construct a set of points in $\mathbb{R}^2$ which manifestly satisfies the axiom of neighborliness. To do this, we’ll use Python’s random library to make a dataset sampled from two independent normal distributions.

import random

def gaussCluster(center, stdDev, count=50):
return [(random.gauss(center[0], stdDev),
random.gauss(center[1], stdDev)) for _ in range(count)]

def makeDummyData():
return gaussCluster((-4,0), 1) + gaussCluster((4,0), 1)

The first function simply returns a cluster of points drawn from the specified normal distribution. For simplicity we equalize the covariance of the two random variables. The second function simply combines two clusters into a data set.

To give the dummy data class “labels,” we’ll simply have a second list that we keep alongside the data. The index of a data point in the first list corresponds to the index of its class label in the second. There are likely more elegant ways to organize this, but it suffices for now.

Implementing a metric is similarly straightforward. For now, we’ll use the standard Euclidean metric. That is, we simply take the sum of the squared differences of the coordinates of the given two points.

import math

def euclideanDistance(x,y):
return math.sqrt(sum([(a-b)**2 for (a,b) in zip(x,y)]))


To actually implement the classifier, we create a function which itself returns a function.

import heapq

def makeKNNClassifier(data, labels, k, distance):
def classify(x):
closestPoints = heapq.nsmallest(k, enumerate(data),
key=lambda y: distance(x, y[1]))
closestLabels = [labels[i] for (i, pt) in closestPoints]
return max(set(closestLabels), key=closestLabels.count)

return classify


There are a few tricky things going on in this function that deserve discussion. First and foremost, we are defining a function within another function, and returning the created function. The important technical point here is that the created function retains all local variables which are in scope even after the function ends. Specifically, you can call “makeKNNClassifier” multiple times with different arguments, and the returned functions won’t interfere with each other. One is said to close over the values in the environment, and so this programming language feature is called a function closure or just a closure, for short. It allows us, for instance, to keep important data visible while hiding any low-level data it depends on, but which we don’t access directly. From a high level, the decision function entirely represents the logic of the program, and so this view is justified.

Second, we are using some relatively Pythonic constructions. The first line of “classify” uses of heapq to pick the $k$ smallest elements of the data list, but in addition we use “enumerate” to preserve the index of the returned elements, and a custom key to have the judgement of “smallest” be determined by the custom distance function. Note that the indexed “y[1]” in the lambda function uses the point represented by “y” and not the saved index.

The second line simply extracts a list of the labels corresponding each of the closest points returned by the call to “nsmallest.” Finally, the third line returns the maximum of the given labels, where a label’s weight (determined by the poorly named “key” lambda) is its frequency in the “closestLabels” list.

Using these functions is quite simple:

trainingPoints = makeDummyData() # has 50 points from each class
trainingLabels = [1] * 50 + [2] * 50  # an arbitrary choice of labeling

f = makeKNNClassifier(trainingPoints, trainingLabels, 8, euclideanDistance)
print f((-3,0))


The reader may fiddle around with this example as desired, but we will not pursue it further. As usual, all code used in this post is available on this blog’s Google code page. Let’s move on to something more difficult.

## Handwritten Digits

One of the most classic examples in the classification literature is in recognizing handwritten digits. This originally showed up (as the legend goes) in the context of the United States Postal Service for the purpose of automatically sorting mail by the zip code of the destination. Although this author has no quantitative proof, the successful implementation of a scheme would likely save an enormous amount of labor and money. According to the Postal Facts site, there are 31,509 postal offices in the U.S. and, assuming each one processes mail, there is at least one employee at each office who would spend some time sorting by zip code. Given that the USPS processes 23 million pieces of mail per hour, a conservative estimate puts each office spending two hours of labor per day on sorting mail by zip code (resulting in a very rapid pace of 146.52 pieces of mail sorted per minute per worker). At a lower bound of $18/hr this amounts to a cost of$1,134,324 per day, or over 400 million dollars per year. Put in perspective, in one year the amount of money saved equals the entire two-year tuition of Moraine Valley Community College for 68,000 students (twice the current enrollment).

In short, the problem of sorting mail (and of classifying handwritten digits) begs to be automated, and indeed it has been to some degree for about four decades. Let’s see how k-nearest-neighbors fares in this realm.

We obtain our data from the UCI machine learning repository, and with a few minor modifications, we present it on this blog’s Google Code page (along with the rest of the code used in this post). A single line of the data file represents a handwritten digit and its label. The digit is a 256-element vector obtained by flattening a 16×16 binary-valued image in row-major order; the label is an integer representing the number in the picture. The data file contains 1593 instances with about 160 instances per digit.

In other words, our metric space is $\left \{ 0,1 \right \}^{256}$, and we choose the Euclidean metric for simplicity. With the line wrapping to better display the “image,” one line from the data file looks like:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6

After reading in the data appropriately, we randomly split the data set into two pieces, train on one piece, and test on the other. The following function does this,  returning the success rate of the classification algorithm on the testing piece.

import knn
import random

def column(A, j):
return [row[j] for row in A]

def test(data, k):
random.shuffle(data)
pts, labels = column(data, 0), column(data, 1)

trainingData = pts[:800]
trainingLabels = labels[:800]
testData = pts[800:]
testLabels = labels[800:]

f = knn.makeKNNClassifier(trainingData, trainingLabels,
k, knn.euclideanDistance)
correct = 0
total = len(testLabels)

for (point, label) in zip(testData, testLabels):
if f(point) == label:
correct += 1

return float(correct) / total


A run with $k=1$ gives a surprisingly good 89% success rate. Varying $k$, we see this is about as good as it gets without any modifications to the algorithm or the metric. Indeed, the graph below shows that the handwritten digits data set agrees with the axiom of neighborliness to a fair approximation.

A graph of classification accuracy against k for values of k between 1 and 50. The graph clearly shows a downward trend as k increases, but all values k < 10 are comparably good.

Of course, there are many improvements we could make to this naive algorithm. But considering that it utilizes no domain knowledge and doesn’t manipulate the input data in any way, it’s not too shabby.

As a side note, it would be fun to get some tablet software and have it use this method to recognize numbers as one writes it. Alas, we have little time for these sorts of applications.

One reason k-nearest-neighbors is such a common and widely-known algorithm is its ease of implementation. Indeed, we implemented the core algorithm in a mere three lines of Python. On top of that, k-nearest-neighbors is pleasingly parallel, and inherently flexible. Unlike the Perceptron algorithm, which relies on linear separability, k-nearest-neighbors and the axiom of neighborliness allow for datasets with many different geometric structures. These lecture notes give a good example, as shown below, and the reader can surely conjure many more.

k-nearest-neighbors applied to a data set organized in concentric circles.

And of course, the flexibility is even greater by virtue of being able to use any metric for distance computations. One may, for instance, use the Manhattan metric if the points in question are locations in a city. Or if the data is sequential, one could use the dynamic time warping distance (which isn’t truly a metric, but is still useful). The possibilities are only limited by the discovery of new and useful metrics.

With such popularity, k-nearest-neighbors often comes with a number of modifications and enhancements. One enhancement is to heuristically remove certain points close to the decision boundary. This technique is called edited k-nearest-neighbors. Another is to weight certain features heavier in the distance computations, which requires one to programmatically determine which features help less with classification. This is getting close to the realm of a decision tree, and so we’ll leave this as an exercise to the reader.

The next improvement has to do with runtime. Given $n$ training points and $d$ features (d for dimension), one point requires $O(nd)$ to classify. This is particularly expensive, because most of the distance computations performed are between points that are far away, and as $k$ is usually small, they won’t influence in the classification.

One way to alleviate this is to store the data points in a data structure called a k-d tree. The k-d tree originated in computational geometry in the problem of point location. It partitions space into pieces based on the number of points in each resulting piece, and organizes the partitions into a tree. In other words, it will partition tightly where the points are dense, and loosely where the points are sparse. At each step of traversing the tree, one can check to see which sub-partition the unclassified point lies in, and descend appropriately. With certain guarantees, this reduces the computation to $O(\log(n)d)$. Unfortunately, there are issues with large-dimensional spaces that are beyond the scope of this post. We plan to investigate k-d trees further in a future series on computational geometry.

The last issue we consider is in data scaling. Specifically, one needs to be very careful when converting the real world data into numerical data. We can think of each of the features as a random variable, and we want all of these random variables to have comparable variation. The reason is simply because we’re using spheres. One can describe k-nearest-neighbors as finding the smallest (filled-in) sphere centered at the unlabeled point containing $k$ labeled data points, and using the most common of those labels to classify the new point. Of course, one can talk about “spheres” in any metric space; it’s just the set of all points within some fixed distance from the center (and this definition doesn’t depend on the dimension of the space). The important point is that a sphere has uniform length along every axis. If the data is scaled improperly, then the geometry of the sphere won’t mirror the geometry of the data, and the algorithm will flounder.

So now we’ve seen a smattering of topics about k-nearest-neighbors. We’d love to continue the discussion of modifications in the comments. Next time we’ll explore decision trees, and work with another data set. Until then!