# K-Nearest-Neighbors and Handwritten Digit Classification

## The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds; we’ll investigate one pitfall of the conversion process in this post. But the payoff of the conversion is that it separates the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

The Axiom of Neighborliness: Let $(X, d)$ be a metric space and let $S \subset X$ be a finite set whose elements are classified by some function $f : S \to \left \{ 1, 2, \dots, m \right \}$. We say that $S$ satisfies the axiom of neighborliness if for every point $x \in S$, if $y$ is the closest point to $x$, then $f(x) = f(y)$. That is, $y$ shares the same class as $x$ if $y$ is the nearest neighbor to $x$.

For a more in-depth discussion of metrics, the reader should refer to this blog’s primer on the topic. For the purpose of this post and all foreseeable posts, $X$ will always be $\mathbb{R}^n$ for some $n$, while the metric $d$ will vary.

This axiom is actually a very strong assumption which is certainly not true of every data set. In particular, it highly depends on the problem setup. Having the wrong kinds or the wrong number of features, doing an improper conversion, or using the wrong metric can all invalidate the assumption even if the problem inherently has the needed structure. Luckily, for real-world applications we only need the data to adhere to the axiom of neighborliness in approximation (indeed, in practice the axiom is only verifiable in approximation). Of course, what we mean by “approximation” also depends on the problem and the user’s tolerance for error. Such is the nature of applied mathematics.

Once we understand the axiom, the machine learning “algorithm” is essentially obvious. For training, store a number of data points whose classes are known and fix a metric. To determine the class of an unknown data point, simply use the most common class of its nearest neighbors. As one may vary (as a global parameter) the number of neighbors one considers, this method is intuitively called k-nearest-neighbors.

## The Most Basic Way to Learn: Copy Your Neighbors

Let’s iron out the details with a program and test it on some dummy data. Let’s construct a set of points in $\mathbb{R}^2$ which manifestly satisfies the axiom of neighborliness. To do this, we’ll use Python’s random library to make a dataset sampled from two independent normal distributions.

```python
import random

def gaussCluster(center, stdDev, count=50):
    return [(random.gauss(center[0], stdDev),
             random.gauss(center[1], stdDev)) for _ in range(count)]

def makeDummyData():
    return gaussCluster((-4, 0), 1) + gaussCluster((4, 0), 1)
```

The first function simply returns a cluster of points drawn from the specified normal distribution. For simplicity we give the two coordinates equal variance and no covariance. The second function simply combines two clusters into a data set.

To give the dummy data class “labels,” we’ll simply have a second list that we keep alongside the data. The index of a data point in the first list corresponds to the index of its class label in the second. There are likely more elegant ways to organize this, but it suffices for now.

Implementing a metric is similarly straightforward. For now, we’ll use the standard Euclidean metric. That is, we take the square root of the sum of the squared differences of the coordinates of the two given points.

```python
import math

def euclideanDistance(x, y):
    return math.sqrt(sum((a - b)**2 for (a, b) in zip(x, y)))
```


To actually implement the classifier, we create a function which itself returns a function.

```python
import heapq

def makeKNNClassifier(data, labels, k, distance):
    def classify(x):
        closestPoints = heapq.nsmallest(k, enumerate(data),
                                        key=lambda y: distance(x, y[1]))
        closestLabels = [labels[i] for (i, pt) in closestPoints]
        return max(set(closestLabels), key=closestLabels.count)

    return classify
```


There are a few tricky things going on in this function that deserve discussion. First and foremost, we are defining a function within another function, and returning the created function. The important technical point here is that the created function retains access to all local variables which were in scope at its creation, even after the outer function ends. Specifically, you can call “makeKNNClassifier” multiple times with different arguments, and the returned functions won’t interfere with each other. The returned function is said to close over the values in its environment, and so this programming language feature is called a function closure, or just a closure for short. It allows us, for instance, to keep the important behavior visible while hiding the low-level data it depends on but which the caller never accesses directly. From a high level, the returned decision function entirely represents the logic of the program, and so this view is justified.
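For instance, here is a minimal sketch (with made-up toy data) showing that two classifiers produced by “makeKNNClassifier” each remember their own training data:

```python
# Two tiny, made-up datasets purely for illustration.
dataA, labelsA = [(0, 0), (1, 1)], [1, 2]
dataB, labelsB = [(5, 5), (6, 6)], [3, 4]

classifyA = makeKNNClassifier(dataA, labelsA, 1, euclideanDistance)
classifyB = makeKNNClassifier(dataB, labelsB, 1, euclideanDistance)

# Each returned function remembers the arguments it was created with.
print(classifyA((0.2, 0.1)))  # prints 1, the label of the nearest point in dataA
print(classifyB((5.1, 5.3)))  # prints 3, the label of the nearest point in dataB
```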

Second, we are using some relatively Pythonic constructions. The first line of “classify” uses heapq to pick the $k$ smallest elements of the data list, but in addition we use “enumerate” to preserve the indices of the returned elements, and a custom key so that the judgement of “smallest” is determined by the custom distance function. Note that “y[1]” in the lambda function refers to the point represented by “y” and not the saved index.

The second line simply extracts a list of the labels corresponding to each of the closest points returned by the call to “nsmallest.” Finally, the third line returns the most common of those labels, where a label’s weight (determined by the poorly named “key” lambda) is its frequency in the “closestLabels” list.

Using these functions is quite simple:

```python
trainingPoints = makeDummyData()  # has 50 points from each class
trainingLabels = [1] * 50 + [2] * 50  # an arbitrary choice of labeling

f = makeKNNClassifier(trainingPoints, trainingLabels, 8, euclideanDistance)
print(f((-3, 0)))
```


The reader may fiddle around with this example as desired, but we will not pursue it further. As usual, all code used in this post is available on this blog’s Google Code page. Let’s move on to something more difficult.

## Handwritten Digits

One of the most classic examples in the classification literature is recognizing handwritten digits. This originally showed up (as the legend goes) in the context of the United States Postal Service for the purpose of automatically sorting mail by the zip code of the destination. Although this author has no quantitative proof, the successful implementation of such a scheme would likely save an enormous amount of labor and money. According to the Postal Facts site, there are 31,509 postal offices in the U.S. and, assuming each one processes mail, there is at least one employee at each office who would spend some time sorting by zip code. Given that the USPS processes 23 million pieces of mail per hour, a conservative estimate puts each office spending two hours of labor per day on sorting mail by zip code (resulting in a very rapid pace of 146.52 pieces of mail sorted per minute per worker). At a lower bound of $18/hr, this amounts to a cost of $1,134,324 per day, or over 400 million dollars per year. Put in perspective, in one year the amount of money saved equals the entire two-year tuition of Moraine Valley Community College for 68,000 students (twice the current enrollment).
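To spell out the arithmetic behind these estimates:

$\displaystyle 31{,}509 \text{ offices} \times 2 \text{ hours/office/day} \times \$18/\text{hour} = \$1{,}134{,}324 \text{ per day},$

and multiplying by 365 days gives roughly 414 million dollars per year, consistent with the figures above.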

In short, the problem of sorting mail (and of classifying handwritten digits) begs to be automated, and indeed it has been to some degree for about four decades. Let’s see how k-nearest-neighbors fares in this realm.

We obtain our data from the UCI machine learning repository, and with a few minor modifications, we present it on this blog’s Google Code page (along with the rest of the code used in this post). A single line of the data file represents a handwritten digit and its label. The digit is a 256-element vector obtained by flattening a 16×16 binary-valued image in row-major order; the label is an integer representing the number in the picture. The data file contains 1593 instances with about 160 instances per digit.

In other words, our metric space is $\left \{ 0,1 \right \}^{256}$, and we choose the Euclidean metric for simplicity. With the line wrapping to better display the “image,” one line from the data file looks like:

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6
```
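Reading this format into memory is straightforward. The following is a minimal sketch of a loader, where the filename and the exact delimiter handling are assumptions about our modified copy of the data file:

```python
# A sketch of a loader for the digits file: each line holds 256 bits
# followed by the label (note the comma before the label in the sample
# above, which we strip out). Returns a list of (point, label) pairs.
def loadDigitsData(filename='digits.dat'):
    data = []
    with open(filename, 'r') as f:
        for line in f:
            values = line.replace(',', ' ').split()
            if not values:
                continue
            point = tuple(int(v) for v in values[:-1])
            label = int(values[-1])
            data.append((point, label))
    return data
```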

After reading in the data appropriately, we randomly split the data set into two pieces, train on one piece, and test on the other. The following function does this, returning the success rate of the classification algorithm on the testing piece.

```python
import knn
import random

def column(A, j):
    return [row[j] for row in A]

def test(data, k):
    random.shuffle(data)
    pts, labels = column(data, 0), column(data, 1)

    trainingData = pts[:800]
    trainingLabels = labels[:800]
    testData = pts[800:]
    testLabels = labels[800:]

    f = knn.makeKNNClassifier(trainingData, trainingLabels,
                              k, knn.euclideanDistance)
    correct = 0
    total = len(testLabels)

    for (point, label) in zip(testData, testLabels):
        if f(point) == label:
            correct += 1

    return float(correct) / total
```


A run with $k=1$ gives a surprisingly good 89% success rate. Varying $k$, we see this is about as good as it gets without any modifications to the algorithm or the metric. Indeed, the graph below shows that the handwritten digits data set agrees with the axiom of neighborliness to a fair approximation.
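For the curious, here is a minimal sketch of the experiment behind the graph below, averaging the success rate of “test” over a few random train/test splits for each value of $k$ (the helper name is our own):

```python
# Average the success rate of test() over several random splits for each k.
def accuracyByK(data, kValues, trials=5):
    return [(k, sum(test(data, k) for _ in range(trials)) / trials)
            for k in kValues]

# For example: accuracyByK(data, range(1, 51))
```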

*A graph of classification accuracy against k for values of k between 1 and 50. The graph clearly shows a downward trend as k increases, but all values k < 10 are comparably good.*

Of course, there are many improvements we could make to this naive algorithm. But considering that it utilizes no domain knowledge and doesn’t manipulate the input data in any way, it’s not too shabby.

As a side note, it would be fun to get some tablet software and have it use this method to recognize numbers as one writes them. Alas, we have little time for these sorts of applications.

One reason k-nearest-neighbors is such a common and widely-known algorithm is its ease of implementation. Indeed, we implemented the core algorithm in a mere three lines of Python. On top of that, k-nearest-neighbors is pleasingly parallel, and inherently flexible. Unlike the Perceptron algorithm, which relies on linear separability, k-nearest-neighbors and the axiom of neighborliness allow for datasets with many different geometric structures. These lecture notes give a good example, as shown below, and the reader can surely conjure many more.

*k-nearest-neighbors applied to a data set organized in concentric circles.*

And of course, the flexibility is even greater by virtue of being able to use any metric for distance computations. One may, for instance, use the Manhattan metric if the points in question are locations in a city. Or if the data is sequential, one could use the dynamic time warping distance (which isn’t truly a metric, but is still useful). The possibilities are only limited by the discovery of new and useful metrics.
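As a sketch, the taxicab (Manhattan) metric is a one-line change from our Euclidean implementation, and it can be passed to “makeKNNClassifier” in place of “euclideanDistance” without any other modifications:

```python
def manhattanDistance(x, y):
    # The taxicab metric: the sum of coordinate-wise absolute differences.
    return sum(abs(a - b) for (a, b) in zip(x, y))

# For example: makeKNNClassifier(trainingPoints, trainingLabels, 8, manhattanDistance)
```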

With such popularity, k-nearest-neighbors often comes with a number of modifications and enhancements. One enhancement is to heuristically remove certain points close to the decision boundary. This technique is called edited k-nearest-neighbors. Another is to weight certain features more heavily in the distance computations, which requires one to programmatically determine which features help least with classification. This is getting close to the realm of a decision tree, and so we’ll leave this as an exercise to the reader.

The next improvement has to do with runtime. Given $n$ training points and $d$ features ($d$ for dimension), classifying a single point requires $O(nd)$ time. This is particularly wasteful, because most of the distance computations performed are between points that are far away from the query, and since $k$ is usually small, those faraway points won’t influence the classification.

One way to alleviate this is to store the data points in a data structure called a k-d tree. The k-d tree originated in computational geometry in the problem of point location. It partitions space into pieces based on the number of points in each resulting piece, and organizes the partitions into a tree. In other words, it will partition tightly where the points are dense, and loosely where the points are sparse. At each step of traversing the tree, one can check to see which sub-partition the unclassified point lies in, and descend appropriately. With certain guarantees, this reduces the computation to $O(\log(n)d)$. Unfortunately, there are issues with high-dimensional spaces that are beyond the scope of this post. We plan to investigate k-d trees further in a future series on computational geometry.
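To make the idea concrete, here is a minimal sketch of a k-d tree supporting exact nearest-neighbor search; the class layout and helper names are our own choices for illustration, not a production implementation:

```python
import math

def dist(x, y):
    # The usual Euclidean metric, repeated here so the sketch is self-contained.
    return math.sqrt(sum((a - b) ** 2 for (a, b) in zip(x, y)))

class KDNode:
    def __init__(self, point, axis, left, right):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def buildKDTree(points, depth=0):
    # Split on coordinates cyclically, putting the median point at the root,
    # so that dense regions are partitioned more finely than sparse ones.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  buildKDTree(points[:mid], depth + 1),
                  buildKDTree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    # Descend toward the leaf cell containing the query, then backtrack into
    # the far side of a split only when the splitting plane is closer than
    # the best point found so far.
    if node is None:
        return best
    if best is None or dist(query, node.point) < dist(query, best):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if abs(diff) < dist(query, best):
        best = nearest(far, query, best)
    return best
```

For k-nearest-neighbors one would maintain a bounded heap of the $k$ best candidates instead of a single best point, but the traversal logic is the same.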

The last issue we consider is data scaling. Specifically, one needs to be very careful when converting the real world data into numerical data. We can think of each of the features as a random variable, and we want all of these random variables to have comparable variation. The reason is simply that we’re using spheres. One can describe k-nearest-neighbors as finding the smallest (filled-in) sphere centered at the unlabeled point containing $k$ labeled data points, and using the most common of those labels to classify the new point. Of course, one can talk about “spheres” in any metric space; it’s just the set of all points within some fixed distance from the center (and this definition doesn’t depend on the dimension of the space). The important point is that a sphere has uniform length along every axis. If the data is scaled improperly, then the geometry of the sphere won’t mirror the geometry of the data, and the algorithm will flounder.
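As a minimal sketch of one common remedy, we can standardize each feature to have mean zero and unit standard deviation before training and classifying (the function name is our own):

```python
import math

def standardize(points):
    # Rescale each coordinate to mean 0 and standard deviation 1 so that
    # no single feature dominates the distance computations.
    n, dims = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dims)]
    stdDevs = [math.sqrt(sum((p[j] - means[j]) ** 2 for p in points) / n)
               for j in range(dims)]
    return [tuple((p[j] - means[j]) / stdDevs[j] if stdDevs[j] > 0 else 0.0
                  for j in range(dims)) for p in points]
```

Note that any scaling parameters computed from the training data should also be applied, unchanged, to the points being classified.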

So now we’ve seen a smattering of topics about k-nearest-neighbors. We’d love to continue the discussion of modifications in the comments. Next time we’ll explore decision trees, and work with another data set. Until then!

# Metric Spaces — A Primer

## The Blessing of Distance

We have often mentioned the idea of a “metric” on this blog, and we briefly described a formal definition for it. Colloquially, a metric is simply the mathematical notion of a distance function, with certain well-behaved properties. Since we’re now starting to cover a few more metrics (and things which are distinctly not metrics) in the context of machine learning algorithms, we find it pertinent to lay out the definition once again, discuss some implications, and explore a few basic examples.

The most important thing to take away from this discussion is that not all spaces have a notion of distance. For a space to have a metric is a strong property with far-reaching mathematical consequences. Essentially, metrics impose a topology on a space, which the reader can think of as the contortionist’s flavor of geometry. We’ll explore this idea after a few examples.

On the other hand, from a practical standpoint one can still do interesting things without a true metric. The downside is that work relying on (the various kinds of) non-metrics doesn’t benefit as greatly from existing mathematics. This can often spiral into empirical evaluation, where justifications and quantitative guarantees are not to be found.

## Metrics and Metric Spaces

Given a set $X$, we say $X$ is a metric space if it comes equipped with a special function $d(x,y)$ that can compute the distance between any two points $x,y$ of $X$. Specifically, $d$ must satisfy the axioms of a metric.

Definition: A function $d: X \times X \to \mathbb{R}$ is a metric if it satisfies the following three properties for any choice of elements $x, y, z \in X$.

• $d(x,y) \geq 0$ (non-negativity), and $d(x,y) = 0$ if and only if $x=y$.
• $d(x,y) = d(y,x)$ (symmetry)
• $d(x,y) + d(y,z) \geq d(x,z)$ (triangle inequality)

Our goal now is to convince the reader that these three axioms are sensible for every notion of distance to satisfy. The first bullet claims that the distance between any two things can never be negative (hence called “non-negativity”), and that the distance between two things can only be zero if those two things are actually the same thing. The second bullet is a matter of perspective; the distance function reads, “the distance between $x$ and $y$,” and this shouldn’t change based on which element comes first in the sentence. This is the “symmetry” condition.

*Walking from one vertex to a neighboring vertex has an obvious fastest route. This obviousness motivates the triangle inequality as axiomatic.*

If one wants to prove a function is a metric, the third bullet is often the hardest property to establish. It’s called the triangle inequality, and in words it says that the lengths of edges of triangles make sense if you measure them with $d$. Thinking of $x, y, z$ as the vertices of a triangle, such as the one at left, we don’t want the length of one edge to be longer than the combined lengths of the other two edges. It’s a basic fact of Euclidean geometry that such a triangle cannot be drawn.

Pedantically, we notice that the third bullet above uses $\geq$, which includes that $d(x,y) + d(y,z) = d(x,z)$. It is not hard to see that this occurs (in Euclidean space, at least) when $y$ lies on the line segment between $x$ and $z$. In this case it’s not truly a triangle, but it’s just convenient to pack it under the same name.

Aside from analyzing the abstract properties of a metric, the best way to understand this definition is to explore lots and lots of examples.

## Of Norms and Graphs and Levenshtein, of Taxicabs and Kings

The simplest metric one could construct is called the discrete metric. It is defined by $d(x,y) = 0$ if $x = y$ and $d(x,y) = 1$ otherwise. The symmetry and non-negativity conditions are trivially satisfied, and the triangle inequality is easy to prove. If $d(x,y) + d(y,z) < d(x,z) \leq 1$, then it must be that $x \neq z$, but both $x=y$ and $y=z$. The transitivity of equality, however, implies $x=z$, a contradiction.

The discrete metric is completely useless for practical purposes, because all it can do is tell one that two things are equal or not equal. We don’t need a metric to do this in real life. On the other hand, mathematically this metric has a lot of uses. It serves as a conveniently pathological counterexample allowing one to gauge the plausibility of purported theorems in topology. These sorts of things usually only show up in the realm of point-set topology, which we haven’t breached yet on this blog, so we’ll leave it as a relevant link for now (hit the first link to page 41).

The most well known metric by far is the Euclidean metric. In $n$ dimensions, this is just

$\displaystyle d((x_1, \dots, x_n), (y_1, \dots, y_n)) = \sqrt{(y_1 - x_1)^2 + \dots + (y_n - x_n)^2}$

The non-negativity and symmetry of this metric follow from the fact that $(a - b)^2 = (b - a)^2 \geq 0$ for all real numbers $a,b$. The triangle inequality is a bit more difficult to prove, and without using a cop-out like Minkowski’s inequality, one would need to prove the Pythagorean theorem for $\mathbb{R}^n$, which implies the Cauchy-Schwarz inequality, which in turn implies the triangle inequality. Instead, we will do this at the end of this primer for a general vector space and a general inner product. The special case of the usual Euclidean dot product (which induces the Euclidean metric as above) will follow trivially.

The next metric we will inspect is the taxicab metric, also known as the Manhattan metric for the way it mimics driving distances on a grid of streets.

*Image source: Wikipedia*

The picture to the right illustrates this: the green line represents usual Euclidean distance between the two black dots, while the blue, red, and yellow lines all represent the same distance via the taxicab metric. In particular, the distance is the sum of the lengths of the individual line segments, and it’s easy to see that the choice of path is irrelevant.

To make this more rigorous mathematically, we will pick the simplest possible path (the red one) to see that the distance is simply the sum of the differences of the x- and y-coordinates in absolute value. This generalizes to the following formula for arbitrary dimension.

$\displaystyle d((x_1, \dots, x_n), (y_1, \dots, y_n)) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n|$

For reasons the measure-theorist is familiar with, this metric is sometimes called the $L_1$ metric. Much like the Euclidean metric, it also arises from a vector space (albeit not in the usual way). This function is non-negative and symmetric for the same reasons the Euclidean metric is. We will again defer the proof of the triangle inequality to the end of this post.

Next, we have the maximum metric, also known as the Chebyshev metric, which measures the distance it takes a king to travel from one point on a chessboard to another.

*Image source: chessguru.net*

The picture to the left shows this in action. In particular, the distance required for a king to move from one square to another is exactly the maximum of the horizontal and vertical distances between the two squares.

If we generalized the chessboard and king’s movement rules to an arbitrary number of dimensions, it would result in taking the maximum of $|x_i - y_i|$ over each coordinate indexed by $i$.

Rigorously written, the maximum metric is defined by

$\displaystyle d(x,y) = \max_i (|x_i - y_i|)$

As usual, symmetry and non-negativity are completely obvious. The triangle inequality is not so hard here. If $z = x + y$, then

$\max_i |z_i| = \max_i |x_i + y_i| \leq \max_i (|x_i| + |y_i|) \leq \max_i |x_i| + \max_i |y_i|$.

This is the distance between $z$ and 0, and the more general result follows by translating the points in question (and it is easy to see that translation preserves distance).

Next, we can construct a metric on any undirected, weighted (or unweighted) graph $G$, and we naturally call it the graph metric. The space is the set of vertices in $G$, and the distance between two vertices is the length of the shortest path between them, as per the weighting. In the case that $G$ is unweighted, we can equivalently count the number of edges in the shortest path (shortest by edge count) or assume all edge weights are equal to 1. By virtue of the graph being undirected and the weights being non-negative, the symmetry and non-negativity conditions are trivially satisfied. The triangle inequality is (unusually) trivial as well. For if the shortest path from $x$ to $z$ were longer than the combined shortest paths from $x$ to $y$ and from $y$ to $z$ for some vertex $y$, then concatenating those two paths would produce a path from $x$ to $z$ shorter than the shortest one, a contradiction. For those readers familiar with group theory, this idea extends naturally to metrics on groups, using the Cayley graph.
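For the unweighted case, here is a minimal sketch computing the graph metric by breadth-first search, with the graph represented as an adjacency dictionary (the representation and function name are our own choices):

```python
from collections import deque

def graphDistance(graph, u, v):
    # The graph metric on an unweighted, undirected graph: the number of
    # edges in a shortest path, found by breadth-first search.
    if u == v:
        return 0
    visited, queue = {u}, deque([(u, 0)])
    while queue:
        node, depth = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == v:
                return depth + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return float('inf')  # no path; the metric requires a connected graph

# Example: in the path graph a - b - c, graphDistance(g, 'a', 'c') == 2.
g = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
```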

The last example is one we’ve explored at length on this blog, and that is the Levenshtein metric on words. Overly rigorously, it is a metric on a free monoid where we allow substitutions, insertions, and deletions. The reader is encouraged to read more about it in our post on metrics on words. An even simpler version of this is called the Hamming distance, where we only allow substitutions and the two words being compared must have the same length.
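The Hamming distance in particular is especially simple to implement; here is a sketch:

```python
def hammingDistance(x, y):
    # Count the positions at which two equal-length sequences differ.
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    return sum(1 for (a, b) in zip(x, y) if a != b)
```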

## A Few Vague Words on Topology

As we have mentioned briefly before on this blog, a notion of distance allows one to consider any geometric construction that relies only on distance. The easiest examples are circles and ellipses, but one can also talk about convergent sequences, and other more analytic ideas.

But more importantly to mathematicians, metrics generate very tractable topologies. While we’re not suited to begin a full-blown discussion of topology in this primer, we will at least say that a topology is simply a definition of open sets, subject to appropriate properties. For a metric, open sets are usually defined again by circles. That is, one might define the open sets to be all the unions and finite intersections of open disks, where an open disk is a set of the form $\left \{ y : d(x,y) < C \right \}$ for some center $x$, and some constant $C$.

The structure induced by these open sets is very flexible. In particular, two topological spaces are said to be equivalent if there is a bijection between them which preserves open sets in both directions. This allows for all sorts of uncanny stretching and bending, such as those used to turn a sphere inside-out. The formal word for such a function is a homeomorphism, and the two spaces are said to be homeomorphic. One would be right to think that without certain assumptions, topologies could be wild and crazy beyond our imagination. The important point for this post is that a topology coming from a metric space is particularly well-behaved (at least as far as topologies go), satisfying a number of helpful properties for its analysis.

While it might seem weird and arbitrary to talk of open sets as a “structure” of a space, it turns out to yield a very rich theory and surprising applications. We plan to explore some of these applications on this blog in the distant future.

## Inner Product Spaces: a Reprise

In what follows we will give a detailed but elementary treatment of the triangle inequality for a general inner product space. One should note that there are even more general spaces that allow for metrics with the triangle inequality, but these usually involve measure theory or take the triangle inequality as an axiom. In this post, we want to see the triangle inequality occur as a result of the existence of an inner product. For a refresher on inner product spaces and the basic definitions, the reader should refer to our primer on the subject.

Before we continue, we should also note which inner products induce which metrics. For the Euclidean metric it is obviously the Euclidean inner product. For the taxicab metric one should refer to the second page of these notes.

Let $V$ be an inner product space, and let $v \in V$. We define the norm of $v$ to be $\| v \| = \sqrt{\left \langle v,v \right \rangle}$. This coincides with the usual Euclidean norm if we use the Euclidean inner product, and the $L_2$ norm if we use the appropriate integral inner product.

There are some trivial properties one would expect to be true of norms, such as non-negativity, $\| v \| = 0$ if and only if $v = 0$, and $\| av \| = |a| \| v \|$ for scalars $a$. We leave these as exercises to the reader.

As we noted in our primer on inner product spaces, two vectors $v,w$ are said to be orthogonal if $\left \langle v,w \right \rangle = 0$. From this we can prove the Pythagorean Theorem for an inner product space.

Theorem: If $u,v$ are orthogonal vectors, then $\| u + v \|^2 = \| u \|^2 + \| v \|^2$.

Proof. By definition, $\| u + v \|^2 = \left \langle u+v, u+v \right \rangle$, and this expands by linearity of the inner product to

$\displaystyle \|u \|^2 + \| v \|^2 + \left \langle u,v \right \rangle + \left \langle v,u \right \rangle$

As the two vectors are orthogonal, the right two terms are zero, giving the desired result. $\square$

Now given a vector $v$, we describe a useful way to decompose another vector $u$ into two parts, where one is orthogonal to $v$ and one is a scalar multiple of $v$. A simple computation gives a unique result:

$\displaystyle u = \frac{\left \langle u,v \right \rangle}{\| v \|^2}v + \left ( u - \frac{\left \langle u,v \right \rangle}{\|v \|^2}v \right )$

We call the first term the projection of $u$ onto $v$. The second term is then simply the remainder after subtracting off the projection. This construction helps us understand the relationship between two vectors, but it also helps us understand the relationship between the inner product and the norm, as in the following theorem.
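For a concrete sanity check in $\mathbb{R}^2$ with the usual dot product, take $u = (1,2)$ and $v = (3,0)$. Then $\left \langle u,v \right \rangle = 3$ and $\| v \|^2 = 9$, so the projection is $\frac{3}{9}(3,0) = (1,0)$, and the remainder $u - (1,0) = (0,2)$ is indeed orthogonal to $v$.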

Theorem: (The Cauchy-Schwarz Inequality). For all $u,v \in V$,

$|\left \langle u,v \right \rangle| \leq \|u \| \|v \|$.

Proof. If $v=0$ then the inequality trivially holds, so suppose $v \neq 0$. Consider the square norm of the orthogonal decomposition of $u$ onto $v$, where we denote the orthogonal part by $w$.

$\displaystyle u = \frac{\left \langle u,v \right \rangle}{\| v \|^2}v + w$

By the Pythagorean Theorem, we have

$\displaystyle \| u \|^2 = \left \| \frac{\left \langle u,v \right \rangle}{\| v \|^2}v \right \|^2 + \| w \|^2$

Since a scalar passes out of the norm as its absolute value (and everything here is squared), this is the same as

$\displaystyle \| u \|^2 = \frac{|\left \langle u,v \right \rangle |^2}{\| v \|^2} + \|w \|^2$

Since norms are non-negative, we can omit the $w$ part and get an inequality

$\displaystyle \| u \|^2 \geq \frac{| \left \langle u,v \right \rangle |^2}{\| v \|^2}$.

Multiplying both sides by $\| v \|^2$ and taking square roots gives the result. $\square$

And now we may finally get to the triangle inequality for norms, which says that $\| u + v \| \leq \| u \| + \| v \|$. Before we prove this, note that we can bring this back to the world of metrics by defining a metric based on the norm as $d(u,v) = \| u-v \|$, and the statement about the triangle inequality translates to what we expect it should. So concluding this primer we present the proof of the triangle inequality.

Theorem: For all $u,v \in V, \| u+v \| \leq \| u \| + \| v \|$.

Proof. Expanding $\| u + v \|^2$ using the properties of the inner product we get

$\displaystyle \| u + v \|^2 = \| u \|^2 + \| v \|^2 + 2 \textup{Re}\left \langle u,v \right \rangle$

where “Re” stands for the real part of the (possibly complex-valued) inner product. As the real part is bounded by the complex absolute value, we introduce our first inequality:

$\displaystyle \| u +v \|^2 \leq \| u \|^2 + \| v \|^2 + 2|\left \langle u,v \right \rangle|$

By the Cauchy-Schwarz inequality, the last term is bounded by the norms of $u$ and $v$, giving

$\displaystyle \| u+v \|^2 \leq \| u \|^2 + \| v \|^2 + 2 \| u \| \| v \| = (\| u \| + \| v \|)^2$

And taking square roots gives the result. $\square$

# Machine Learning — Introduction

## A Series on Machine Learning

These days an absolutely staggering amount of research and development work goes into the very coarsely defined field of “machine learning.” Part of the reason why it’s so coarsely defined is because it borrows techniques from so many different fields. Many problems in machine learning can be phrased in different but equivalent ways. While they are often purely optimization problems, such techniques can be expressed in terms of statistical inference, have biological interpretations, or have a distinctly geometric and topological flavor. As a result, machine learning has come to be understood as a toolbox of techniques as opposed to a unified theory.

It is unsurprising, then, that such a multitude of mathematics supports this diversified discipline. Practitioners (that is, algorithm designers) rely on statistical inference, linear algebra, convex optimization, and dabble in graph theory, functional analysis, and topology. Of course, above all else machine learning focuses on algorithms and data.

The general pattern, which we’ll see over and over again as we derive and implement various techniques, is to develop an algorithm or mathematical model, test it on datasets, and refine the model based on specific domain knowledge. The first step usually involves a leap of faith based on some mathematical intuition. The second step commonly involves a handful of established and well understood datasets (often taken from the University of California at Irvine’s machine learning database, and there is some controversy over how ubiquitous this practice is). The third step often seems to require some voodoo magic to tweak the algorithm and the dataset to complement one another.

It is this author’s personal belief that the most important part of machine learning is the mathematical foundation, followed closely by efficiency in implementation details. The thesis is that natural data has inherent structure, and that the goal of machine learning is to represent this and utilize it. To make true progress, one must represent and analyze structure abstractly. And so this blog will focus predominantly on mathematical underpinnings of the algorithms and the mathematical structure of data.

## General Plans

While we do intend to cover the classical topics of machine learning, such as neural networks and decision trees, we would like to quickly approach more sophisticated modern techniques such as support vector machines and methods based on Kolmogorov complexity. And so we put forth the ambitious list of topics (in no particular order).

• K nearest neighbors
• Decision trees
• Centroid-based and density-based clustering
• Neural networks
• Support vector machines
• Regression
• Bayesian inference and networks
• Methods based on Kolmogorov complexity
• Manifold learning and persistent homology

This long and circuitous journey will inevitably require arbitrarily large but finite detours to cover the mathematical background. We’ll cover metric spaces, functional analysis, mathematical statistics and probability theory, abstract algebra, topology, and even some category theory. Note that some of the more esoteric (i.e., advanced) topics will have their own series as well (for instance, we’ve had an itch to do computational category theory but having covered none of the typical concrete applications of category theory, the jump into extreme abstraction would come off as pointlessly complicated).

Of course, as we’ve mentioned before, while the mathematics is motivated by our desire to connect ideas, programming is motivated by what we can do. And so we’re interested in using machine learning methods to perform cool tasks. Some ideas we plan to implement on this blog include social network analysis, machine vision and optical character recognition, spam classification, natural language processing, speech recognition, and content classification and recommendation.

Finally, we are interested in the theoretical boundaries on what is possible for a computer to learn. Aside from its practical use, this area of study would require us to rigorously define what it means for a machine to “learn.” This field is known as computational learning theory, and a good deal of it is devoted to the typical complexity-theory-type questions such as “can this class of classification functions be learned in polynomial time?” In addition, this includes learning theories of a more statistical flavor, such as the “Probably Approximately Correct” model. We plan to investigate each of these models in turn as they come up in our explorations.

If any readers have suggestions for additional machine learning topics (to add to this already gargantuan list), feel free to pipe in with a comment! We’ll begin with an exploration of the simplest algorithm on the above list, k nearest neighbors, and a more rigorous exploration of metric spaces.

Until then!