Decision Trees and Political Party Classification

Last time we investigated the k-nearest-neighbors algorithm and the underlying idea that one can learn a classification rule by copying the known classification of nearby data points. This required that we view our data as sitting inside a metric space; that is, we imposed a kind of geometric structure on our data. One glaring problem is that there may be no reasonable way to do this. While we mentioned scaling issues and provided a number of possible metrics in our primer, a more common problem is that the data simply isn’t numeric.

For instance, a poll of US citizens might ask the respondent to select which of a number of issues he cares most about. There could be 50 choices, and there is no reasonable way to assign these numerical values so that all are equidistant in the resulting metric space.

Another issue is that the quality of the data could be bad. For instance, there may be missing values for some attributes (e.g., a respondent may neglect to answer one or more questions). Alternatively, the attributes or the classification label could be wrong; that is, the data could exhibit noise. For k-nearest-neighbors, missing an attribute means we can’t (or can’t accurately) compute the distance function. And noisy data can interfere with our choice of $ k$. In particular, certain regions might be better with a smaller value of $ k$, while regions with noisier data might require a larger $ k$ to achieve the same accuracy rate.  Without making the algorithm sufficiently more complicated to vary $ k$ when necessary, our classification accuracy will suffer.

In this post we’ll see how decision trees can alleviate these issues, and we’ll test the decision tree on an imperfect data set of congressional voting records. We’ll implement the algorithm in Python, and test it on the problem of predicting the US political party affiliation of members of Congress based on their votes for a number of resolutions. As usual, we post the entire code and data set on this blog’s Github page.

Before going on, the reader is encouraged to read our primer on trees. We will assume familiarity with the terminology defined there.


Imagine we have a data set where each record is a list of categorical weather conditions on a randomly selected number of days, and the labels correspond to whether a girl named Arya went for a horse ride on that day. Let’s also assume she would like to go for a ride every day, and the only thing that might prohibit her from doing so is adverse weather. In this case, the input variables will be the condition in the sky (sunny, cloudy, rainy, and snow), the temperature (cold, warm, and hot), the relative humidity (low, medium, and high), and the wind speed (low and high). The output variable will be whether Arya goes on a horse ride that day. Some entries in this data set might look like:

                 Arya's Riding Data
Sky     Temperature    Humidity    Wind    Horse Ride
Cloudy  Warm           Low         Low     Yes
Rainy   Cold           Medium      Low     No
Sunny   Warm           Medium      Low     Yes
Sunny   Hot            High        High    No
Snow    Cold           Low         High    No
Rainy   Warm           High        Low     Yes

In this case, one might reasonably guess that certain weather features are more important than others in determining whether Arya can go for a horse ride. For instance, the difference between sun and rain/snow should be a strong indicator, although it is not always correct in this data set. In other words, we’re looking for a weather feature that best separates the data into its respective classes. Of course, we’ll need a rigorous way to measure how good that separation is, but intuitively we can continue.

For example, we might split based on the wind speed feature. In this case, we have two smaller data sets corresponding to the entries where the wind is high and low. The corresponding table might look like:

     Arya's Riding Data, Wind = High
Sky     Temperature    Humidity    Horse Ride
Sunny   Hot            High        No
Snow    Cold           Low         No

     Arya's Riding Data, Wind = Low
Sky     Temperature    Humidity    Horse Ride
Cloudy  Warm           Low         Yes
Rainy   Cold           Medium      No
Sunny   Warm           Medium      Yes
Rainy   Warm           High        Yes

In this case, Arya is never known to ride a horse when the wind speed is high, and there is only one occasion when she doesn’t ride a horse and the wind speed is low. Taking this one step further, we can repeat the splitting process on the “Wind = Low” data in search of a complete split between the two output classes. We can see by visual inspection that the only “no” instance occurs when the temperature is cold. Hence, we should split on the temperature feature.

It is not useful to write out another set of tables (one feels the pain already when imagining a data set with a thousand entries), because in fact there is a better representation. The astute reader will have already recognized that our process of picking particular values for the weather features is just the process of traversing a tree.

Let’s investigate this idea closer. Imagine we have a tree where the root node corresponds to the Wind feature, and it has two edges connected to child nodes; one edge corresponds to the value “Low” and the other to “High.” That is, the process of traveling from the root to a child along an edge is the process of selecting only those data points whose “Wind” feature is that edge’s label. We can take the child corresponding to “Low” wind and have it represent the Temperature feature, further adding three child nodes with edges corresponding to the “Cold,” “Warm,” and “Hot” values.

We can stop this process once the choice of features completely splits our data set. Pictorially, our tree would look like this:

We reasonably decide to stop the traversal when all of the examples in the split are in the same class. More so, we would not want to include the option for the temperature to be Hot in the right subtree, because the data tells us nothing about such a scenario (as indicated by the “None” in the corresponding leaf).

Now that we have the data organized as a tree, we can try to classify new data with it. Suppose the new example is:

Sky     Temperature    Humidity    Wind    Horse Ride
Rainy   Cold           Low         Low     ?

We first inspect the wind speed feature, and seeing that it is “Low,” we follow the edge to the right subtree and repeat. Seeing that the temperature feature is “Cold,” we further descend down the “Cold” branch, reaching the “All No” leaf. Since this leaf corresponds to examples we’ve seen which are all in the “No” class, we should classify the new data as “No” as well.

Summarizing, given a new piece of data, we can traverse the tree according to the values of its features until we reach a leaf node. If the leaf node is “All No,” then we classify the new set of weather conditions as a “No,” and if it is “All Yes,” we classify as “Yes.”

Of course, this tree makes it clear that this toy data set is much too small to be useful, and the rules we’ve extrapolated from it are ridiculous. In particular, surely some people ride horses when the wind speed is high, and they would be unlikely to do so in a warm, low-wind thunderstorm. Nevertheless, we might expect a larger data set to yield a more sensible tree, as the data would more precisely reflect the true reasons one might refrain from riding a hose.

Before we generalize this example to any data set, we should note that there is yet another form for our classification rule. In particular, we can write the traversal from the root to the rightmost leaf as a boolean expression of the form:

$ \displaystyle \textup{Wind = “Low”} \wedge \textup{Temp = “Warm”}$

An example will be classified as “Yes” if and only if the wind feature is “High” and the temperature feature is “Warm” (here the wedge symbol $ \wedge$ means “and,” and is called a conjunction). If we had multiple such routes leading to leaves labeled with “All Yes,” say a branch for wind being “High” and sky being “Sunny,” we could expand our expression as a disjunction (an “or,” denoted $ \vee$) of the two corresponding conjunctions as follows:

 $ \displaystyle (\textup{Wind = “Low”} \wedge \textup{Temp = “Warm”}) \vee (\textup{Wind = “High”} \wedge \textup{Sky = “Sunny”})$

In the parlance of formal logic, this kind of expression is called the disjunctive normal form, that is, a disjunction of conjunctions. It’s an easy exercise to prove that every propositional statement (in our case, using only and, or, and parentheses for grouping) can be put into disjunctive normal form. That is, any boolean condition that can be used to classify the data can be expressed in a disjunctive normal form, and hence as a decision tree.

Such a “boolean condition” is an example of a hypothesis, which is the formal term for the rule an algorithm uses to classify new data points. We call the set of all possible hypotheses expressible by our algorithm the hypothesis space of our algorithm. What we’ve just shown above is that decision trees have a large and well-defined hypothesis space. On the other hand, it is much more difficult to describe the hypothesis space for an algorithm like k-nearest-neighbors. This is one argument in favor of decision trees: they have a well-understood hypothesis space, and that makes them analytically tractable and interpretable.

Using Entropy to Find Optimal Splits

The real problem here is not in using a decision tree, but in constructing one from data alone. At any step in the process we outlined in the example above, we need to determine which feature is the right one to split the data on. That is, we need to choose the labels for the interior nodes in so that the resulting data subsets are as homogeneous as possible. In particular, it would be nice to have a quantitative way to measure the quality of a split. Then at each step we could simply choose the feature whose split yields the highest value under this measurement.

While we won’t derive such a measurement in this post, we will use one that has an extensive history of applications: Shannon entropy.

Definition: Let $ D$ be a discrete probability distribution $ (p_1, p_2, \dots, p_n)$. Then the Shannon entropy of $ D$, denoted $ E(p_1, \dots, p_n)$ is

$ \displaystyle E(p_1, \dots , p_n) = – \sum_{i=1}^n p_i \log(p_i)$

Where the logarithms are taken in base 2.

In English, there are $ n$ possible outcomes numbered 1 to $ n$, and the probability that an instance drawn from $ D$ results in the outcome $ k$ is $ p_k$. Then Shannon’s entropy function computes a numerical quantity describing how “dispersed” the outcomes are.

While there are many other useful interpretations of Shannon entropy, we only need it to describe how well the data is split into its classes. For our purposes, the probability distribution will simply be the observed proportions of data with respect to their class labels. In the case of Arya’s horse riding, the initial distribution would be $ (1/2, 1/2)$, giving an entropy of $ 1$.

Let’s verify that Shannon’s entropy function makes sense for our problem. Specifically, the best scenario for splitting the data on a feature is a perfect split; that is, each subset only has data from one class. On the other hand, the worst case would be where each subset is uniformly distributed across all classes (if there are $ n$ classes, then each subset has $ 1/n$ of its data from each class).

Indeed, if we adopt the convention that $ \log(0) = 0$, then the entropy of $ (1,0, \dots, 0)$ consists of a single term $ -1 \log(1) = 0$. It is clear that this does not depend on the position of the 1 within the probability distribution. On the other hand, the entropy of $ (1/n, \dots, 1/n)$ is

$ \displaystyle -\sum_{i=1}^n\frac{1}{n}\log \left (\frac{1}{n} \right ) = -\log \left (\frac{1}{n} \right ) = -(0 – \log(n)) = \log(n)$

A well-known property of the entropy function tells us that this is in fact the maximum value for this function.

Summarizing this, in the best case entropy is minimized after the split, and in the worst case entropy is maximized. But we can’t simply look at the entropy of each subset after splitting. We need a sensible way to combine these entropies and to compare them with the entropy of the data before splitting. In particular, we would quantify the “decrease” in entropy caused by a split, and maximize that quantity.

Definition: Let $ S$ be a data set and $ A$ a feature with values $ v \in V$, and let $ E$ denote Shannon’s entropy function. Moreover, let $ S_v$ denote the subset of $ S$ for which the feature $ A$ has the value $ v$. The gain of a split along the feature $ A$, denoted $ G(S,A)$ is

$ \displaystyle G(S,A) = E(S) – \sum_{v \in V} \frac{|S_v|}{|S|} E(S_v)$

That is, we are taking the difference of the entropy before the split, and subtracting off the entropies of each part after splitting, with an appropriate weight depending on the size of each piece. Indeed, if the entropy grows after the split (that is if the data becomes more mixed), then this number will be small. On the other hand if the split separates the classes nicely, each subset $ S_v$ will have small entropy, and hence the value will be large.

It requires a bit of mathematical tinkering to be completely comfortable that this function actually does what we want it to (for instance, it is not obvious that this function is non-negative; does it make sense to have a negative gain?). We won’t tarry in those details (this author has spent at least a day or two ironing them out), but we can rest assured that this function has been studied extensively, and nothing unexpected happens.

So now the algorithm for building trees is apparent: at each stage, simply pick the feature for which the gain function is maximized, and split the data on that feature. Create a child node for each of the subsets in the split, and connect them via edges with labels corresponding to the chosen feature value for that piece.

This algorithm is classically called ID3, and we’ll implement it in the next section.

Building a Decision Tree in Python

As with our primer on trees, we can use a quite simple data structure to represent the tree, but here we need a few extra pieces of data associated with each node.

class Tree:
   def __init__(self, parent=None):
      self.parent = parent
      self.children = []
      self.splitFeature = None
      self.splitFeatureValue = None
      self.label = None

In particular, now that features can have more than two possible values, we need to allow for an arbitrarily long list of child nodes. In addition, we add three pieces of data (with default values None): the splitFeature is the feature for which each of its children assumes a separate value; the splitFeatureValue is the feature assumed for its parent’s split; and the label (which is None for all interior nodes) is the final classification label for a leaf.

We also need to nail down our representations for the data. In particular, we will represent a data set as a list of pairs of the form (point, label), where the point is itself a list of the feature values, and the label is a string.

Now given a data set the first thing we need to do is compute its entropy. For that we can first convert it to a distribution (in the sense defined above, a list of probabilities which sum to 1):

def dataToDistribution(data):
   ''' Turn a dataset which has n possible classification labels into a
       probability distribution with n entries. '''
   allLabels = [label for (point, label) in data]
   numEntries = len(allLabels)
   possibleLabels = set(allLabels)

   dist = []
   for aLabel in possibleLabels:
      dist.append(float(allLabels.count(aLabel)) / numEntries)

   return dist

And we can compute the entropy of such a distribution in the obvious way:

def entropy(dist):
   ''' Compute the Shannon entropy of the given probability distribution. '''
   return -sum([p * math.log(p, 2) for p in dist])

Now in order to compute the gain of a data set by splitting on a particular value, we need to be able to split the data set. To do this, we identify features with their index in the list of feature values of a given data point, enumerate all possible values of that feature, and generate the needed subsets one at a time. In particular, we use a Python generator object:

def splitData(data, featureIndex):
   ''' Iterate over the subsets of data corresponding to each value
       of the feature at the index featureIndex. '''

   # get possible values of the given feature
   attrValues = [point[featureIndex] for (point, label) in data]

   for aValue in set(attrValues):
      dataSubset = [(point, label) for (point, label) in data
                    if point[featureIndex] == aValue]

      yield dataSubset

So to compute the gain, we simply need to iterate over the set of all splits, and compute the entropy of each split. In code:

def gain(data, featureIndex):
   ''' Compute the expected gain from splitting the data along all possible
       values of feature. '''

   entropyGain = entropy(dataToDistribution(data))

   for dataSubset in splitData(data, featureIndex):
      entropyGain -= entropy(dataToDistribution(dataSubset))

   return entropyGain

Of course, the best split (represented as the best feature to split on) is given by such a line of code as:

bestFeature = max(range(n), key=lambda index: gain(data, index))

We can’t quite use this line exactly though, because while we’re building up the decision tree (which will of course be a recursive process) we need to keep track of which features have been split on previously and which have not; this data is different for each possible traversal of the tree. In the end, our function to build a decision tree requires three pieces of data: the current subset of the data to investigate, the root of the current subtree that we are in the process of building, and the set of features we have yet to split on.

Of course, this raises the obvious question about the base cases. One base case will be when we run out of data to split; that is, when our input data all have the same classification label. To check for this we implement a function called “homogeneous”

def homogeneous(data):
   ''' Return True if the data have the same label, and False otherwise. '''
   return len(set([label for (point, label) in data])) <= 1

The other base case is when we run out of good features to split on. Of course, if the true classification function is actually a decision tree then this won’t be the case. But now that we’re in the real world, we can imagine there may be two data points with identical features but different classes. Perhaps the simplest way to remedy this situation is to terminate the tree at that point (when we run out of features to split on, or no split gives positive gain), and use a simple majority vote to label the new leaf. In a sense, this strategy is a sort of nearest-neighbors vote as a default. To implement this, we have a function which simply patches up the leaf appropriately:

def majorityVote(data, node):
   ''' Label node with the majority of the class labels in the given data set. '''
   labels = [label for (pt, label) in data]
   choice = max(set(labels), key=labels.count)
   node.label = choice
   return node

The base cases show up rather plainly in the code to follow, so let us instead focus on the inductive step. We declare our function to accept the data set in question, the root of the subtree to be built, and a list of the remaining allowable features to split on. The function begins with:

def buildDecisionTree(data, root, remainingFeatures):
   ''' Build a decision tree from the given data, appending the children
       to the given root node (which may be the root of a subtree). '''

   if homogeneous(data):
      root.label = data[0][1]
      return root

   if len(remainingFeatures) == 0:
      return majorityVote(data, root)

   # find the index of the best feature to split on
   bestFeature = max(remainingFeatures, key=lambda index: gain(data, index))

   if gain(data, bestFeature) == 0:
      return majorityVote(data, root)

   root.splitFeature = bestFeature

Here we see the base cases, and the selection of the best feature to split on. As a side remark, we observe this is not the most efficient implementation. We admittedly call the gain function and splitData functions more often than necessary, but we feel what is lost in runtime speed is gained in code legibility.

Once we bypass the three base cases, and we have determined the right split, we just do it:

def buildDecisionTree(data, root, remainingFeatures):
   ''' Build a decision tree from the given data, appending the children
       to the given root node (which may be the root of a subtree). '''


   # add child nodes and process recursively
   for dataSubset in splitData(data, bestFeature):
      aChild = Tree(parent=root)
      aChild.splitFeatureValue = dataSubset[0][0][bestFeature]

      buildDecisionTree(dataSubset, aChild, remainingFeatures - set([bestFeature]))

   return root

Here we iterate over the subsets of data after the split, and create a child node for each. We then assign the child its corresponding feature value in the splitFeatureValue variable, and append the child to the root’s list of children. Next is where we first see the remainingFeatures set come into play, and in particular we note the overloaded minus sign as an operation on sets. This is a feature of python sets, and in particular it behaves exactly like set exclusion in mathematics. The astute programmer will note that the minus operation generates a new set, so that further recursive calls to buildDecisionTree will not be affected by what happens in this recursive call.

Now the first call to this function requires some initial parameter setup, so we define a convenience function that only requires a single argument: the data.

def decisionTree(data):
   return buildDecisionTree(data, Tree(), set(range(len(data[0][0]))))

Classifying New Data

The last piece of the puzzle is to classify a new piece of data once we’ve constructed the decision tree. This is a considerably simpler recursive process. If the current node is a leaf, output its label. Otherwise, recursively search the subtree (the child of the current node) whose splitFeatureValue matches the new data’s choice of the feature being split. In code,

def classify(tree, point):
   ''' Classify a data point by traversing the given decision tree. '''

   if tree.children == []:
      return tree.label
      matchingChildren = [child for child in tree.children
         if child.splitFeatureValue == point[tree.splitFeature]]

      return classify(matchingChildren[0], point)

And we can use this function to naturally test a dataset:

def testClassification(data, tree):
   actualLabels = [label for point, label in data]
   predictedLabels = [classify(tree, point) for point, label in data]

   correctLabels = [(1 if a == b else 0) for a,b in zip(actualLabels, predictedLabels)]
   return float(sum(correctLabels)) / len(actualLabels)

But now we run into the issue of noisy data. What if one wants to classify a point where one of the feature values which is used in the tree is unknown? One can take many approaches to remedy this, and we choose a simple one: simply search both routes, and use a majority vote when reaching a leaf. This requires us to add one additional piece of information to the leaf nodes: the total number of labels in each class used to build that leaf (recall, one of our stopping conditions resulted in a leaf having heterogeneous data). We omit the details here, but the reader is invited to read them on this blog’s Github page, where as usual we have provided all of the code used in this post.

Classifying Political Parties Based on Congressional Votes

We now move to a concrete application of decision trees. The data set we will work with comes from the UCI machine learning repository, and it records the votes cast by the US House of Representatives during a particular session of Congress in 1984. The data set has 16 features; that is, there were 16 different measures considered “key” measures that were vote upon during this session. So each point in the dataset represents the 16 votes of a single House member in that session. With a bit of reformatting, we provide the complete data set on this blog’s Github page.

Our goal is to learn party membership based on the voting records. This data set is rife with missing values; roughly half of the members abstained from voting on some of these measures. So we constructed a decision tree from the clean portion of the data, and use that to classify the remainder of the data.

Indeed, this data fits precisely into the algorithm we designed above. The code to construct a tree is almost trivial:

   with open('house-votes-1984.txt', 'r') as inputFile:
      lines = inputFile.readlines()

   data = [line.strip().split(',') for line in lines]
   data = [(x[1:], x[0]) for x in data]

   cleanData = [x for x in data if '?' not in x[0]]
   noisyData = [x for x in data if '?' in x[0]]

   tree = decisionTree(cleanData)

Indeed, the classification accuracy in doing this is around 90%. In addition (though we will revisit the concept of overfitting later), this is stable despite variation in the size of the subset of data used to build the tree. The graph below shows this.

The size of the subset used to construct the tree versus its accuracy in classifying the remainder of the data. Note that the subsets were chosen uniformly at random without replacement. The x-axis is the number of points used to construct the tree, and the y-axis is the proportion of labels correctly classified.

Inspecting the trees generated in this process, it appears that the most prominent feature to split on is the adoption of a new budget resolution. Very few Demorats voted in favor of this, so for many of the random subsets of the data, a split on this feature left one side homogeneously Republican.

Overfitting, Pruning, and Other Issues

Now there are some obvious shortcomings to the method in general. If the data set used to build the decision tree is enormous (in dimension or in number of points), then the resulting decision tree can be arbitrarily large in size. In addition, there is the pitfall of overfitting to this particular data set. For the party classification problem above, the point is to extend the classification to any population of people who might vote on these issues (or, more narrowly, to any politician who might vote on these issues). If we make our decision tree very large, then the hypothesis may be overly specific to the people in the sample used, and hence will not generalize well.

This problem is called overfitting to the data, and it’s a prevalent concern among all machine learning algorithms. There are a number of ways to avoid it for decision trees. Perhaps the most common is the idea of pruning: one temporarily removes all possible proper subtrees and reevaluates the classification accuracy for that removal. Whichever subtree results in the greatest increase in accuracy is actually removed, and it is replaced with a single leaf whose label corresponds to the majority label of the data points used to create the entire subtree. This process is then repeated until there are no possible improvements, or the gain is sufficiently small.

From a statistical point of view one could say this process is that of ignoring outliers. Any points which do not generally agree with the whole trend of the data set (hence, create their own branches in the decision tree) are simply removed. From a theoretical point of a view, a smaller decision tree satisfies the principle of Occam’s razor: a simpler hypothesis is more accurate by virtue of being simple.

While we won’t implement a pruning method here (indeed, we didn’t detect any overfitting in the congressional voting example), but it would be a nice exercise for the reader to wet his feet with the code given above. Finally, there are other algorithms to build decision trees that we haven’t mentioned here. You can see a list of such algorithms on the relevant wikipedia page. Because of the success of ID3, there is a large body of research on such algorithms.

In any event, next time we’ll investigate yet another machine learning method: that of neural networks. We’ll also start to look at more general frameworks for computational learning theory. That is, we’ll exercise the full might of theoretical mathematics to reason about how hard certain problems are to learn (or whether they can be learned at all).

Until then!

K-Nearest-Neighbors and Handwritten Digit Classification

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

The Axiom of Neighborliness: Let $ (X, d)$ be a metric space and let $ S \subset X$ be a finite set whose elements are classified by some function $ f : S \to \left \{ 1, 2, \dots, m \right \}$. We say that $ S$ satisfies the axiom of neighborliness if for every point $ x \in S$, if $ y$ is the closest point to $ x$, then $ f(x) = f(y)$. That is, $ y$ shares the same class as $ x$ if $ y$ is the nearest neighbor to $ x$.

For a more in-depth discussion of metrics, the reader should refer to this blog’s primer on the topic. For the purpose of this post and all foreseeable posts, $ X$ will always be $ \mathbb{R}^n$ for some $ n$, while the metric $ d$ will vary.

This axiom is actually a very strong assumption which is certainly not true of every data set. In particular, it highly depends on the problem setup. Having the wrong kinds or the wrong number of features, doing an improper conversion, or using the wrong metric can all invalidate the assumption even if the problem inherently has the needed structure. Luckily, for real-world applications we only need the data to adhere to the axiom of neighborliness in approximation (indeed, in practice the axiom is only verifiable in approximation). Of course, what we mean by “approximation” also depends on the problem and the user’s tolerance for error. Such is the nature of applied mathematics.

Once we understand the axiom, the machine learning “algorithm” is essentially obvious. For training, store a number of data points whose classes are known and fix a metric. To determine the class of an unknown data point, simply use the most common class of its nearest neighbors. As one may vary (as a global parameter) the number of neighbors one considers, this method is intuitively called k-nearest-neighbors.

The Most Basic Way to Learn: Copy Your Neighbors

Let’s iron out the details with a program and test it on some dummy data. Let’s construct a set of points in $ \mathbb{R}^2$ which manifestly satisfies the axiom of neighborliness. To do this, we’ll use Python’s random library to make a dataset sampled from two independent normal distributions.

import random

def gaussCluster(center, stdDev, count=50):
    return [(random.gauss(center[0], stdDev),
             random.gauss(center[1], stdDev)) for _ in range(count)]

def makeDummyData():
    return gaussCluster((-4,0), 1) + gaussCluster((4,0), 1)

The first function simply returns a cluster of points drawn from the specified normal distribution. For simplicity we equalize the covariance of the two random variables. The second function simply combines two clusters into a data set.

To give the dummy data class “labels,” we’ll simply have a second list that we keep alongside the data. The index of a data point in the first list corresponds to the index of its class label in the second. There are likely more elegant ways to organize this, but it suffices for now.

Implementing a metric is similarly straightforward. For now, we’ll use the standard Euclidean metric. That is, we simply take the sum of the squared differences of the coordinates of the given two points.

import math

def euclideanDistance(x,y):
    return math.sqrt(sum([(a-b)**2 for (a,b) in zip(x,y)]))

To actually implement the classifier, we create a function which itself returns a function.

import heapq

def makeKNNClassifier(data, labels, k, distance):
    def classify(x):
        closestPoints = heapq.nsmallest(k, enumerate(data),
                                        key=lambda y: distance(x, y[1]))
        closestLabels = [labels[i] for (i, pt) in closestPoints]
        return max(set(closestLabels), key=closestLabels.count)

    return classify

There are a few tricky things going on in this function that deserve discussion. First and foremost, we are defining a function within another function, and returning the created function. The important technical point here is that the created function retains all local variables which are in scope even after the function ends. Specifically, you can call “makeKNNClassifier” multiple times with different arguments, and the returned functions won’t interfere with each other. One is said to close over the values in the environment, and so this programming language feature is called a function closure or just a closure, for short. It allows us, for instance, to keep important data visible while hiding any low-level data it depends on, but which we don’t access directly. From a high level, the decision function entirely represents the logic of the program, and so this view is justified.

Second, we are using some relatively Pythonic constructions. The first line of “classify” uses of heapq to pick the $ k$ smallest elements of the data list, but in addition we use “enumerate” to preserve the index of the returned elements, and a custom key to have the judgement of “smallest” be determined by the custom distance function. Note that the indexed “y[1]” in the lambda function uses the point represented by “y” and not the saved index.

The second line simply extracts a list of the labels corresponding each of the closest points returned by the call to “nsmallest.” Finally, the third line returns the maximum of the given labels, where a label’s weight (determined by the poorly named “key” lambda) is its frequency in the “closestLabels” list.

Using these functions is quite simple:

trainingPoints = makeDummyData() # has 50 points from each class
trainingLabels = [1] * 50 + [2] * 50  # an arbitrary choice of labeling

f = makeKNNClassifier(trainingPoints, trainingLabels, 8, euclideanDistance)
print f((-3,0))

The reader may fiddle around with this example as desired, but we will not pursue it further. As usual, all code used in this post is available on this blog’s Github page. Let’s move on to something more difficult.

Handwritten Digits

One of the most classic examples in the classification literature is in recognizing handwritten digits. This originally showed up (as the legend goes) in the context of the United States Postal Service for the purpose of automatically sorting mail by the zip code of the destination. Although this author has no quantitative proof, the successful implementation of a scheme would likely save an enormous amount of labor and money. According to the Postal Facts site, there are 31,509 postal offices in the U.S. and, assuming each one processes mail, there is at least one employee at each office who would spend some time sorting by zip code. Given that the USPS processes 23 million pieces of mail per hour, a conservative estimate puts each office spending two hours of labor per day on sorting mail by zip code (resulting in a very rapid pace of 146.52 pieces of mail sorted per minute per worker). At a lower bound of $18/hr this amounts to a cost of $1,134,324 per day, or over 400 million dollars per year. Put in perspective, in one year the amount of money saved equals the entire two-year tuition of Moraine Valley Community College for 68,000 students (twice the current enrollment).

In short, the problem of sorting mail (and of classifying handwritten digits) begs to be automated, and indeed it has been to some degree for about four decades. Let’s see how k-nearest-neighbors fares in this realm.

We obtain our data from the UCI machine learning repository, and with a few minor modifications, we present it on this blog’s Github page (along with the rest of the code used in this post). A single line of the data file represents a handwritten digit and its label. The digit is a 256-element vector obtained by flattening a 16×16 binary-valued image in row-major order; the label is an integer representing the number in the picture. The data file contains 1593 instances with about 160 instances per digit.

In other words, our metric space is $ \left \{ 0,1 \right \}^{256}$, and we choose the Euclidean metric for simplicity. With the line wrapping to better display the “image,” one line from the data file looks like:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6

After reading in the data appropriately, we randomly split the data set into two pieces, train on one piece, and test on the other. The following function does this,  returning the success rate of the classification algorithm on the testing piece.

import knn
import random

def column(A, j):
   return [row[j] for row in A]

def test(data, k):
   pts, labels = column(data, 0), column(data, 1)

   trainingData = pts[:800]
   trainingLabels = labels[:800]
   testData = pts[800:]
   testLabels = labels[800:]

   f = knn.makeKNNClassifier(trainingData, trainingLabels,
                             k, knn.euclideanDistance)
   correct = 0
   total = len(testLabels)

   for (point, label) in zip(testData, testLabels):
      if f(point) == label:
         correct += 1

   return float(correct) / total

A run with $ k=1$ gives a surprisingly good 89% success rate. Varying $ k$, we see this is about as good as it gets without any modifications to the algorithm or the metric. Indeed, the graph below shows that the handwritten digits data set agrees with the axiom of neighborliness to a fair approximation.

A graph of classification accuracy against k for values of k between 1 and 50. The graph clearly shows a downward trend as k increases, but all values k < 10 are comparably good.

Of course, there are many improvements we could make to this naive algorithm. But considering that it utilizes no domain knowledge and doesn’t manipulate the input data in any way, it’s not too shabby.

As a side note, it would be fun to get some tablet software and have it use this method to recognize numbers as one writes it. Alas, we have little time for these sorts of applications.

Advantages, Enhancements, and Problems

One reason k-nearest-neighbors is such a common and widely-known algorithm is its ease of implementation. Indeed, we implemented the core algorithm in a mere three lines of Python. On top of that, k-nearest-neighbors is pleasingly parallel, and inherently flexible. Unlike the Perceptron algorithm, which relies on linear separability, k-nearest-neighbors and the axiom of neighborliness allow for datasets with many different geometric structures. These lecture notes give a good example, as shown below, and the reader can surely conjure many more.

k-nearest-neighbors applied to a data set organized in concentric circles.

And of course, the flexibility is even greater by virtue of being able to use any metric for distance computations. One may, for instance, use the Manhattan metric if the points in question are locations in a city. Or if the data is sequential, one could use the dynamic time warping distance (which isn’t truly a metric, but is still useful). The possibilities are only limited by the discovery of new and useful metrics.

With such popularity, k-nearest-neighbors often comes with a number of modifications and enhancements. One enhancement is to heuristically remove certain points close to the decision boundary. This technique is called edited k-nearest-neighbors. Another is to weight certain features heavier in the distance computations, which requires one to programmatically determine which features help less with classification. This is getting close to the realm of a decision tree, and so we’ll leave this as an exercise to the reader.

The next improvement has to do with runtime. Given $ n$ training points and $ d$ features (d for dimension), one point requires $ O(nd)$ to classify. This is particularly expensive, because most of the distance computations performed are between points that are far away, and as $ k$ is usually small, they won’t influence in the classification.

One way to alleviate this is to store the data points in a data structure called a k-d tree. The k-d tree originated in computational geometry in the problem of point location. It partitions space into pieces based on the number of points in each resulting piece, and organizes the partitions into a tree. In other words, it will partition tightly where the points are dense, and loosely where the points are sparse. At each step of traversing the tree, one can check to see which sub-partition the unclassified point lies in, and descend appropriately. With certain guarantees, this reduces the computation to $ O(\log(n)d)$. Unfortunately, there are issues with large-dimensional spaces that are beyond the scope of this post. We plan to investigate k-d trees further in a future series on computational geometry.

The last issue we consider is in data scaling. Specifically, one needs to be very careful when converting the real world data into numerical data. We can think of each of the features as a random variable, and we want all of these random variables to have comparable variation. The reason is simply because we’re using spheres. One can describe k-nearest-neighbors as finding the smallest (filled-in) sphere centered at the unlabeled point containing $ k$ labeled data points, and using the most common of those labels to classify the new point. Of course, one can talk about “spheres” in any metric space; it’s just the set of all points within some fixed distance from the center (and this definition doesn’t depend on the dimension of the space). The important point is that a sphere has uniform length along every axis. If the data is scaled improperly, then the geometry of the sphere won’t mirror the geometry of the data, and the algorithm will flounder.

So now we’ve seen a smattering of topics about k-nearest-neighbors. We’d love to continue the discussion of modifications in the comments. Next time we’ll explore decision trees, and work with another data set. Until then!

The Perceptron, and All the Things it Can’t Perceive

This post assumes some basic familiarity with Euclidean geometry and linear algebra. Though we do not assume so much knowledge as is contained in our primer on inner product spaces, we will be working with the real Euclidean inner product. For the purpose of this post, it suffices to know about the “dot product” of two vectors.

The General Problem

One of the main problems in machine learning is to classify data. Rigorously, we have a set of points called a training set $ X = \left \{ \mathbf{p_i} \right \} \subset \mathbb{R}^n$ generated by some unknown distribution $ D$. These points are often numerical representations of real-world data. For instance, one dimension could represent age, while another could represent cholesterol level. Then each point $ \mathbf{p_i}$ would be a tuple containing the relevant numerical data for a patient. Additionally, each point has an associated label $ l_i = \pm 1$, which represents the class that piece of data belongs to. Continuing our cholesterol example, the labels here could represent whether the patient has heart disease. For now, we stick to two classes, though iterated techniques extend any binary classification method to a multiple-class problem.

Given such a training set, we seek a decision function $ f : \mathbb{R}^n \to \left \{ \pm 1 \right \}$ which is consistent with our training set (i.e. $ f(\mathbf{p_i}) = l_i$ for each $ i$) and generalizes well to all data generated by $ D$ (i.e. new data points not part of the training set). We want our decision function to treat future heart-disease patients as well as correctly classify those from the past.

With no restrictions on $ f$, one would imagine such a function is wild! In fact, if the distribution $ D$ has no patterns in it (e.g. random noise), then no such $ f$ exists. However, for many distributions one can discover functions which are good approximations. They don’t even have to be consistent with the training set, as long as they work reliably in general.

Arguably the simplest non-trivial example of such a decision function is a line in the plane which separates a training set $ X \subset \mathbb{R}^2$ into two pieces, one for each class label. This is precisely what the perceptron model does, and it generalizes to separating hyperplanes in $ \mathbb{R}^n$.

The Dawn of Machine-Kind: the Perceptron

“Now, consider the following: you were admitted to this robot asylum. Therefore, you must be a robot. Diagnosis complete.” ― Dr. Perceptron to Fry, Futurama.

The very first algorithm for classification was invented in 1957 by Frank Rosenblatt, and is called the perceptron. The perceptron is a type of artificial neural network, which is a mathematical object argued to be a simplification of the human brain. While at first the model was imagined to have powerful capabilities, after some scrutiny it has been proven to be rather weak by itself. We will formulate it in its entirety here.

Most readers will be familiar with the equation of a line. For instance, $ y = -2x$ is a popular line. We rewrite this in an interesting way that generalizes to higher dimensions.

First, rewrite the equation in normal form: $ 2x + y = 0$. Then, we notice that the coefficients of the two variables form a vector $ (2,1)$, so we may rewrite it with linear algebra as $ \left \langle (2,1),(x,y) \right \rangle = 0$, where the angle bracket notation represents the standard Euclidean inner product, also known as the dot product. We note that by manipulating the values of the coefficient vector, we can get all possible lines that pass through the origin.

We may extend this to all lines that pass through any point by introducing a bias weight, which is a constant term added on which shifts the line. In this case, we might use $ -1$ to get $ \left \langle (1,2),(x,y) \right \rangle – 1 = 0$. The term “bias” is standard in the neural network literature. It might be better to simply call it the “constant weight,” but alas, we will follow the crowd.

We give the coefficient vector and the bias special variable names, $ \mathbf{w}$ (for the common alternative name weight vector) and $ b$, respectively. Finally, the vector of variables $ (x,y)$ will be henceforth denoted $ \mathbf{x} = (x_1, x_2, \dots , x_n)$. With these new naming conventions, we rewrite the line equation (one final time) as $ \left \langle \mathbf{w}, \mathbf{x} \right \rangle + b = 0$.

Now we let the dimension of $ \mathbb{R}^n$ vary. In three dimensions, our $ \left \langle \mathbf{w}, \mathbf{x} \right \rangle + b = 0$ becomes $ w_1x_1 + w_2x_2 + w_3x_3 + b = 0$, the equation of a plane. As $ n$ increases, the number of variables does as well, making our “line” equation generalize to a hyperplane, which in the parlance of geometry is just an affine subspace of dimension $ n-1$. In any case, it suffices to picture a line in the plane, or a plane in three dimensions; the computations there are identical to an arbitrary $ \mathbb{R}^n$.

The usefulness of this rewritten form is in its evaluation of points not on the hyperplane. In particular, we define $ f : \mathbb{R}^n \to \mathbb{R}$ by $ f(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle + b$. Then taking two points $ \mathbf{x,y}$ on opposite sides of the hyperplane, we get that $ f(\mathbf{x}) < 0 < f(\mathbf{y})$ or $ f(\mathbf{y}) < 0 < f(\mathbf{x})$.

So if we can find the right coefficient vector and bias weight, such that the resulting hyperplane separates the points in our training set, then our decision function could just be which side of the line they fall on, i.e. $ \textup{sign}(f(\mathbf{x}))$. For instance, here is a bunch of randomly generated data in the unit square in $ \mathbb{R}^2$, and a separating hyperplane (here, a line).

A line separating the blue data points (-1) from the red data points (+1). The diagonal boundary of the blue shaded area is the separating line, and the blue shaded area represents the set of points the corresponding classification function would deem to be in the red class (+1).

The blue region is the region within the unit square where $ f(\mathbf{x}) \geq 0$, and so it includes the decision boundary. If we receive any new points, we could easily classify them by which side of this hyperplane they fall on.

Before we stop to think about whether such a hyperplane exists for every data set (dun dun dun!) let’s go ahead and construct an algorithm to find it. There is a very straightforward way to proceed: as long as the separating hyperplane makes mistakes, update the coefficient vector and bias to remove a mistake. In order to converge, we simply need that the amount by which we push the coefficient vector is small enough. Here’s a bit of pseudocode implementing that idea:

hyperplane = [0, 0, ..., 0], bias = 0
while some misclassification is made in the training set:
   for each training example (point, label):
      if label * (<hyperplane, point> + bias) <= 0:
         hyperplane += label * point
         bias += label

Upon encountering an error, we simply push the coefficients and bias in the direction of the failing point. Of course, the convergence of such an algorithm must be proven, but it is beyond the scope of this post to do so. The interested reader should see these “do-it-yourself” notes. The proof basically boils down to shuffling around inequalities, squeezing things, and applying the Cauchy-Schwarz inequality. The result is that the number of mistakes made by the algorithm before it converges is directly proportional to the volume enclosed by the training examples, and inversely proportional to the smallest distance from the separating hyperplane to any training point.

Implementation (and Pictures!)

As usual, we implemented this algorithm in Mathematica. And albeit with a number of helper functions for managing training sets in a readable way, and a translation of the pseudocode algorithm into functional programming, the core of the implementation is about as many lines as the pseudocode above. Of course, we include the entire notebook on this blog’s Github page. Here are our results:

We generate a set of 25 data points in the unit square, with points above the line $ y = 1-x$ in one class, and those below in the other. This animation shows the updates to the hyperplane at every step of the inner loop, stopping when it finds a separating line.

First, we note that this algorithm does not find “the best” separating line by any means. By “the best,” consider the following three pictures:

Clearly, the third picture achieves the “best” separation, because it has a larger distance from the training points than the other two. In other words, if we compute the training set margin $ \gamma = \textup{min}(f(\mathbf{p_i}))$, we claim a large $ \gamma$ implies a better separation. While the original perceptron algorithm presented here does not achieve a particularly small $ \gamma$ in general, we will soon (in a future post) modify it to always achieve the maximum margin among all separating hyperplanes. This will turn out to take a bit more work, because it becomes a convex quadratic optimization problem. In other words, finding the best separating hyperplane is conceptually and computationally more difficult than finding any separating hyperplane.

Finally, we test its generalization on a new set of 1000 generated points. We see that even with as few as 25 training samples, we get an accuracy of about 92 percent!

92.2% generalization accuracy. Not too shabby!

Here the blue region is the region of generated data in class +1, the red region (small sliver in the lower right corner) is the region that the perceptron falsely claims is in class +1, while the purple area is the overlap of the perceptron’s perceived +1 region and the true +1 region.

For a few measly lines of pseudocode, this algorithm packs a punch!

The Problem With Lines

As eagerly as we’d like to apply the perceptron algorithm to solve the problems of the world, we must take a step back and analyze the acceptable problems. In particular (and though this sounds obvious), the perceptron can only find hyperplanes separating things which can be separated by hyperplanes! We call such a training set linearly separable, and we admit that not every training set is so.

The smallest possible such example is three points on a line, where one point in class +1 separates two points in class -1. Historically, however, the first confounding counterexample presented was exclusive “or” function. Specifically, we have four points of the unit square arranged as follows:

(0,0), (1,1) -> +1
(1,0), (0,1) -> -1

The reader can verify that the perceptron loops infinitely on either of the given training sets, oscillating between the same hyperplanes over and over again.

Even though we may not have a linearly separable training set, we could still approximate a separation boundary with a hyperplane. Thus, we’d want to minimize the number of misclassifications of a hyperplane with respect to that training set. Amazingly, this problem in NP-complete. In other words, it is widely believed that the problem can’t be solved in polynomial time, so we can forget about finding a useful algorithm that works on large data sets. For a more complete discussion of NP-completeness, see this blog’s primer on the subject.

The need for linearly separable training data sets is a crippling problem for the perceptron. Most real-world distributions tend to be non-linear, and so anything which cannot deal with them is effectively a mathematical curiosity. In fact, for about twenty years after this flaw was discovered, the world lost interest in neural networks entirely. In our future posts, we will investigate the various ways researchers overcame this. But for now, we look at alternate forms of the perceptron algorithm.

The Dual Representation

We first note that the initial coefficient vector and bias weight for the perceptron were zero. At each step, we added or subtracted the training points to the coefficient vector. In particular, our final coefficient vector was simply a linear combination of the training points:

$ \displaystyle \mathbf{w} = \sum \limits_{i=1}^k \alpha_i l_i \mathbf{p_i}$

Here the $ \alpha_i$ are directly proportional (in general) to the number of times $ \mathbf{p_i}$ was misclassified by the algorithm. In other words, points which cause fewer mistakes, or those which are “easy” to classify, have smaller $ \alpha_i$. Yet another view is that the points with higher $ \alpha_i$ have greater information content; we can learn more about our distribution by studying them.

We may think of the $ \mathbf{\alpha}$ vector as a dual system of coordinates by which we may represent our hyperplane. Instead of this vector having the dimension of the Euclidean space we’re working in, it has dimension equal to the number of training examples. For problems in very high dimension (or perhaps infinite dimensions), this shift in perspective is the only way one can make any computational progress. Indeed, the dual problem is the crux of such methods like the so-called support vector machines.

Furthermore, once we realize that the hyperplane’s coordinate vector can be written in terms of the training points, we may rewrite our decision function as follows:

$ \displaystyle f(\mathbf{x}) = \textup{sign} \left ( \left \langle \sum \limits_{i=1}^k \alpha_i l_i \mathbf{p_i}, \mathbf{x} \right \rangle + b \right )$

By the linearity of an inner product, this becomes

$ \displaystyle f(\mathbf{x}) = \textup{sign} \left ( \sum \limits_{i=1}^k \alpha_i l_i \left \langle \mathbf{p_i}, \mathbf{x} \right \rangle + b \right )$

And the perceptron algorithm follows suit:

alpha = [0, 0, ..., 0], b = 0
while some misclassification is made in the training set:
   for each training example (i, point, label):
      if f(point, label) <= 0:
         alpha[i] += 1
         b += label

So in addition to finding a separating hyperplane (should one exist), we have a method for describing information content. Since we did not implement this particular version of the perceptron algorithm in our Mathematica notebook, we challenge the reader to do so. Once again, the code is available on this blog’s Github page, and feel free to leave a comment with your implementation.

Until next time!