The Perceptron, and All the Things it Can’t Perceive

This post assumes some basic familiarity with Euclidean geometry and linear algebra. Though we do not assume so much knowledge as is contained in our primer on inner product spaces, we will be working with the real Euclidean inner product. For the purpose of this post, it suffices to know about the “dot product” of two vectors.

The General Problem

One of the main problems in machine learning is to classify data. Rigorously, we have a set of points called a training set $ X = \left \{ \mathbf{p_i} \right \} \subset \mathbb{R}^n$ generated by some unknown distribution $ D$. These points are often numerical representations of real-world data. For instance, one dimension could represent age, while another could represent cholesterol level. Then each point $ \mathbf{p_i}$ would be a tuple containing the relevant numerical data for a patient. Additionally, each point has an associated label $ l_i = \pm 1$, which represents the class that piece of data belongs to. Continuing our cholesterol example, the labels here could represent whether the patient has heart disease. For now, we stick to two classes, though iterated techniques extend any binary classification method to a multiple-class problem.

Given such a training set, we seek a decision function $ f : \mathbb{R}^n \to \left \{ \pm 1 \right \}$ which is consistent with our training set (i.e. $ f(\mathbf{p_i}) = l_i$ for each $ i$) and generalizes well to all data generated by $ D$ (i.e. new data points not part of the training set). We want our decision function to treat future heart-disease patients as well as correctly classify those from the past.

With no restrictions on $ f$, one would imagine such a function is wild! In fact, if the distribution $ D$ has no patterns in it (e.g. random noise), then no such $ f$ exists. However, for many distributions one can discover functions which are good approximations. They don’t even have to be consistent with the training set, as long as they work reliably in general.

Arguably the simplest non-trivial example of such a decision function is a line in the plane which separates a training set $ X \subset \mathbb{R}^2$ into two pieces, one for each class label. This is precisely what the perceptron model does, and it generalizes to separating hyperplanes in $ \mathbb{R}^n$.

The Dawn of Machine-Kind: the Perceptron

“Now, consider the following: you were admitted to this robot asylum. Therefore, you must be a robot. Diagnosis complete.” ― Dr. Perceptron to Fry, Futurama.

One of the first algorithms for classification was invented in 1957 by Frank Rosenblatt, and is called the perceptron. The perceptron is a type of artificial neural network, a mathematical object argued to be a simplification of the human brain. While at first the model was imagined to have powerful capabilities, after some scrutiny it was proven to be rather weak by itself. We will formulate it in its entirety here.

Most readers will be familiar with the equation of a line. For instance, $ y = -2x$ is a popular line. We rewrite this in an interesting way that generalizes to higher dimensions.

First, rewrite the equation in normal form: $ 2x + y = 0$. Then, we notice that the coefficients of the two variables form a vector $ (2,1)$, so we may rewrite it with linear algebra as $ \left \langle (2,1),(x,y) \right \rangle = 0$, where the angle bracket notation represents the standard Euclidean inner product, also known as the dot product. We note that by manipulating the values of the coefficient vector, we can get all possible lines that pass through the origin.

We may extend this to all lines that pass through any point by introducing a bias weight, which is a constant term added on that shifts the line. In this case, we might use $ -1$ to get $ \left \langle (2,1),(x,y) \right \rangle - 1 = 0$. The term “bias” is standard in the neural network literature. It might be better to simply call it the “constant weight,” but alas, we will follow the crowd.

We give the coefficient vector and the bias special variable names, $ \mathbf{w}$ (for the common alternative name weight vector) and $ b$, respectively. Finally, the vector of variables $ (x,y)$ will be henceforth denoted $ \mathbf{x} = (x_1, x_2, \dots , x_n)$. With these new naming conventions, we rewrite the line equation (one final time) as $ \left \langle \mathbf{w}, \mathbf{x} \right \rangle + b = 0$.

Now we let the dimension of $ \mathbb{R}^n$ vary. In three dimensions, our $ \left \langle \mathbf{w}, \mathbf{x} \right \rangle + b = 0$ becomes $ w_1x_1 + w_2x_2 + w_3x_3 + b = 0$, the equation of a plane. As $ n$ increases, the number of variables does as well, making our “line” equation generalize to a hyperplane, which in the parlance of geometry is just an affine subspace of dimension $ n-1$. In any case, it suffices to picture a line in the plane, or a plane in three dimensions; the computations there are identical to an arbitrary $ \mathbb{R}^n$.

The usefulness of this rewritten form is in its evaluation of points not on the hyperplane. In particular, we define $ f : \mathbb{R}^n \to \mathbb{R}$ by $ f(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle + b$. Then taking two points $ \mathbf{x,y}$ on opposite sides of the hyperplane, we get that $ f(\mathbf{x}) < 0 < f(\mathbf{y})$ or $ f(\mathbf{y}) < 0 < f(\mathbf{x})$.

So if we can find the right coefficient vector and bias weight, such that the resulting hyperplane separates the points in our training set, then our decision function could just be which side of the line they fall on, i.e. $ \textup{sign}(f(\mathbf{x}))$. For instance, here is a bunch of randomly generated data in the unit square in $ \mathbb{R}^2$, and a separating hyperplane (here, a line).

A line separating the blue data points (-1) from the red data points (+1). The diagonal boundary of the blue shaded area is the separating line, and the blue shaded area represents the set of points the corresponding classification function would deem to be in the red class (+1).

The blue region is the region within the unit square where $ f(\mathbf{x}) \geq 0$, and so it includes the decision boundary. If we receive any new points, we could easily classify them by which side of this hyperplane they fall on.

Before we stop to think about whether such a hyperplane exists for every data set (dun dun dun!) let’s go ahead and construct an algorithm to find it. There is a very straightforward way to proceed: as long as the separating hyperplane makes mistakes, update the coefficient vector and bias to remove a mistake. In order to converge, we simply need that the amount by which we push the coefficient vector is small enough. Here’s a bit of pseudocode implementing that idea:

hyperplane = [0, 0, ..., 0], bias = 0
while some misclassification is made in the training set:
   for each training example (point, label):
      if label * (<hyperplane, point> + bias) <= 0:
         hyperplane += label * point
         bias += label

Upon encountering an error, we simply push the coefficients and bias in the direction of the failing point. Of course, the convergence of such an algorithm must be proven, but it is beyond the scope of this post to do so. The interested reader should see these “do-it-yourself” notes. The proof basically boils down to shuffling around inequalities, squeezing things, and applying the Cauchy-Schwarz inequality. The result is that the number of mistakes made by the algorithm before it converges is at most $ (R / \gamma)^2$, where $ R$ is the largest norm of a training point and $ \gamma$ is the smallest distance from the separating hyperplane to any training point.
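To make the loop concrete, here is a minimal runnable sketch in Mathematica (the names are our own, and we assume examples is a list of {point, label} pairs with labels $ \pm 1$; note the loop only terminates when the data are linearly separable):

(* train a perceptron on a list of {point, label} pairs *)
PerceptronTrain[examples_] := Module[{w, b, mistake, point, label},
   w = ConstantArray[0, Length[examples[[1, 1]]]];
   b = 0;
   mistake = True;
   While[mistake,
      mistake = False;
      Do[
         {point, label} = example;
         (* a nonpositive margin means this example is misclassified *)
         If[label (w.point + b) <= 0,
            w += label point;
            b += label;
            mistake = True],
         {example, examples}]];
   {w, b}];

(* the corresponding decision function *)
Classify[w_, b_][x_] := Sign[w.x + b]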

Implementation (and Pictures!)

As usual, we implemented this algorithm in Mathematica. Albeit with a number of helper functions for managing training sets in a readable way, and with a translation of the pseudocode algorithm into functional style, the core of the implementation is about as many lines as the pseudocode above. Of course, we include the entire notebook on this blog’s Github page. Here are our results:

We generate a set of 25 data points in the unit square, with points above the line $ y = 1-x$ in one class, and those below in the other. This animation shows the updates to the hyperplane at every step of the inner loop, stopping when it finds a separating line.

First, we note that this algorithm does not find “the best” separating line by any means. By “the best,” consider the following three pictures:

Clearly, the third picture achieves the “best” separation, because the separating line has a larger distance from the training points than the other two. In other words, if we compute the margin of the training set $ \gamma = \min_i \; l_i f(\mathbf{p_i})$ (with $ \mathbf{w}$ scaled to unit length, so that $ \gamma$ measures a distance), we claim a large $ \gamma$ implies a better separation. While the original perceptron algorithm presented here does not achieve a particularly large $ \gamma$ in general, we will soon (in a future post) modify it to always achieve the maximum margin among all separating hyperplanes. This will turn out to take a bit more work, because it becomes a convex quadratic optimization problem. In other words, finding the best separating hyperplane is conceptually and computationally more difficult than finding any separating hyperplane.
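As an aside, computing the geometric margin of a candidate hyperplane is a one-liner in Mathematica (our own helper, assuming parallel lists of points and labels):

margin[w_, b_][points_, labels_] :=
   Min[MapThread[#2 (w.#1 + b) &, {points, labels}]] / Norm[w]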

Finally, we test its generalization on a new set of 1000 generated points. We see that even with as few as 25 training samples, we get an accuracy of about 92 percent!

92.2% generalization accuracy. Not too shabby!

Here the blue region is the region of generated data in class +1, the red region (small sliver in the lower right corner) is the region that the perceptron falsely claims is in class +1, while the purple area is the overlap of the perceptron’s perceived +1 region and the true +1 region.

For a few measly lines of pseudocode, this algorithm packs a punch!

The Problem With Lines

As eagerly as we’d like to apply the perceptron algorithm to solve the problems of the world, we must take a step back and analyze the acceptable problems. In particular (and though this sounds obvious), the perceptron can only find hyperplanes separating things which can be separated by hyperplanes! We call such a training set linearly separable, and we admit that not every training set is so.

The smallest possible such example is three points on a line, where one point in class +1 separates two points in class -1. Historically, however, the first confounding counterexample presented was the exclusive “or” function. Specifically, we have four points of the unit square arranged as follows:

(0,0), (1,1) -> +1
(1,0), (0,1) -> -1

The reader can verify that the perceptron loops infinitely on either of the given training sets, oscillating among the same hyperplanes over and over again. Indeed, no separating line can exist for the latter: if some affine $ f$ classified all four points correctly, then $ f(1/2,1/2) = (f(0,0) + f(1,1))/2 > 0$ while simultaneously $ f(1/2,1/2) = (f(1,0) + f(0,1))/2 < 0$, a contradiction.

Even though we may not have a linearly separable training set, we could still approximate a separation boundary with a hyperplane. Thus, we’d want to minimize the number of misclassifications of a hyperplane with respect to that training set. Amazingly, this problem is NP-complete. In other words, it is widely believed that the problem can’t be solved in polynomial time, so we can forget about finding a useful algorithm that works on large data sets. For a more complete discussion of NP-completeness, see this blog’s primer on the subject.

The need for linearly separable training data sets is a crippling problem for the perceptron. Most real-world data sets are not linearly separable, and so anything which cannot deal with them is effectively a mathematical curiosity. In fact, for about twenty years after this flaw was discovered, the world lost interest in neural networks entirely. In our future posts, we will investigate the various ways researchers overcame this. But for now, we look at alternate forms of the perceptron algorithm.

The Dual Representation

We first note that the initial coefficient vector and bias weight for the perceptron were zero. At each step, we added or subtracted the training points to the coefficient vector. In particular, our final coefficient vector was simply a linear combination of the training points:

$ \displaystyle \mathbf{w} = \sum \limits_{i=1}^k \alpha_i l_i \mathbf{p_i}$

Here $ \alpha_i$ is simply the number of times $ \mathbf{p_i}$ was misclassified during training. In other words, points which cause fewer mistakes, or those which are “easy” to classify, have smaller $ \alpha_i$. Yet another view is that the points with higher $ \alpha_i$ have greater information content; we can learn more about our distribution by studying them.

We may think of the $ \mathbf{\alpha}$ vector as a dual system of coordinates by which we may represent our hyperplane. Instead of this vector having the dimension of the Euclidean space we’re working in, it has dimension equal to the number of training examples. For problems in very high dimension (or perhaps infinite dimensions), this shift in perspective is the only way one can make any computational progress. Indeed, the dual problem is the crux of methods like the so-called support vector machines.

Furthermore, once we realize that the hyperplane’s coordinate vector can be written in terms of the training points, we may rewrite our decision function as follows:

$ \displaystyle f(\mathbf{x}) = \textup{sign} \left ( \left \langle \sum \limits_{i=1}^k \alpha_i l_i \mathbf{p_i}, \mathbf{x} \right \rangle + b \right )$

By the linearity of an inner product, this becomes

$ \displaystyle f(\mathbf{x}) = \textup{sign} \left ( \sum \limits_{i=1}^k \alpha_i l_i \left \langle \mathbf{p_i}, \mathbf{x} \right \rangle + b \right )$
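In code, the dual decision function might look like this sketch in Mathematica (assuming parallel lists points and labels, the vector alpha of per-point mistake counts, and a bias b; the name dualDecide is our own):

dualDecide[points_, labels_, alpha_, b_][x_] :=
   Sign[Sum[alpha[[i]] labels[[i]] (points[[i]].x),
            {i, Length[points]}] + b]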

And the perceptron algorithm follows suit:

alpha = [0, 0, ..., 0], b = 0
while some misclassification is made in the training set:
   for each training example (i, point, label):
      if label * (sum_j alpha[j] * label[j] * <point[j], point> + b) <= 0:
         alpha[i] += 1
         b += label

So in addition to finding a separating hyperplane (should one exist), we have a method for describing information content. Since we did not implement this particular version of the perceptron algorithm in our Mathematica notebook, we challenge the reader to do so. Once again, the code is available on this blog’s Github page, and feel free to leave a comment with your implementation.

Until next time!

Graph Coloring, or Proof by Crayon

This is a precursor to a post which will actually use graph coloring to do interesting computational things. Even so, there are many fascinating ideas and theorems that result from graph coloring, so we devote an entire post just to it. For those who need an additional primer on the basic ideas of graph theory, see our gentle primer on the subject.

Why, Paris Hath Colour Enough.

How many colors are required to color the provinces of Costa Rica?

A common visual aid for maps is to color the regions of the map differently, so that no two regions which share a border also share a color. For example, to the right is a map of the provinces of Costa Rica (where the author is presently spending his vacation). It is colored with seven different colors, one for each province. Of course, before the days of 24-bit “true color,” designing maps was a long and detailed process, and coloring them was expensive. It was in a cartographer’s financial interest to minimize the number of colors used for any given map. This raises the obvious question, what is the minimum number of colors required to color Costa Rica in a way that no two provinces which share a common border share a common color?

With a bit of fiddling, the reader will see that the answer is obviously three. The astute reader will go further to notice that this map contains a “triangle” of three provinces which pairwise share common borders. Hence even these three alone cannot be colored with just two.

This problem becomes much trickier, and much more interesting to us mathematicians, as the number of provinces increases. So let us consider the map of the arrondissements of Paris.

The twenty arrondissements of Paris, which are arranged in a strangely nautilusy spiral.

With so many common borders, we require a more mathematically-oriented description of the problem. Enter graph theory.

Let us construct an undirected graph $ P$ (for Paris), in which the vertices are the arrondissements, and two vertices are connected by an edge if and only if they share a common border. In general, when we color graphs we do not consider two regions to share a border if their borders intersect in a single point, as in the four corners of the United States. Performing this construction on Paris, we get the following graph. Recognizing that the positions of the corresponding vertices are irrelevant, we shift them around to have a nicely spread-out picture.

Now that the relationships between arrondissements are decidedly unambiguous, we may rigorously define the problem of coloring a graph.

Definition: A coloring of a graph $ G = (V,E,\varphi )$ is a map $ f: V \to \left \{ 1 \dots n \right \}$, such that if $ v,w \in V$ are connected by an edge, then $ f(v) \neq f(w)$. We call $ n$ the size of a coloring, and if $ G$ has a coloring of size $ n$ we say that $ G$ is $ n$-colorable, or that it has an $ n$-coloring.

Definition: The chromatic number of a graph $ G$, denoted $ \chi(G)$, is the smallest positive integer $ k$ such that $ G$ is $ k$-colorable.

Our first map of Costa Rica was 3-colorable, and we saw indeed that $ \chi(G) = 3$.

We now turn our attention to coloring Paris. To do so, we use the following “greedy” algorithm. For the sake of the pseudocode, we represent colors by natural numbers.

Given a set of vertices V of a graph G:
   For each vertex w in V:
      let S_w be the set of colors which have been previously
        assigned to neighbors of w
      Color w with the smallest number that is not in S_w
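As a sketch of how this might look in Mathematica (our own code, processing vertices in whatever order VertexList provides):

GreedyColoring[g_Graph] := Module[{color = Association[], used},
   Do[
      (* the colors already assigned to neighbors of v *)
      used = Lookup[color, AdjacencyList[g, v], Nothing];
      (* assign the smallest color no neighbor uses *)
      color[v] = First[Complement[Range[Length[used] + 1], used]],
      {v, VertexList[g]}];
   color];

(* e.g. GreedyColoring[CycleGraph[5]] gives <|1 -> 1, 2 -> 2, 3 -> 1, 4 -> 2, 5 -> 3|> *)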

Obviously, the quality of the coloring depends on the order in which we investigate the vertices. Further, there always exists an ordering of the vertices for which this greedy algorithm produces a coloring which achieves the minimum required number of colors, $ \chi(G)$. (Take an optimal coloring and list the vertices of color 1 first, then those of color 2, and so on; greedy on this ordering never assigns a vertex a color larger than its original one.) Applying this to Paris, we get the following 4-coloring:

Translating this back into its original form, we now have a pretty picture of Paris.

The Parisians will be thrilled when they find out they’re at most 4-colorable.

But this is not enough. Can we do it with just three colors? The call of the wild beckons! We challenge the reader to find a 3-coloring of Paris, or prove its impossibility. In the mean time, we have other matters to attend to.

We note (without proof) that determining whether $ \chi(G) \leq k$ is NP-complete in general, so computing $ \chi(G)$ is at least as hard. Colloquially this means a fast solution is believed not to exist, and to find one (or prove this belief is true) would immediately make one the most famous mathematician in the world. In other words, it’s very, very hard. The only known general solutions take at worst an exponential or factorial amount of time with respect to the number of vertices in the graph. Trying to run such a solution on a graph of 50 vertices could take on the order of 50! (a 65-digit number) steps to finish, far beyond the computational capacity of any computer in the foreseeable future.

We Are But Planar Fellows, Sir.

Now we turn to a different aspect of graph theory, which somewhat simplifies the coloring process. The maps we have been investigating thus far are special. Specifically, their representations as graphs admit drawings in which no two edges cross. This is obvious by our construction, but on the other hand we recognize that not all graphs have such representations. For instance, try to draw the following graph so that no two edges cross:

If it seems hard, that’s because it’s impossible. The proof of this fact requires a bit more work, and we point the reader in the direction of Wagner’s Theorem. For now, we simply recognize that not all graphs have such drawings. This calls for a definition!

Definition: A simple graph is planar if it can be embedded in $ \mathbb{R}^2$, or, equivalently, if it can be drawn on a piece of paper in such a way that any two intersecting edges do so only at a vertex.

As we have seen, the graphs corresponding to maps of arrondissements or provinces or states are trivially planar. And after a lifetime of cartography, we might notice that every map we attempt to color admits a 4-coloring. We now have a grand conjecture, that every planar graph is 4-colorable.

The audacity of such a claim! You speak from experience, sir, but experience certainly does not imply proof. If we are to seriously tackle this problem, let us try a simpler statement first: that every planar graph is 5-colorable. In order to do so, we need a few lemmas:

Lemma: If a graph $ G$ has $ e$ edges, then $ \displaystyle 2e = \sum \limits_{v \in V} \textup{deg}(v)$.

Proof. Since the degree of a vertex is just the number of edges touching it, and every edge touches exactly two vertices, by counting up the degrees of each vertex we count twice the number of edges. $ \square$
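As a quick sanity check, one can verify this lemma in Mathematica on a random graph:

g = RandomGraph[{10, 20}];
Total[VertexDegree[g]] == 2 EdgeCount[g]   (* True *)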

Lemma: If a simple planar graph $ G$ on at least three vertices has $ e$ edges and $ f$ faces (regions bounded by edges, including the unbounded exterior region), then $ 3f \leq 2e$, and in particular $ f \leq (2/3) e$.

Proof. In such a graph, every face is bounded by at least three edges, and every edge borders at most two faces. So if we count, for each face, the edges bounding it, the total is at least $ 3f$; and since each edge is counted at most twice in this way, the total is also at most $ 2e$. Hence $ 3f \leq 2e$. $ \square$

We also recognize without proof the Euler Characteristic Formula for connected planar graphs (which has an equally easy proof, provided you understand what an edge contraction is): $ v-e+f=2$, where $ G$ has $ v$ vertices, $ e$ edges, and $ f$ faces.

Lemma: Every planar graph has a vertex of degree less than 6.

Proof. Supposing otherwise, we may arrive at a contradiction in Euler’s formula: $ 2 = v - e + f \leq v - e + (2/3) e$, so rearranging terms we get $ e \leq 3v - 6$, and so $ 2e \leq 6v - 12$. Recall that $ 2e$ is the sum of the degrees of the vertices of $ G$. If every vertex has degree at least 6, then $ 6v \leq 2e \leq 6v - 12$, which is just silly. $ \square$

This is all we need for the five color theorem, so without further ado:

Theorem: Every planar graph admits a 5-coloring.

Proof. Clearly every graph on fewer than 6 vertices has a 5-coloring. We proceed by induction on the number of vertices.

Suppose to the contrary that $ G$ is a graph on $ n$ vertices which requires at least 6 colors. By our lemma above, $ G$ has a vertex $ x$ of degree less than 6. Removing $ x$ from $ G$, the inductive hypothesis provides a 5-coloring of the remaining graph on $ n-1$ vertices. We simply need to color $ x$ with one of these five colors, and the theorem is proved.

If $ x$ has fewer than 5 neighbors or its neighbors use fewer than 5 colors, then we may assign it a color not used by any of its neighbors. Otherwise, we may assume $ x$ has five neighbors, and each neighbor has a distinct color (presumably requiring $ x$ to have a sixth, new color). Let $ v_1, v_2, v_3, v_4, v_5$ be the neighbors of $ x$, labelled in some clockwise order around $ x$, and say they have colors $ 1,2,3,4,5$, respectively.

Consider $ G_{1,3}$, the subgraph of $ G$ which contains only vertices of colors 1 and 3. There are two cases: either $ v_1$ is in the same connected component as $ v_3$ (there exists a path connecting them which does not go through $ x$), or not. If not, then we may take the component which contains $ v_1$, and switch the colors 1 and 3. Then $ x$ is connected to two vertices of color 3, and we may color it 1.

Otherwise, there is a path from $ v_1$ to $ v_3$ using only vertices of colors 1 and 3. Together with $ x$, this path forms a closed curve which encloses $ v_2$ and separates it from $ v_4$ and $ v_5$. Planarity then implies that any path from $ v_2$ to $ v_4$ must cross this curve, and so $ v_2$ and $ v_4$ lie in different connected components of $ G_{2,4}$, the subgraph of colors 2 and 4. Hence, we may repeat the color-swapping argument on $ G_{2,4}$ to give $ x$ color 2. $ \square$

So that was entirely done with elementary arguments, and is a pretty neat proof. Of course, we really want the head of the four-color theorem on a mantle, but he is a much more elusive beast.

Computers Doing Proofs? Blasphemy.

But that is what it took to finish off the four-color theorem. Despite a hundred years of research and diversions, it took a brute-force computer search of 1,936 configurations, and over 1200 computer hours, to prove that no counterexample to the four-color theorem exists.

It was proved in 1976 by Appel and Haken, and it was very controversial for a long time, principally because no human could verify that the computer hadn’t made an error. On the other hand, it was not hard to verify the correctness of the program itself, but after 1200 CPU hours, who can guarantee a bit wasn’t mistakenly flipped by some cosmic rays or a magnetic malfunction? Luckily, the proof was further simplified and reproved over the next twenty years, so today the problem is widely believed to be solved.

So indeed it is true! We may color every map with just four colors! Unfortunately, even a sketch of the proof is far beyond the scope of this blog. We are content to marvel at the work ethic of other mathematicians.

Next time, we will discuss the applications of graph coloring to computer problems, such as job scheduling and register allocation in compilers.

Until then!

Google’s Page Rank – The Final Product

Dangling Nodes and Non-Uniqueness

Recall where we left off last time. Given a web $ W$ with no dangling nodes, the link matrix for $ W$ has 1 as an eigenvalue, and if the corresponding eigenspace has dimension 1, then any associated eigenvector gives a ranking of the pages in $ W$ which is consistent with our goals.

The first problem is that if there is a dangling node, our link matrix has a column of all zeros, and is no longer column-stochastic. Such a non-negative matrix whose columns each sum to at most 1 is called column-substochastic, and for a column-substochastic matrix we cannot guarantee that 1 is an eigenvalue, the obvious counterexample being the zero matrix.

Second, as we saw last time, webs which have disconnected subwebs admit eigenspaces of large dimension, and hence our derived rankings are not unique.

We will fix both of these problems in one fell swoop: by adjusting the link matrix to make all entries positive.

The motivation for this comes from the knowledge of a particular theorem, called the Perron-Frobenius Theorem. While the general statement says a great deal, here are the parts we need:

Theorem: Let $ A$ be a positive, column-stochastic matrix. Then there exists a maximal eigenvalue $ 0 < \lambda \leq 1$ such that all other eigenvalues are strictly smaller in magnitude, and the eigenspace associated to $ \lambda$ has dimension 1. (In fact, for a column-stochastic matrix this maximal eigenvalue is exactly $ \lambda = 1$.) This unique eigenvalue and eigenvector (up to scaling) are called the Perron eigenvalue and eigenvector.

We won’t prove this theorem, because it requires a lot of additional background. But we will use it to make life great. All we need is a positive attitude.

A Drunkard’s Surf

Any tinkering with the link matrix must be done in a sensible way. Unfortunately, we can’t “add” new links to our web without destroying its original meaning. So we need to view the resulting link matrix in a different light. Enter probability theory.

So say you’re drunk. You’re surfing the web and at every page you click a random link. Suppose further that every page is just a list of links, so it’s not harder to find some than others, and you’re equally likely to pick any link on the site. If you continue surfing for a long time, you’d expect to see pages with lots of backlinks more often than those with few backlinks. As you sober up, you might realize that this is a great way to characterize how important a webpage is! You quickly write down the probabilities for an example web into a matrix, with each $ i,j$ entry being the probability that you click on a link from page $ j$ to go to page $ i$.

This is a bit of drunken mathematical gold! We’ve constructed precisely the same link matrix for a web, but found from a different perspective. Unfortunately, after more surfing you end up on a page that has no links. You cannot proceed, so you randomly type in a URL into the address bar, and continue on your merry way. Soon enough, you realize that you aren’t seeing the same webpages as you did in the first walk. This random URL must have taken you to a different connected component of the web. Brilliant! Herein we have the solution to both of our problems: add in some factor of random jumping.

To do this, and yet maintain column-stochasticity, we need to proportionally scale the elements of our matrix. Let $ A$ be our original link matrix, and $ B$ be the $ n \times n$ matrix with all entries $ 1/n$. Then form a new matrix:

$ C = pB + (1-p)A, 0 \leq p \leq 1$

In words, $ C$ has a factor of egalitarianism proportional to $ p$. All we need to verify is that $ C$ is still column-stochastic, and this is clear since each column sum looks like the following for a fixed $ j$ (and $ a_{i,j}$ denotes the corresponding entry of $ A$):

$ \sum \limits_{i=1}^n(\frac{p}{n} + (1-p)a_{i,j}) = p\sum \limits_{i=1}^n \frac{1}{n} + (1-p)\sum \limits_{i=1}^na_{i,j} = p + (1-p) = 1$
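As a sanity check, here is a quick numerical experiment in Mathematica with a hypothetical three-page web (the link matrix a below is our own arbitrary example):

n = 3; p = 0.15;
a = {{0, 1/2, 1}, {1/2, 0, 0}, {1/2, 1/2, 0}};   (* a column-stochastic link matrix *)
c = p ConstantArray[1/n, {n, n}] + (1 - p) a;
Total[c]                 (* {1., 1., 1.}: each column of c still sums to 1 *)
Abs[Eigenvalues[N[c]]]   (* the largest magnitude is exactly 1; all others are smaller *)

(It is known that the second-largest eigenvalue magnitude of such a matrix is at most $ 1-p$, which foreshadows the convergence discussion below.)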

So, applying the Perron-Frobenius theorem to our new matrix (where the value $ p$ becomes a parameter in need of tuning), there is a unique largest positive eigenvalue $ \lambda \leq 1$ for this web with an eigenspace of dimension 1. Picking any eigenvector within that eigenspace and then normalizing it to a unit vector with positive entries, we have our ranking algorithm!

Aside: note that the assumption that the Perron eigenvector has all positive entries is not unfounded. This follows as a result of our new matrix $ C$ being irreducible. The details of this implication are unnecessary, as the Perron-Frobenius theorem provides the positivity of the Perron eigenvector. Furthermore, all other eigenvectors (corresponding to any other eigenvalues) must have an entry which is either negative or non-real. Hence, we only have one available eigenvector to choose for any valid ranking.

Computing the Beast

A link matrix for a web of any reasonable size is massive. As of June 20th, 2011, Google has an indexed database of close to 46 billion web pages, and in its infancy at Stanford, PageRank was tested on a web of merely 24 million pages. Clearly, one big matrix cannot fit in the memory of a single machine. But before we get to the details of optimizing for scale, let us compute page ranks for webs of modest sizes.

The problem of computing eigenvectors was first studied around the beginning of the 1900s. The first published method was called the power method, and it involves approximating the limit of the sequence

$ \displaystyle v_{n+1} = \frac{Cv_n}{||Cv_n||}$

for some arbitrary initial starting vector $ v_0$. Intuitively, when we apply $ C$ to $ v$, it “pulls” $ v$ toward each of its eigenvectors proportionally to their associated eigenvalues. In other words, the largest eigenvalue dominates. So an element of this sequence which has a high index will have been pulled closer and closer to an eigenvector corresponding to the largest eigenvalue (in absolute value), and hence approximate that eigenvector. Since we don’t want our vector to grow arbitrarily in size, we normalize it at each step. A more detailed analysis exists which formalizes this concept, but we only need to utilize it.

Under a few assumptions this sequence is guaranteed to converge. Specifically, we require that the matrix is real-valued, there exists a dominant eigenvalue, and the starting vector has a nonzero component in the direction of an eigenvector corresponding to the dominant eigenvalue. Luckily for us, existence is taken care of by the Perron-Frobenius theorem, real-valuedness by construction, and we may trivially pick an initial vector of all 1s. Here is the pseudocode:

C <- link matrix
v <- (1,1,...,1)

while true:
   previous = v
   v = dot(C, v)
   v /= norm(v)
   break if norm(v-previous) < epsilon

return v

The only thing we have to worry about is the inefficiency in computing $ v = Cv$, when $ C$ is a matrix that is dense, in the sense that there are very few 0’s. That means computing $ Cv$ will take $ \Theta(n^2)$ multiplications, even though there are far fewer links in the original link matrix $ A$. For a reasonable web of 24 million pages, this is mind-bogglingly slow. So we will replace this line with its original derivation:

   v = (p/n) * sum(v) + (1-p) * dot(A, v)

Here the first term is a scalar added to every entry of the vector, since each entry of $ Bv$ equals $ \frac{1}{n} \sum_j v_j$. With the recognition that there exist special algorithms for sparse matrix multiplication, this computation is much faster.

Finally, our choice of epsilon is important, because we have yet to speak of how fast this algorithm converges. According to the analysis (linked) above, we can say a lot more than big-O type runtime complexity. The sequence actually converges geometrically, in the sense that each $ v_k$ is closer to the limit than $ v_{k-1}$ by a factor of $ r, 0 < r < 1$, which is proven to be $ \frac{|\lambda_2|}{|\lambda_1|}$, the ratio of the magnitudes of the two largest eigenvalues. If $ |\lambda_2|$ is very close to $ |\lambda_1|$, then the method will converge slowly. However, according to the research done on PageRank, the value of $ r$ usually sits at about 0.85, making the convergence rather speedy. So picking reasonably small values of epsilon (say, $ 10^{-10}$) will not kill us: with $ r = 0.85$, driving the error below $ 10^{-10}$ takes roughly $ \log(10^{-10}) / \log(0.85) \approx 142$ iterations, independent of the size of the web.

Implementation

Of course, writing pseudocode is child’s play compared to actually implementing a real algorithm. For pedagogical purposes we chose to write PageRank in Mathematica, which has some amazing visualization features for graphs and plots, and built-in computation rules for sparse matrix multiplication. Furthermore, Mathematica represents graphs as a single object, so we can have very readable code. You can download the entire Mathematica notebook presented here from this blog’s Github page.

The code for the algorithm itself is not even fifteen lines of code (though it’s wrapped here to fit). We make heavy use of the built in functions for manipulating graph objects, and you should read the Mathematica documentation on graphs for more detailed information. But it’s pretty self-explanatory, black-box type code.

rulesToPairs[i_ -> j_] := {i,j};
SetAttributes[rulesToPairs, Listable];

PageRank[graph_, p_] := Module[{v, n, prev, edgeRules, degrees,
                                linkMatrixRules, linkMatrix},
   edgeRules = EdgeRules[graph];
   degrees = VertexOutDegree[graph]; (* the out-degree of each vertex *)
   n = VertexCount[graph];

   (* setting up the sparse array as a list of rules;
      this assumes the vertices are named 1 through n *)
   linkMatrixRules =
      Table[{pt[[2]], pt[[1]]} -> 1/degrees[[pt[[1]]]],
            {pt, rulesToPairs[edgeRules]}];
   linkMatrix = SparseArray[linkMatrixRules, {n, n}];

   v = Table[1.0, {n}];
   While[True,
      prev = v;
      v = (p/n) Total[v] + (1-p) Dot[linkMatrix, v];
      v = v/Norm[v];
      If[Norm[v-prev] < 10^(-10), Break[]]
   ];

   Return[Round[N[v], 0.001]]
];

And now to test it, we simply provide it a graph object, which might look like

Graph[{1->2, 2->3, 3->4, 4->2, 4->1, 3->1, 2->4}]

And it spits out the appropriate answer. Now, the output of PageRank is just a list of numbers between 0 and 1, and it’s not very easy to see what’s going on as you change the parameter $ p$. So we have some visualization code that gives very pretty pictures. In particular, we set the size of each vertex to be its page rank. Page rank values are conveniently within the appropriate range for the VertexSize option.

visualizePageRanks[G_, p_] := Module[{ranks},
   ranks = PageRank[G,p];
   Show[
     Graph[
      EdgeRules[G], 
      VertexSize -> Table[i -> ranks[[i]], {i, VertexCount[G]}],
      VertexLabels -> "Name",
      ImagePadding -> 10
     ]
   ]
];

And we have some random graphs to work with:

randomGraph[numVertices_, numEdges_] :=
  RandomGraph[{numVertices, numEdges},
              DirectedEdges -> True,
              VertexLabels -> "Name",
              ImagePadding -> 10]

Here’s the result for a random graph on 10 vertices and 30 edges, with $ p = 0.25$.

The vertices are sized proportional to their page rank

Furthermore, using Mathematica’s neat (but slow) commands for animation, we can see what happens as we vary the parameter $ p$ between zero and one:

Pretty neat!

Surprisingly enough (given that this is our first try implementing PageRank), the algorithm scales without issue. A web of ten-thousand vertices and thirty-thousand edges takes a mere four seconds on an Atom 1.6 GHz processor with a gig of RAM. Unfortunately (and this is where Mathematica starts to show its deficiencies) the RandomGraph command doesn’t support constructions of graphs with as few as 100,000 vertices and 200,000 edges. We leave it as an exercise to the reader to test the algorithm on larger datasets (hint: construct a uniformly distributed list of random rules, then count up the out-degrees of each vertex, and modify the existing code to accept these as parameters).

To give a better idea of how the algorithm works with respect to varying parameters, we have the following two graphs. The first is a runtime plot for random graphs where the number of vertices is fixed at 100 and the number of edges varies between 100 and 1000. Interestingly enough, the algorithm seems to run quickest when there are close to twice the number of edges as there are vertices.

The graph dips around 200 edges, twice the number of vertices

Next, we investigate the effects of varying $ p$, the egalitarianism factor, between 0 and 1 for random graphs with 100 vertices and 200 edges. Unsurprisingly, the runtime is fastest when we are completely egalitarian, and $ p$ is close to 1.

The more egalitarian we are willing to be, the faster the ranking is performed.

Google reportedly used $ p = 0.15$, so there probably were not significant gains in performance from the tuning of that parameter alone. Further, the structure of the web is not uniform; obviously there are large link hubs and many smaller sites which have relatively few links. With a bit more research into the actual density of links in the internet, we could do much better simulations. However, this sort of testing is beyond the scope of this blog.

So there you have it! A fully-functional implementation of PageRank, which scales as well as one could hope a prototype to. Feel free to play around with the provided code (assuming you don’t mind Mathematica, which is honestly a very nice language), and comment with your findings!

Next time we’ll wrap up this series with a discussion of the real-world pitfalls of PageRank. We will likely stray away from mathematics, but a discussion of the consequences of such a high-profile ranking algorithm is necessary for completeness.

Page Rank Series
An Introduction
A First Attempt
The Final Product
Why It Doesn’t Work Anymore

Well Orderings and Search

Binary Search

Binary search is perhaps the first and most basic nontrivial algorithm a student learns. For the mathematicians out there, binary search is a fast procedure to determine whether a sorted list contains a particular element. Here is a pseudocode implementation:

# Binary Search:
# Given a list L, sorted via the total order <, and a sought
# element x, return true iff L contains x.

function binarySearch(L, x, <):
   # base case
   if(length(L) == 1):
      return L[0] == x
   middleIndex = floor(length(L) / 2)
   if (L[middleIndex] == x):
      return true

   # inductive step, with ellipsis notation meaning slices of L
   # from the beginning and to the end, respectively
   if (x < L[middleIndex]):
      return binarySearch(L[...middleIndex-1], x, <)
   else:
      return binarySearch(L[middleIndex+1...], x, <)

Colloquially, this is the optimal strategy in a game of “guess the number,” where the guesser is told if her guess is correct, too high, or too low. Try the middle number in the range of possible numbers. If the guess is too high, try the number which is 1/4th in the ordering, otherwise try 3/4ths, continuing this process until the number is guessed. This algorithm is obviously made for recursion (and for those advanced programmers, we resign to hope our working language supports tail-call optimization).

Binary search’s runtime is rather easy to analyze. At each step of the algorithm, we either finish, or cut the problem size in half. It follows that binary search runs in $ O(\log n)$ time, where $ n$ is the length of the input list. In other words, in at worst $ \log_2 n$ steps, we will reduce the input list to have size 1, and can definitively say whether the list as a whole contains the sought element. For instance, a list of a thousand elements requires at most ten halvings, since $ 2^{10} = 1024$.

Notice that the success of the algorithm depends on the fact that the list is sorted, but the specific total order $ <$ is irrelevant. We will investigate this idea further, but first we need some deeper maths.

Well and Total Orders

For those of us who aren’t set theorists, it isn’t fair to talk about total orders and well orders without defining them. So here comes a definition:

Definition: A strict total order $ <$ is a relation on a set $ S$ with the following statements holding for all $ a, b, c \in S$:

  • It is never the case that $ a < a$. (anti-reflexivity)
  • Exactly one of the following is true: $ a < b, b < a,$ or $ a = b$. (trichotomy)
  • If $ a < b$ and $ b < c$, then $ a < c$. (transitivity)

These conditions create an environment in which sorting is possible: we need to be able to compare any two elements (trichotomy), we need to be able to inspect two elements at a time and know that our analysis generalizes to the whole list (transitivity), and we can’t break the world of strictness (anti-reflexivity).

Aside: The use of a strict total order as opposed to a non-strict total order is irrelevant, because each strict total order corresponds bijectively to a non-strict total order. Hence, there are two equivalent formulations of sorting with respect to strict and non-strict total orders, and we may choose one arbitrarily.

Now, we may elevate a total order $ <$ to a well order if every non-empty subset $ R \subset S$ has a least element with respect to $ <$. We computer scientists only sort finite lists, so every total order is automatically a well order. However, the reader may be interested in the mathematical need for such a distinction, so here it is:

Consider the integers $ \mathbb{Z}$ with the standard ordering $ <$. While $ \mathbb{Z}$ itself has no smallest element, neither does any subset which has infinitely many negative numbers, such as the evens or odds. More generally, any open interval in the real numbers $ \mathbb{R}$ obviously doesn’t have a least element with respect to the natural order. In contrast, we rely on the crucial fact that the set of natural numbers $ \mathbb{N}$ is well-ordered to apply mathematical induction.

Interestingly enough, a theorem due to Ernst Zermelo states that every set can be well ordered, and it is equivalent to the Axiom of Choice. While many people have a hard time visualizing a well-ordering of the real numbers $ \mathbb{R}$, we simply resign (as mystical as it is) to admit there is one out there somewhere, though we may never know what it is.

As another aside, it turns out that we only need one of the inequalities in $ (<, \leq, >, \geq)$ and the standard logical operations and (infix &&), or (infix ||), and not (prefix !) in order to define the other three (and indeed, to define $ =, \neq$ as well). This is a computer science trick that comes in handy, as we’ll see later, so here is the described equivalence. Given $ <$, we define the remaining operations as follows:

  • $ x > y$ is equivalent to $ y < x$
  • $ x \geq y$ is equivalent to $ !(x < y)$
  • $ x \leq y$ is equivalent to $ !(y < x)$
  • $ x = y$ is equivalent to $ !(x < y) \textup{ and } !(y < x)$
  • $ x \neq y$ is equivalent to $ x < y \textup{ or } y < x$

So if we are interested in sorting a set via some procedure, all we need from the user is the $ <$ operation, and then we may compare any way our heart desires.
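In Mathematica (the language of the script below), these definitions might look like the following sketch, where lt is any user-supplied strict total order:

gt[x_, y_, lt_]  := lt[y, x];
geq[x_, y_, lt_] := !lt[x, y];
leq[x_, y_, lt_] := !lt[y, x];
eq[x_, y_, lt_]  := !lt[x, y] && !lt[y, x];
neq[x_, y_, lt_] := lt[x, y] || lt[y, x];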

Defining a New Well Order

Consider a deck of cards which is initially sorted (with some arbitrary ordering on the suits), and then is “cut” at some arbitrary point and the bottom part is placed on top of the deck. We may simplify this “cut” operation to a list of numbers, say ten, and provide the following example of a cut:

(5,6,9,10,11,13,1,3,3,4)

To pick a standard working language, we say the “cut point” of this list is 5, not 4.

We have a few (naive) options to search through cut data: we may sort it with respect to the natural total order on $ \mathbb{N}$, and then search through it; we may stick the elements of the list into a hash set (a constant-time lookup table), and then query existence that way; or we may traverse the list element-by element looking for a particular value.

The problem with the first two methods, though they determine existence, is that they don’t allow us to know where the value is in the list. If this is not important, and we are searching many times (compared to the size of the list) on many different values, then a hash table would be the correct choice. However, if we are searching only a few times, and need to know where the value is hidden, all three of the above approaches are slow and inelegant.

Enter well orders. You may have noticed that a cut list of numbers has a very simple well ordering in terms of the natural order on $ \mathbb{N}$. Verbally, if two numbers are separated across the cut point, then the one which is larger in the natural order is the smaller of the two in the new order. Otherwise, we may appeal to the regular ordering. Here it is in pseudocode:

function lessThan(x, y):
   if (y < cutPoint <= x):
      return true
   else if (x < cutPoint <= y):
      return false
   else:
      return x < y

And we may compress these if statements into a single condition:

function lessThan(cutPoint, x, y):
   y < cutPoint <= x || (!(x < cutPoint <= y) && x < y)

function makeCutOrdering(cutPoint):
   lambda x,y: lessThan(cutPoint, x, y)

So we have found that a cut list of numbers is in fact well ordered with respect to this new relation. Forget about preprocessing the data, we can just do a binary search using this new ordering! Here’s a Mathematica script that does just that. Here we assume constant-time list length calculations, and fast slicing (which Mathematica has). Note that list indexing and slicing has double square bracket [ [ ] ] syntax, while function application is single square brackets [ ]. It should look very similar to the earlier pseudocode for binary search.

eq[x_, y_, lt_] := !lt[x,y] && !lt[y,x];

(* base cases: an empty list contains nothing *)
BinarySearch[{}, _, _] := False;
BinarySearch[{oneElt_}, sought_, lt_] := eq[oneElt, sought, lt];

(* inductive step *)
BinarySearch[lst_List, sought_, lt_] :=
   Module[{size = Length[lst], midVal, midNdx},
      midNdx = Floor[size/2];
      midVal = lst[[midNdx]];
      If[eq[midVal, sought, lt],
         True,
         If[lt[sought, midVal],
            BinarySearch[lst[[ ;; midNdx - 1]], sought, lt],
            BinarySearch[lst[[midNdx + 1 ;; ]], sought, lt]
         ]
      ]
   ];

Notice that if we use the standard $ <$ (in Mathematica, the function Less), then the BinarySearch function reverts to a standard binary search. Marvelous! Now we have a reusable piece of code that searches through any well-ordered set, provided we provide the correct well order.
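For instance, searching the cut list from earlier works as expected (here cutLess is a hypothetical Mathematica rendering of the lessThan pseudocode above):

cutLess[cutPoint_][x_, y_] :=
   (y < cutPoint <= x) || (!(x < cutPoint <= y) && x < y);

BinarySearch[{5, 6, 9, 10, 11, 13, 1, 3, 3, 4}, 9, cutLess[5]]   (* True *)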

The lesson to take from this is: know your data! If your input list is not sorted, but still structured in some way, then there’s a good chance it is sorted with respect to a non-natural total order. For example, many operating systems order filenames which end in numbers oddly (e.g. “file1”, “file11”, “file12”, “file2”), and in certain places in the world, financial calendars are ordered differently (in Australia, the fiscal year starts in July). So take advantage of that, and you’ll never need to write binary search again.