# Google’s Page Rank – The Final Product

## Dangling Nodes and Non-Uniqueness

Recall where we left off last time. Given a web $W$ with no dangling nodes, the link matrix for $W$ has 1 as an eigenvalue, and if the corresponding eigenspace has dimension 1, then any associated eigenvector gives a ranking of the pages in $W$ which is consistent with our goals.

The first problem is that if there is a dangling node, our link matrix has a column of all zeros, and is no longer column-stochastic. In fact, such a non-negative matrix which has columns summing to 1 or columns of all zeros is called column-substochastic. We cannot guarantee that 1 is an eigenvalue, with the obvious counterexample being the zero matrix.

Second, as we saw last time, webs which have disconnected subwebs admit eigenspaces of large dimension, and hence our derived rankings are not unique.

We will fix both of these problems in one fell swoop: by adjusting the link matrix to make all entries positive.

The motivation for this comes from the knowledge of a particular theorem, called the Perron-Frobenius Theorem. While the general statement says a great deal, here are the parts we need:

Theorem: Let $A$ be a positive, column-stochastic matrix. Then there exists a maximal eigenvalue $0 < \lambda \leq 1$ such that all other eigenvalues are strictly smaller in magnitude. Further, the eigenspace associated to $\lambda$ has dimension 1. This unique eigenvalue and eigenvector (up to scaling) are called the Perron eigenvalue and eigenvector.

We won’t prove this theorem, because it requires a lot of additional background. But we will use it to make life great. All we need is a positive attitude.

## A Drunkard’s Surf

Any tinkering with the link matrix must be done in a sensible way. Unfortunately, we can’t “add” new links to our web without destroying its original meaning. So we need to view the resulting link matrix in a different light. Enter probability theory.

So say you’re drunk. You’re surfing the web and at every page you click a random link. Suppose further that every page is just a list of links, so it’s not harder to find some than others, and you’re equally likely to pick any link on the site. If you continue surfing for a long time, you’d expect to see pages with lots of backlinks more often than those with few backlinks. As you sober up, you might realize that this is a great way to characterize how important a webpage is! You quickly write down the probabilities for an example web into a matrix, with each $i,j$ entry being the probability that you click on a link from page $j$ to go to page $i$.

This is a bit of drunken mathematical gold! We’ve constructed precisely the same link matrix for a web, but found from a different perspective. Unfortunately, after more surfing you end up on a page that has no links. You cannot proceed, so you randomly type in a URL into the address bar, and continue on your merry way. Soon enough, you realize that you aren’t seeing the same webpages as you did in the first walk. This random URL must have taken you to a different connected component of the web. Brilliant! Herein we have the solution to both of our problems: add in some factor of random jumping.

To do this, and yet maintain column-stochasticity, we need to proportionally scale the elements of our matrix. Let $A$ be our original link matrix, and $B$ be the $n \times n$ matrix with all entries $1/n$. Then form a new matrix:

$C = pB + (1-p)A, 0 \leq p \leq 1$

In words, $C$ has a factor of egalitarianism proportional to $p$. All we need to verify is that $C$ is still column-stochastic, and this is clear since each column sum looks like the following for a fixed $j$ (and $a_{i,j}$ denotes the corresponding entry of $A$):

$\sum \limits_{i=1}^n(\frac{p}{n} + (1-p)a_{i,j}) = p\sum \limits_{i=1}^n \frac{1}{n} + (1-p)\sum \limits_{i=1}^na_{i,j} = p + (1-p) = 1$

So, applying the Perron-Frobenius theorem to our new matrix (where the value $p$ becomes a parameter in need of tuning), there is a unique largest positive eigenvalue $\lambda \leq 1$ for this web with an eigenspace of dimension 1. Picking any eigenvector within that eigenspace and then normalizing it to a unit vector with positive entries, we have our ranking algorithm!

Aside: note that the assumption that the eigenvalue has all positive entries is not unfounded. This follows as a result of our new matrix $C$ being irreducible. The details of this implication are unnecessary, as the Perron-Frobenius theorem provides the positivity of the Perron eigenvector. Furthermore, all other eigenvectors (corresponding to any eigenvalues) must have an entry which is either negative or non-real. Hence, we only have one available eigenvector to choose for any valid ranking.

## Computing the Beast

A link matrix for a web of any reasonable size is massive. As of June 20th, 2011, Google has an indexed database of close to 46 billion web pages, and in its infancy at Stanford, PageRank was tested on a web of merely 24 million pages. Clearly, one big matrix cannot fit in the memory of a single machine. But before we get to the details of optimizing for scale, let us compute page ranks for webs of modest sizes.

The problem of computing eigenvectors was first studied around the beginning of the 1900s. The first published method was called the power method, and it involves approximating the limit of the sequence

$\displaystyle v_{n+1} = \frac{Cv_n}{||Cv_n||}$

for some arbitrary initial starting vector $v_0$. Intuitively, when we apply $C$ to $v$, it “pulls” $v$ toward each of its eigenvectors proportionally to their associated eigenvalues. In other words, the largest eigenvalue dominates. So an element of this sequence which has a high index will have been pulled closer and closer to an eigenvector corresponding to the largest eigenvalue (in absolute value), and hence approximate that eigenvector. Since we don’t want our vector to grow arbitrarily in size, we normalize it at each step. A more detailed analysis exists which formalizes this concept, but we only need to utilize it.

Under a few assumptions this sequence is guaranteed to converge. Specifically, we require that the matrix is real-valued, there exists a dominant eigenvalue, and the starting vector has a nonzero component in the direction of an eigenvector corresponding to the dominant eigenvalue. Luckily for us, existence is taken care of by the Perron-Frobenius theorem, real-valuedness by construction, and we may trivially pick an initial vector of all 1s. Here is the pseudocode:

C <- link matrix
v <- (1,1,...,1)

while true:
previous = v
v = dot(C, v)
v /= norm(v)
break if norm(v-previous) < epsilon

return v

The only thing we have to worry about is the inefficiency in computing $v = Cv$, when $C$ is a matrix that is dense, in the sense that there are very few 0’s. That means computing $Cv$ will take $\Theta(n^2)$ multiplications, even though there are far fewer links in the original link matrix $A$. For a reasonable web of 24 million pages, this is mind-bogglingly slow. So we will replace this line with its original derivation:

   v = (p/n)*v + (1-p)*dot(A,v)

With the recognition that there exist special algorithms for sparse matrix multiplication, this computation is much faster.

Finally, our choice of epsilon is important, because we have yet to speak of how fast this algorithm converges. According to the analysis (linked) above, we can say a lot more than big-O type runtime complexity. The sequence actually converges geometrically, in the sense that each $v_k$ is closer to the limit than $v_{k-1}$ by a factor of $r, 0 < r < 1$, which is proven to be $\frac{|\lambda_2|}{|\lambda_1|}$, the ratio of the second and first dominant eigenvalues. If $\lambda_2$ is very close to $\lambda_1$, then the method will converge slowly. However, according to the research done on PageRank, the value of $r$ usually sits at about 0.85, making the convergence rather speedy. So picking reasonably small values of epsilon (say, $10^{-10}$) will not kill us.

## Implementation

Of course, writing pseudocode is child’s play compared to actually implementing a real algorithm. For pedantic purposes we chose to write PageRank in Mathematica, which has some amazing visualization features for graphs and plots, and built-in computation rules for sparse matrix multiplication. Furthermore, Mathematica represents graphs as a single object, so we can have very readable code. You can download the entire Mathematica notebook presented here from this blog’s Github page.

The code for the algorithm itself is not even fifteen lines of code (though it’s wrapped here to fit). We make heavy use of the built in functions for manipulating graph objects, and you should read the Mathematica documentation on graphs for more detailed information. But it’s pretty self-explanatory, black-box type code.

rulesToPairs[i_ -> j_] := {i,j};
SetAttributes[rulesToPairs, Listable];

PageRank[graph_, p_] := Module[{v, n, prev, edgeRules, degrees,
edgeRules = EdgeRules[graph];
degrees = VertexCount[graph];
n = VertexCount[graph];

(* setting up the sparse array as a list of rules *)
Table[{pt[[2]],pt[[1]]} -> 1/degrees[[pt[[1]]]]],
{pt, rulesToPairs[edgeRules]}];

v = Table[1.0, {n}];
While[True,
prev = v;
v = (p/n) + (1-p)Dot[linkMatrix, v];
v = v/Norm[v];
If[Norm[v-prev] < 10^(-10), Break[]]
];

Return[Round[N[v], 0.001]]
];

And now to test it, we simply provide it a graph object, which might look like

Graph[{1->2, 2->3, 3->4, 4->2, 4->1, 3->1, 2->4}]

And it spits out the appropriate answer. Now, the output of PageRank is just a list of numbers between 0 and 1, and it’s not very easy to see what’s going on as you change the parameter $p$. So we have some visualization code that gives very pretty pictures. In particular, we set the size of each vertex to be its page rank. Page rank values are conveniently within the appropriate range for the VertexSize option.

visualizePageRanks[G_, p_] := Module[{ranks},
ranks = PageRank[G,p];
Show[
Graph[
EdgeRules[G],
VertexSize -> Table[i -> ranks[[i]], {i, VertexCount[G]}],
VertexLabels -> "Name",
]
]
];

And we have some random graphs to work with:

randomGraph[numVertices_, numEdges_] :=
RandomGraph[{numVertices, numEdges},
DirectedEdges -> True,
VertexLabels -> "Name",
ImagePadding -> 10]

Here’s the result for a random graph on 10 vertices and 30 edges, with $p = 0.25$.

The vertices are sized proportional to their page rank

Furthermore, using Mathematica’s neat (but slow) commands for animation, we can see what happens as we vary the parameter $p$ between zero and one:

Pretty neat!

Surprisingly enough (given that this is our first try implementing PageRank), the algorithm scales without issue. A web of ten-thousand vertices and thirty-thousand edges takes a mere four seconds on an Atom 1.6 GHz processor with a gig of RAM. Unfortunately (and this is where Mathematica starts to show its deficiencies) the RandomGraph command doesn’t support constructions of graphs with as few as 100,000 vertices and 200,000 edges. We leave it as an exercise to the reader to test the algorithm on larger datasets (hint: construct a uniformly distributed list of random rules, then count up the out-degrees of each vertex, and modify the existing code to accept these as parameters).

To give a better idea of how the algorithm works with respect to varying parameters, we have the following two graphs. The first is a runtime plot for random graphs where the number of vertices is fixed at 100 and the number of edges varies between 100 and 1000. Interestingly enough, the algorithm seems to run quickest when there are close to twice the number of edges as there are vertices.

The graph dips around 200 edges, twice the number of vertices

Next, we investigate the effects of varying $p$, the egalitarianism factor, between 0 and 1 for random graphs with 100 vertices and 200 edges. Unsurprisingly, the runtime is fastest when we are completely egalitarian, and $p$ is close to 1.

The more egalitarian we are willing to be, the faster the ranking is performed.

Google reportedly used $p = 0.15$, so there probably were not significant gains in performance from the tuning of that parameter alone. Further, the structure of the web is not uniform; obviously there are large link hubs and many smaller sites which have relatively few links. With a bit more research into the actual density of links in the internet, we could do much better simulations. However, this sort of testing is beyond the scope of this blog.

So there you have it! A fully-functional implementation of PageRank, which scales as well as one could hope a prototype to. Feel free to play around with the provided code (assuming you don’t mind Mathematica, which is honestly a very nice language), and comment with your findings!

Next time we’ll wrap up this series with a discussion of the real-world pitfalls of PageRank. We will likely stray away from mathematics, but the consequences of such a high-profile ranking algorithm is necessary for completeness.

# Google’s PageRank – A First Attempt

## The Web as a Graph

The goal of this post is to assign an “importance score” $x_i \in [0,1]$ to each of a set of web pages indexed $v_i$ in a way that consistently captures our idea of which websites are likely to be important.

But before we can extract information from the structure of the internet, we need to have a mathematical description of that structure. Enter graph theory.

Definition: A web $W$ is a directed graph $(V, E, \varphi)$ with web pages $v_i \in V$, hyperlinks $e_i \in E$, and $\varphi : E \to V \times V$ providing incidence structure.

For the remainder of this post, a web will have $n$ pages, indexing from 1. Incoming links are commonly called backlinks, and we define the number of incoming directed edges as the in-degree of a vertex, here denoted $\textup{in}(v_i)$. Similarly, the out-degree is the number of outgoing edges, denoted $\textup{out}(v_i)$.

This is a very natural representation of the internet. Pages link to each other in a one-directional way. Our analysis of importance within a web will rely on a particular axiom (which we will later find dubious), that content creators usually link to websites they find important. With that in mind, the importance score of a web page should have something to do with the density of incoming links.

Of course, for simple webs this characterization is obvious. We give two examples of webs, the first of which has an obvious ranking, while the second is more ambiguous.

A web in which the rankings are obvious. Page 1 wins

A trickier web to rank. Page 1 appears to win.

In the first web, page 1 obviously is the most important, with pages 2,4, and 5 about the same, and poor page 3 bringing up the rear. In the second, it is conceivable that there is a sensible ranking, but it is much harder to see visually.

## Crawl, then Walk

As per our above discussion, our first and most naive ranking would be to just count up the in-degree of each vertex, giving the following equation to compute a page’s importance score:

$x_i = \textup{in}(v_i)$

Unfortunately, this is not descriptive enough, and has at least one major pitfall: every link has equal value. The whole point of our ranking algorithm is to determine which websites we can trust for good content. Since “content” includes links to other content, a backlink from CNN World News should increase a page’s importance score more than a backlink from UselessJunk.com. Presently, our naive equation doesn’t capture this, but we can modify the algorithm above to reflect it. Letting $S_i$ be the set of indices of pages with links to $v_i$, we have

$\displaystyle x_i = \sum\limits_{j \in S_i} x_j$

Momentarily accepting its self-referential nature (we can’t know $x_j$ without already knowing $x_i$), we can compute the importance score of any page as just the sum of the scores of pages which link to it. This still needs some tweaking, because here a page with a high importance score can gain too much influence simply by having a lot of links. Thus, we weight each term of the sum by the out-degree of the corresponding vertex, arriving at:

$\displaystyle x_i = \sum\limits_{j \in S_i} \dfrac{x_j}{\textup{out}(v_j)}$

So here we have a voting system where votes correspond to links within a web, and if we can find a solution to this system of equations, we will have a sound set of rankings.

Let us do an extended example, using the second example web above (the trickier one), which we will call $W$. If we write out the equation for each page $v_i$, we get

$\begin{matrix} x_1 & = & 0x_1 & + & \frac{1}{2}x_2 & + & \frac{1}{2}x_3 & + & \frac{1}{3}x_4 \\ x_2 & = & 0x_1 & + & 0x_2 & + & 0x_3 & + & \frac{1}{3}x_4 \\ x_3 & = & 0x_1 & + & \frac{1}{2}x_2 & + & 0x_3 & + & \frac{1}{3}x_4 \\ x_4 & = & 1 x_1 & + & 0x_2 & + & \frac{1}{2}x_3 & + & 0x_4\end{matrix}$

This is a system of four equations with four unknowns, and so we may either solve it or prove its inconsistency. Enter linear algebra. Rewriting the system in matrix form, we wish to find a solution $\mathbf{x} = (x_1, x_2, x_3, x_4)$ to the equation $A \mathbf{x} = \mathbf{x}$, where $A$, which we call the link matrix for $W$, is as below.

$A = \begin{pmatrix} 0 & \frac{1}{2} & \frac{1}{2} & \frac{1}{3} \\ 0 & 0 & 0 & \frac{1}{3} \\ 0 & \frac{1}{2} & 0 & \frac{1}{3} \\ 1 & 0 & \frac{1}{2} & 0 \end{pmatrix}$

Look at that! Our problem of ranking pages in this web has reduced to finding an eigenvector for $A$ corresponding to the eigenvalue 1. This particular matrix just so happens to have 1 as an eigenvalue, and the corresponding eigenvector is $(\frac{3}{4}, \frac{1}{3},\frac{1}{2},1)$. This solution is unique up to scaling, because the eigenspace for 1, which we denote $E_1$ is one-dimensional.

Before we continue, note that this solution is counter-intuitive: page 1 had the most incoming links, but is second in the ranking! Upon closer examination, we see that page 1 votes only for page 4, transferring its entire importance score to page 4. So along with page 3’s vote, page 4 is rightfully supreme.

The astute reader might question whether every link matrix has 1 as an eigenvalue. Furthermore, the eigenspace corresponding to 1 might have large dimension, and hence admit many different and irreconcilable rankings. For now we will sidestep both of these frightening problems with sufficiently strong hypotheses, deferring a courageous and useful solution to next time.

## Assume the Scary Thoughts Away

We begin by noticing that the link matrix in the example above has non-negative entries and columns which sum to 1. Such a matrix is called column-stochastic, and it has fascinating properties. It is easy to see that if every page in a web has an outgoing link (there are no dangling nodes in the web), then the link matrix for that web is column-stochastic. We make use of this observation in the following theorem:

Theorem: The link matrix $A$ for a web $W$ with no dangling nodes has 1 as an eigenvalue.

Proof. Recall that $A$ and $A^{\textup{T}}$ have identical eigenvalues (there are many ways to prove this, try it as an exercise!). Let $\bf{x} = (1,1, \dots , 1)$, and we see that $A^{\textup{T}} \bf{x}$ has as its entries the sums of the rows of $A^{\textup{T}}$, which are in turn the sums of the columns of $A$. Since $A$ is column-stochastic, each entry of $A \bf{x}$ is 1, and the theorem is proved. $\square$

So we have proven that excluding dangling nodes, the link matrix or any web has a useful ranking. Unfortunately, it is not the case that these rankings are unique up to scaling. In other words, it is not the case that every column-stochastic matrix has $\textup{dim}(E_1)=1$. Consider the following link matrix:

$\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} & 0 \end{pmatrix}$

We have two linearly independent eigenvectors $(1,1,0,0,0), (0,0,1,1,1)$ both in $E_1$, and so any linear combination of them is also a vector in $E_1$. It’s not clear which, if any, we should use. Fortunately, this ambiguity exists for a good reason: the web corresponding to this link matrix has two disconnected subwebs. In real-world terms, there is no common reference frame in which to judge page 1 against page 3, so our solution space allows us to pick any frame (linear combination) we wish. In fact, it is not difficult to prove that any web which has $m$ disconnected subwebs will admit $\textup{dim}(E_1) \geq m$ (try it as an exercise!). In other words, we can’t prove our way to a unique ranking without some additional tweaking of the link matrix construction. Next time we will do just that.

So at this point we have solved a very restricted version of the ranking problem: we can provide sensible rankings for webs which have no dangling nodes and for which the eigenspace corresponding to the eigenvalue 1 happens to have dimension 1. Of course, in the real world, the internet has very many dangling nodes, and many disconnected subwebs. The algorithm is a home without a roof as it is (no better than a cave, really). As interesting as it is, our work so far has just been the scaffolding for the real problem.

So next time, we’ll end with a finished PageRank algorithm, code and all. Until then!

## Importance on the Web

As a society living in the “Information Age,” it comes as no surprise that we are faced with the task of sorting through vast oceans of content. With the admission that most content is actually junk, we must wisely choose the objects of our analysis. The appropriately named site UselessJunk.com certainly doesn’t deserve the same attention as the BBC World News page, and yet within the monstrous heart of the internet, it requires the maturity of the human psyche to discriminate their relative worth.

However, as with many real-world systems, there is a method to the madness. We have the following two important observations to inspire our quest:

• The internet is a human construction, and hence
• The internet intrinsically reflects some level of human interest in its content.

If the second statement is true, that the internet somehow reflects what information people want to absorb (or as is often these days, games they want to play, videos they want to watch, restaurants they want to find), then all we need to do is extract this information without human aid. In other words, by first picking out information that is generally desirable, we can reduce the black hole of available (and mostly useless) information to a manageable size for human consumption.

## Search Engines

From a much sharper perspective, information selection is precisely what Google does for us every day. If there’s something we’d like to know more about, we type in the relevant search keywords, and out comes a list of pages sorted (more or less) in order by relevance. Of course, “relevance” is not the correct term, but rather “popularity,” and hence “authority,” but we will explain why this characterization makes more sense later. For now, we’ll provide an overview of what a search engine does:

A search engine has four main parts:

• A crawler
• An indexer
• A ranking algorithm, and
• A query engine

A web crawler methodically copies pages from the web into a database, keeping track of hyperlinks between pages. An indexer then revisits these copies, determining relevant keywords to optimize their later retrieval (leaning on immense amounts of work in cognitive psychology, linguistics, and mathematics). The pages are then ranked according to a particular ranking algorithm. And finally the user is provided with a query engine (the search bar) to access these records, which are displayed in order according to the ranking algorithm.

While each part above is a fascinating problem in itself, we will focus primarily on the third: the ranking algorithm. In its infancy, Google’s novel approach rested in its ranking algorithm, affectionately named PageRank after co-founder Larry Page (other co-founder Sergey Brin). Though it is not presented here in a readable way, this is (a condensed version of) the original paper presenting the Google search engine. Of course, Larry and Sergey needed much more elbow grease to make Google as slick as it is today, and likewise the paper delves into details on local data storage, parsing, chunking, and all sorts of error-handling mechanisms that are beyond the scope of this series. We will stick to investigating and proving the validity of the mathematical model underlying PageRank itself.

And so, the predominant questions throughout this series will be:

• Which websites on the internet are worth my time? and, more importantly,
• How can we extract this from the structure of the internet?

The answer to the second question lies in a jungle of fantastically interconnected maths. First, we’ll represent the structure of the web as a directed graph. Then, we’ll compute importance of every web page simultaneously by solving a (very large) system of linear equations. While this sounds straightforward enough, there will be a number of bumps along the way to a satisfactory solution, and we will derive each twist and turn to bask in the elegance of the process.

For those of you who are unfamiliar with graph theory and linear algebra, we plan to provide additional (terse) primers for the necessary definitions and basic notions. Otherwise, look forward to the next part in this series, where we’ll construct PageRank for an internet with some strong hypotheses restricting its form. Until then!