Hunting Serial Killers


“Tonight’s the Night”

A large volume of research goes into the psychological and behavioral analysis of criminals. In particular, serial criminals hold a special place in the imagination and nightmares of the general public (at least, the American public). Those with the opportunity to become serial criminals are logical, cool-tempered, methodical, and, of course, dangerous. They walk among us on crowded city streets, or drive slowly down an avenue looking for their next victim. They are sometimes neurotic sociopaths, and other times amicable, charming models of society and business. But most of all, they know their craft well. They work slowly enough to not make mistakes, but fast enough to get the job done and feel good about it. Their actions literally change lives.

In other words, they would be good programmers. If only they all hadn’t given up trying to learn C++!

In all seriousness, a serial killer’s rigid methodology sometimes lends itself nicely to mathematical analysis. For an ideal serial criminal (ideal in the sense of being analyzable), we have the following two axioms of criminal behavior:

  1. A serial criminal will not commit crimes too close to his base of operation.
  2. A serial criminal will not travel farther than necessary to find victims.

The first axiom is reasonable because a good serial criminal does not want to arouse suspicion from his neighbors. The second axiom roughly describes an effort/reward ratio that keeps serial offenders from travelling too far away from their homes.

These axioms have a large amount of criminological research behind them. While there is little unifying evidence (the real world is far too messy), there are many bits and pieces supporting these claims. For example, the frequency of burglaries peaks about a block from the offender’s residence, while almost none occur closer than a block (Turner, 1969, “Delinquency and distance”). Further, many serial rapists commit subsequent rapes (or rather, abductions preceding rape) within a half mile of the previous one (LeBeau, 1987, “Patterns of stranger and serial rape offending”). There are plenty more examples of these axioms in action in the criminology literature.

On the other hand, there are many types of methodical criminals whose behavior does not fit these axioms. Some killers murder while traveling the country, while others pick victims with such specific characteristics that they must hunt in a single location. So we take the following models with a grain of salt, in that they only apply to a certain class of criminal behavior.

With these ideas in mind, if we knew a criminal’s base of operation, we could construct a mathematical model of his “buffer zone,” inside of which he does not commit crime. With high probability, most of his crimes will lie just outside the buffer zone. This in itself is not useful in the grand scheme of crime-fighting. If we know a criminal’s residence, we need not look any further. The key to this model’s usefulness is working in reverse: we want to extrapolate a criminal’s residence from the locations of his crimes. Then, after witnessing a number of crimes we believe to be committed by the same person, we may optimize a search for the offender’s residence. We will use the geographic locations of a criminal’s activity to accurately profile him, hence the name, geoprofiling.

Murder, She Coded

Historically, the first geoprofiling model was crafted by a criminologist named Dr. Kim Rossmo. Initially, he overlaid the crime locations on a sufficiently fine $ n \times m$ grid of cells. Then he used his model to calculate the probability of the criminal’s residence lying within each cell. Rossmo’s formula is displayed below, and explained subsequently.

$ \displaystyle P(x) = \sum \limits_{\textup{crime locations } c} \left[ \frac{\varphi}{d(x,c)^f} + \frac{(1-\varphi)B^{g-f}}{(2B-d(x,c))^g} \right]$,

where $ \varphi = 1$ if $ d(x,c) > B$, and $ \varphi = 0$ otherwise.

Here, $ x$ is an arbitrary cell in the grid, and $ d(x,c)$ is the distance from that cell to a crime location $ c$ under some fixed metric $ d$. The variable $ \varphi$ determines which of the two summands to nullify, based on whether the cell in question lies in the buffer zone. $ B$ is the radius of the buffer zone, and $ f, g$ are empirically tuned parameters. Variations in $ f$ and $ g$ change the steepness of the decay curve before and after the buffer radius. We admit to having no idea why they need to be related, and cannot find a good explanation in Rossmo’s novel of a dissertation; Rossmo simply claims both parameters should be equal. For the purposes of this blog we find their exact values irrelevant, and put them somewhere between a half and two thirds.
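One sanity check we can make (our own observation, not an argument from Rossmo’s dissertation): the factor $ B^{g-f}$ makes the two summands agree at the buffer boundary. Setting $ d(x,c) = B$ in the second summand gives

$ \displaystyle \frac{B^{g-f}}{(2B - B)^g} = \frac{B^{g-f}}{B^g} = \frac{1}{B^f}$,

which is exactly the value of the first summand at $ d(x,c) = B$, so the decay curve is continuous as a crime’s distance crosses the buffer radius.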

This model reflects the inherent symmetry in the problem. If we may say that an offender commits a crime outside a buffer of some radius $ B$ surrounding his residence, then we may also say that the residence is likely outside a buffer of the same radius surrounding each crime! For a fixed location, we may compute the probability of the offender’s residence being there with respect to each individual crime, and just sum them up.

This equation, while complete, has a better description for programmers, one that is decidedly easier to chew in small bites:

Let d = d(x,c)
if d > B:
   P(x) += 1/d^f
else: 
   P(x) += B^(g-f)/(2B-d)^g

Then we may simply loop this routine over all such $ c$ for a fixed $ x$, and sum to get our value. Here we see the ideas clearly: outside the buffer zone of a crime, the probability of residence decays with a power law, while within the buffer zone it increases as we approach the buffer radius.

Now, note that these “probabilities” are not, strictly speaking, probabilities, because they are not normalized in the unit interval $ [0,1]$. We may normalize them if we wish, but all we really care about are the relative cell values to guide our search for the perpetrator. So we abuse the term “high probability” to mean “relatively high value.”

Finally, the distance metric we actually use in the model is the so-called taxicab metric. Since this model is supposed to be relevant to urban serial criminals (indeed, where the majority of cases occur), the taxicab metric more accurately describes a person’s mental model of distance within a city, because it accounts for roadways. Note that in order for this to work as desired, the map used must be rotated so that its streets lie parallel to the $ x,y$ axes. We will assume for the rest of this post that the maps are rotated appropriately, as this is a problem with implementation and not the model itself or our prototype.
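For concreteness, the taxicab (or Manhattan) distance between two cells is just the sum of the absolute coordinate differences,

$ \displaystyle d((x_1, y_1), (x_2, y_2)) = |x_1 - x_2| + |y_1 - y_2|$,

which is precisely what Mathematica’s built-in ManhattanDistance function computes in the code below.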

Rossmo’s model is very easy to implement in any language, but probably easiest to view and animate in Mathematica. As usual, the entire program for the examples presented here is available on this blog’s Github page. The decay function is just a direct translation of the pseudocode:

rossmoDecay[p1_, p2_, bufferLength_, f_, g_, distance_] :=
  With[{d = distance[p1, p2]},
   If[d > bufferLength,
    1/(d^f),
    (bufferLength^(g - f))/(2 bufferLength - d)^g]];

We then construct a function which computes the decay from a fixed cell for each crime site:

makeRossmoFunction[sites_, buffer_, f_, g_] :=
  Function[{x, y},
   Apply[Plus,
    Map[rossmoDecay[#,{x,y},buffer,f,g,ManhattanDistance] &,
     sites]]];

Now we may construct a “Rossmo function” (initializing the parameters of the model), and map the resulting function over each cell in our grid:

Array[makeRossmoFunction[sites, 14, 1/3, 2/3], {60, 50}];

Here the Array function accepts a function $ f$ and a specification of the dimensions of the array. Each index tuple is then fed to $ f$, and the resulting number is stored in the $ i,j$ entry of the array. In our case, $ f: \mathbb{Z}_{60} \times \mathbb{Z}_{50} \to \mathbb{R}^+$. We use as a test the following three fake crime sites:

sites = {{20, 25}, {47, 10}, {55, 40}};
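The plotting code itself lives in the notebook on the Github page; the following is only a rough sketch of one way to produce such a picture (the variable name rossmoArray is ours, and the styling surely differs from the actual figures):

(* store the grid of values, then view it as a heat map; transpose so
   rows correspond to y, and flip so that y increases upward *)
rossmoArray = Array[makeRossmoFunction[sites, 14, 1/3, 2/3], {60, 50}];
ArrayPlot[Transpose[rossmoArray], DataReversed -> True,
 ColorFunction -> "TemperatureMap"]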

Upon plotting the resulting array, we have the following pretty picture:

A test of Rossmo’s geographic profiling model on three points.

Here, the crime locations are at the centers of each of the diamonds, and cells with more reddish colors have higher values. Specifically, the “hot spot” for the criminal’s residence is in the darkest red spot in the bottom center of the image.

As usual, in order to better visualize the varying parameters, we have the following two animations:

Variation in the “f” parameter from 0.1 to 1.25 in steps of 0.05. “g” is fixed at 2/3.

Variation in the “g” parameter from 0.1 to 1.25 in steps of 0.05. “f” is fixed at 1/2.

Variation in the $ B$ parameter simply increases or decreases the size of the buffer zone. In both animations above we have it fixed at 14 units.

Despite the pretty pictures, a mathematical model is nothing without empirical evidence to support it. Now, we turn to an analysis of this model on real cases.

“Excellent!” I cried. “Elementary mathematics,” said he.

Richard Chase

The first serial killer we investigate is Richard Chase, also known as the Vampire of Sacramento. One of the creepiest murderers in recent history, Richard Chase believed he had to drink the blood of his victims in order to live. In the month of January 1978, Chase killed five people, dumping their mutilated bodies in locations near his home.

Before we continue with the geographic locations of this particular case, we need to determine which locations are admissible. For instance, we could analyze abduction sites, body drop sites, locations of weapons caches or even where the perpetrator’s car was kept. Unfortunately, many of these locations are not known during an investigation. At best only approximate abduction sites can be used, and stash locations are usually uncovered after an offender is caught.

For the sake of the Chase case, and subsequent cases, we will stick to the most objective data points: the body drop sites. We found this particular data in Rossmo’s dissertation, page 272 of the pdf document. Overlaid on a 30 by 30 grid, they are:

richardChaseSites =
   {{3, 17}, {15, 3}, {19, 27}, {21, 22}, {25, 18}};
richardChaseResidence = {19,17};

Then, running the model on these sites, we have the following probability map:

The Rossmo probability map for the Richard Chase body drop sites. Here B = 5, f = 1/2, g = 1
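For reference, this map can be recomputed from the functions defined earlier; a sketch, with the parameters taken from the caption above (the variable name chaseMap is ours):

chaseMap = Array[makeRossmoFunction[richardChaseSites, 5, 1/2, 1], {30, 30}];
(* the center of gravity discussed below is just the componentwise mean *)
Mean[richardChaseSites]  (* {83/5, 87/5} *)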

If we overlay the location of Chase’s residence in purple, we see that it is very close to the hottest cell, and well within the hot zone. In addition, we compare this with another kind of geoprofile: the center of gravity of the five sites. We color the center of gravity in black, and see that it is farther from Chase’s residence than the hot zone. Finally, we make the crime sites easy to see by coloring them green.

Additional data points: center of gravity in black, Chase’s residence in purple, and crime sites in green.

Albert DeSalvo

This is a great result for the model! Let us see how it fares on another case: Albert DeSalvo, the Boston Strangler.

With a total of 13 murders, and suspected in over 300 sexual assaults, DeSalvo is a prime specimen for analysis. DeSalvo entered his victims’ homes with a repertoire of lies, claiming to be a maintenance worker, the building plumber, or a motorist with a broken-down car. He then proceeded to tie his victims to a bed, sexually assault them, and strangle them with articles of clothing. Sometimes he tied a bow in the cords he strangled his victims with.

We again use the body drop sites, which in this case are equivalent to encounter sites. They are:

deSalvoSites = {{10, 48}, {13, 8}, {15, 11}, {17, 8}, {18, 7},
  {18, 9}, {19, 4}, {19, 8}, {20, 9}, {20, 10}, {20, 11},
  {29, 23}, {33, 28}};
deSalvoResidence = {19,18};

Running Rossmo’s model again, including the same extra coloring as for the Chase murders, we get the following picture:

The Rossmo probability map for Albert DeSalvo’s murders. Here B=10, f= 1/2, g = 1.

Again, we win. DeSalvo’s residence falls right in the darker of our two main hot zones. With this information, the authorities would certainly apprehend him in a jiffy. On the other hand, the large cluster of murders on the left-hand side pulls the center of gravity too close to the crime sites themselves. In this way we see that the center of gravity is not a good “measure of center” for murder cases. Indeed, it violates the buffer principle, which holds strongly in these two cases.

Peter Sutcliffe

Finally, we investigate Peter Sutcliffe, more infamously known as the Yorkshire Ripper. Sutcliffe murdered 13 women and attacked at least 6 others between 1975 and 1980 in the county of West Yorkshire, UK. He often targeted prostitutes, hitting them over the head with a hammer and proceeding to sexually molest and mutilate their bodies. He was finally caught with a prostitute in his car, but was not connected to the Yorkshire Ripper murders until police returned to the scene of his arrest to look for additional evidence. They found his murder weapons, and promptly prosecuted him.

We list his crime locations below. Note that these include body drop sites and the attack sites for non-murders, which were later reported to the police.

sutcliffeSites = {{5, 1}, {8, 7}, {50, 99}, {53, 68}, {56, 72},
  {59, 59}, {62, 57}, {63, 85}, {63, 87}, {64, 83}, {69, 82},
  {73, 88}, {80, 88}, {81, 89}, {83, 88}, {83, 87}, {85, 85},
  {85, 83}, {90, 90}};
sutcliffeResidences = {{60, 88}, {58, 81}};

Notice that over the course of his five-year spree, he lived in two residences. One of these he moved to with his wife of three years (he started murdering after marrying her). It is unclear whether the move changed his choice of body drop locations.

Unfortunately, our attempts to pinpoint Sutcliffe’s residence with Rossmo’s model fail miserably. With one static image, guessing at the buffer radius, we have the following probability map:

Failure in the form of a probability map.

As we see, both the center of gravity and the hot zones are far from either of Sutcliffe’s residences. Indeed, even with a varying buffer radius, we are still led to search in unfruitful locations:

An animation of the buffer radius parameter B varying between 1 and 50. Clearly no buffer will give us the desired probability map. Poop.

Even with all of the axioms, all of the parameters, and all of the gosh-darn work we went through, our model is useless here. This raises the obvious question: exactly how applicable is Rossmo’s model?

The Crippling Issues

The real world is admittedly more complex than we make it out to be. Whether the criminal is misclassified, crimes are wrongly attributed to the series, or the killer has some special, perhaps deranged motivation, there are far too many opportunities for confounding variables to tamper with our results. Rossmo’s model even requires that the killer live in a more or less central urban location, for if he must travel in a specific direction to find victims, he will necessarily produce a skewed distribution of crime locations.

Of course, we need some metric by which to judge the accuracy of Rossmo’s model. One might propose the distance between the offender’s residence and the highest-probability cell on the map, but there are many other candidates. In particular, since the point of geographic profiling is to prioritize the search for a criminal’s residence, the best metric is likely the area searched before finding the residence. We call this metric search area. In other words, the search area is the amount of area on the map whose cells have values greater than or equal to that of the cell containing the actual residence. Indeed, Rossmo touts this as the only useful metric.

However, according to his own tests, the amount of area searched in the Sutcliffe case would be over a hundred square miles! In addition, Rossmo neither gives an idea of how much area is feasibly searchable, nor any global statistics on what percentage of the cases in his study resulted in a feasibly searchable area. So we attempt our own rough analysis here.

In a count of Rossmo’s data tables, out of the fifteen individual cases he studied, the average search area was 395 square kilometers, or 152.5 square miles, while the median was about 87 square kilometers, or 33.6 square miles. The maximum is 1829 square kilometers, while the min is 0.2 square kilometers. The complete table is contained in the Mathematica notebook on this blog’s Github page.

From the 1991 census data for Vancouver, we see that a low-density neighborhood has an average population of 2,380 individuals per square kilometer, or about 6,000 per square mile. Applying this to our numbers from the previous paragraph, we have a mean of 940,000 people investigated before the criminal is found, a median of 200,000, a max of four million (!), and a min of 309.
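To make the arithmetic explicit for the mean figure,

$ \displaystyle 395 \textup{ km}^2 \times 2{,}380 \textup{ people/km}^2 \approx 940{,}000 \textup{ people},$

and the median figure is computed the same way.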

Even basing our measurements on the median values, this method appears to be unfeasible as a sole means of search prioritization. Of course, real investigations go on a lot more data, including hunches, to focus their search. At best this could be a useful tool for police, but on the median, we believe it would be only marginally helpful to authorities in prioritizing their search efforts. For now, at least, good ol’ experience will likely prevail in hunting serial killers.

In addition, other researchers have tested human intuition at doing the same geographic profiling analysis, and they found that with a small bit of training (certainly no more than reading this blog post), humans showed no significant difference from computers at computing this model. (English, 2008) Of course, for the average human the “computing” process (via pencil and paper) was speedy and more variable, but for experienced professionals the margin of error would likely disappear. As interesting as this model may be, it seems the average case is more like Sutcliffe than Chase; Rossmo’s model is effectively a mathematical curiosity.

It appears, for now, that our friend Dexter Morgan is safe from the threat of discovery by computer search.

Alternative Models

The idea of a decay function is not limited to Rossmo’s particular equation. Indeed, one might naturally first expect the decay function to be logarithmic, normal, or even exponential. Such models do exist, and they are all deemed to be roughly equivalent in accuracy for appropriately tuned parameters (English, 2008). Furthermore, we include an implementation of a normal growth/decay function in the Mathematica notebook on this blog’s Github page. After reading that all of these models are roughly equivalent, we did not conduct an explicit analysis of the normal model. We leave that as an exercise for the reader, in order to become familiar with the code provided.
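For the curious, here is a sketch of what such a normally-distributed decay function might look like (this is our own illustration, not necessarily the code in the notebook): the value peaks at distance $ B$ from a crime site and falls off on either side, playing the same role as rossmoDecay above.

(* a normal "bump" centered at the buffer radius, in place of Rossmo's
   power-law decay; sigma controls how quickly it falls off *)
normalDecay[p1_, p2_, bufferLength_, sigma_, distance_] :=
  PDF[NormalDistribution[bufferLength, sigma], distance[p1, p2]];

makeNormalFunction[sites_, buffer_, sigma_] :=
  Function[{x, y},
   Total[Map[normalDecay[#, {x, y}, buffer, sigma, ManhattanDistance] &,
     sites]]];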

In addition, one could augment this model with other kinds of data. If the serial offender targets a specific demographic, then this model could be combined with demographic data to predict the sites of future attacks. It could be (and in some cases has been) weighted according to major roadways and freeways, which shrink a criminal’s mental model of the distance to a hunting ground. In other words, we could use the Google Maps “shortest trip” metric between any two points as the model’s distance metric. To our knowledge, this has not been implemented with any established mapping software. We imagine that such an implementation would be slow, but then again, a distributed network of computers computing the values for each cell in parallel would be quick.

Other Uses for the Model

In addition to profiling serial murders, we have read of other uses for this sort of geographic profiling model.

First, there is an established paper on the use of geographic profiling to describe the hunting patterns of great white sharks. Briefly, we recognize that such a model would switch from a taxicab metric to a standard Euclidean metric, since the movement space of the ocean is locally homeomorphic to three-dimensional Euclidean space. Indeed, we might also require a three-dimensional probability map for shark predation, since sharks may swim up or down to find prey. Furthermore, shark swimming patterns are likely not uniformly random in any direction, so such a model would have to be weighted to account for that.
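In our prototype, switching metrics is a one-line change. A purely illustrative two-dimensional version (ignoring the depth dimension and the weighting just mentioned) might look like:

(* hypothetical: the same decay summed with a Euclidean metric, as one
   might use for open water rather than city blocks *)
makeEuclideanRossmoFunction[sites_, buffer_, f_, g_] :=
  Function[{x, y},
   Total[Map[rossmoDecay[#, {x, y}, buffer, f, g, EuclideanDistance] &,
     sites]]];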

Finally, we haphazardly propose additional uses for this model: pinpointing the location of stationary artillery, locating terrorist base camps, finding the source of disease outbreaks, and profiling other minor serial-type criminals, like graffiti vandals.

Data! Data! My Kingdom for Some Data!

As recently as 2000, one researcher noted that the best source of geographic criminal data was newspaper archives. In the age of information, and given the recent popularity of geographic profiling research, this is a sad state of affairs. As far as we know, there are no publicly available indexes of geographic crime location data. As of the writing of this post, an inquiry to a group of machine learning specialists has produced no results. There doesn’t seem to be such a forum for criminology experts.

If any readers know of crime series data that is publicly available on the internet (likely used in some professor’s research and posted on their website), please don’t hesitate to leave a comment with a link. It would be greatly appreciated.

Google’s Page Rank – Why it Doesn’t Work Anymore

A Bully By Any Other Name

From the New York Times:

“Shopping online in late July, Clarabelle Rodriguez typed the name of her favorite eyeglass brand into Google’s search bar.

In moments, she found the perfect frames — made by a French company called Lafont — on a Web site that looked snazzy and stood at the top of the search results. Not the tippy-top, where the paid ads are found, but under those, on Google’s version of the gold-medal podium, where the most relevant and popular site is displayed.

Ms. Rodriguez placed an order for both the Lafonts and a set of doctor-prescribed Ciba Vision contact lenses on that site, DecorMyEyes.com. The total cost was $361.97.

It was the start of what Ms. Rodriguez would later describe as one of the most maddening and miserable experiences of her life…” [continue reading]

For those without the patience to read the seven-page article: the owner of DecorMyEyes.com, Vitaly Borker, deliberately mischarged his clients, and then bullied any who complained. Constantly dishing out vulgar threats, Borker committed wire fraud, impersonation, and stalking as part of his business strategy.

Of course, every angry customer went directly to online web forums and business review companies to complain. Unbeknownst to them, this was Borker’s hope. Every review posted about DecorMyEyes.com, no matter how negative, was another backlink to boost the site’s page rank. After enough negative press, sunglasses product pages from DecorMyEyes.com soon showed up even higher on search results than the websites of their designers!

Shortly after the New York Times article stirred up a serious conversation about DecorMyEyes’s business practices, the police opened an investigation, and Borker was promptly arrested.

The Honest Algorithm

This story illustrates a compelling point. Once Google released the information behind their ranking algorithm (of course, it happened before Google was even a company), people could take advantage of it! A staggeringly large number of services exist to boost your page rank (129 million results on Google search). And so the intended way to get a high page rank (others like your content and link to it) is superseded by some method of manufacturing backlinks.

Unfortunately for the rest of us, PageRank seems to be an honest algorithm. It only judges pages appropriately when the pages aren’t competing to be judged. Once Google became a popular option for search, rankings could make or break a start-up tech company. It was only the natural response to exploit PageRank and boost business, but of course this undermines the assumptions of the algorithm.

It is obvious at this point that while PageRank may be a critical component of Google’s overall ranking algorithm, it is certainly not the only factor Google considers. It’s plausible that Google has very many alternative ranking criteria which trump PageRank. Undoubtedly, Google was forced to come up with these criteria specifically to combat sites like DecorMyEyes.com, and to identify manufactured links.

And so this opens the floor for discussion: what alternative ranking systems would you consider? Can you think of easy ways to identify these maliciously manufactured links? This seems in general to be a hard problem, unless the pages providing the additional links are blatantly obvious.

Other than a potential discussion in the comments, that wraps up the series on PageRank. We hope the readers have enjoyed it!

Page Rank Series
An Introduction
A First Attempt
The Final Product
Why It Doesn’t Work Anymore

Linear Algebra – A Primer

Story Time

Linear algebra was founded around the same time as Calculus (think Leibniz, circa 1700) solely for the purpose of solving general systems of linear equations. The coefficients of a system were written in a grid form, with rows corresponding to equations and columns to the unknown variables. Using a computational tool called the determinant (an awkward, but computable formula involving only the coefficients of the equations in a system), researchers were able to solve these systems, opening a world of information about the positions of celestial bodies and large-scale measurements (of geodesic arcs) on the surface of the earth.

By the 1850s, Arthur Cayley was representing matrices as abstract objects. He defined matrix multiplication and nurtured matrix theory as its own field, recognizing a vast wealth of theoretical knowledge underlying the theory of determinants. Around the turn of the century, a formal system of vector algebra was invented which relied heavily on interpreting matrices as so-called linear transformations. Linear transformations are intuitively those maps of everyday space ($ \mathbb{R}^n$) which preserve “linear” things. Specifically, they send lines to lines, planes to planes, etc., and they preserve the origin (a map which does not preserve the origin is very similar, but has a different name; see Affine Transformation). Soon enough the mathematical focus shifted to the foundations of such an algebra, and later, with the advent of computers, to rapid calculations in one.

Motivations

Linear algebra sits at the crossroads of many areas of mathematics. Keeping close to its roots, linear algebra is primarily a tool for computation. Unsurprisingly, a huge chunk of mathematical research has gone into phrasing things in terms of matrices and their associated linear transformations. For instance, an undirected graph on $ n$ vertices can be modeled as a matrix of integer entries, with the $ i,j$ entry containing the number of edges from vertex $ i$ to vertex $ j$. This is called the adjacency matrix of a graph. Suddenly, a wealth of information about the graph translates to simple matrix computations. For instance, we can compute the number of paths of length $ m$ from one vertex to another as the appropriate entry of $ A^m$ (more formally, these are walks, which are allowed to repeat edges and vertices).
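For a quick example, take the path graph on three vertices $ 1 - 2 - 3$. Its adjacency matrix and the square of that matrix are

$ A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad A^2 = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}$,

and indeed the $ 2,2$ entry of $ A^2$ counts the two walks of length two from vertex 2 back to itself ($ 2-1-2$ and $ 2-3-2$), while the $ 1,3$ entry counts the single walk $ 1-2-3$.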

Even in advanced, purely theoretical mathematics, objects are commonly represented in terms of coordinates in some vector space, and are subsequently studied using all of the great things we know about linear transformations and their matrices. And so, without further ado, we will present the terminology and working concepts necessary for the content elsewhere in this blog.

Vector Spaces

The setting for all of linear algebra is in some vector space. Intuitively this is just a collection of objects, which we call vectors, with some rules on how you can combine vectors to get other vectors. This treatment wouldn’t do that idea justice without an axiomatic definition, so here it is.

Definition: A vector space is a quadruple $ (V, F, +, \cdot)$, where $ V$ is a set of vectors (points in our space), $ F$ is a scalar field (coefficients), $ +:V \times V \to V$ is a commutative, associative operation to combine vectors, and $ \cdot: F \times V \to V$ is an operation to “scale” vectors. In addition, we need the following properties to hold:

  • Addition and multiplication distribute (as we are used to with traditional algebra).
  • There must be an additive identity, which we call $ 0$, giving $ 0 + v = v$ for all $ v \in V$.
  • Every vector must have an additive inverse (every $ v$ has some $ w$ with $ v + w = 0$).

This is a lot to swallow at first, but it is general for a good reason: there are tons of different kinds of vector spaces! Many of these are surprising and counter-intuitive. For our purposes, however, we may stick with the nice, small vector spaces. So here is a simplified definition that will suffice:

Definition: A vector space is a set $ V$ of vectors which are fixed-length lists of real numbers $ (v_1, v_2, \dots , v_n) \in \mathbb{R}^n$, where addition between vectors is componentwise, we may scale vectors by any real number, and the following properties hold:

  • Addition and multiplication distribute (as above).
  • $ (0,0,0, \dots, 0)$ is the additive identity.
  • $ (-v_1, -v_2, \dots , -v_n)$ is the unique additive inverse of $ (v_1, v_2, \dots , v_n)$.

Hopefully this is much closer to what we usually think of as “vectors,” and with the understanding that we are viewing it as a vector space, we just call it $ \mathbb{R}^n$. The closure of the operations gives us a nice way to characterize “any combination” of vectors in a vector space.

Definition: A linear combination of vectors in a vector space $ V$ is the vector

$ a_1v_1 + a_2v_2 + \dots + a_kv_k$

for some positive integer $ k$, scalars $ a_i$, and vectors $ v_i$.

We may speak of the span of a set of vectors as the set of all possible linear combinations of those vectors. Furthermore, we call a set of vectors linearly independent if no vector in the list is in the span of the others. For example, $ (1,0,0), (0,1,0),$ and $ (0,0,1)$ are linearly independent in $ \mathbb{R}^3$. Specifically, $ (1,0,0)$ cannot be written as $ a(0,1,0) + b(0,0,1) = (0,a,b)$ for any scalars $ a,b \in F$, and similarly for the other two vectors.

As usual, we may describe subspaces of a vector space, which are just subsets of $ V$ which are themselves vector spaces with the inherited operations. The simplest examples of these are lines, planes, and hyperplanes through the origin in $ \mathbb{R}^n$. Consequently, we may identify $ \mathbb{R}^n$ as a subspace of $ \mathbb{R}^m$ for any $ n \leq m$.

One of the first things we want to ask about a vector space is “how big is it?” While most instances of vector spaces we will see have uncountably many elements, we can characterize “size” in terms of a different metric: the size of a basis.

Definition: A list of vectors $ (v_1, v_2, \dots v_n)$ is a basis for $ V$ if its elements are linearly independent, and their span is $ V$. The dimension of a vector space is the length of any basis.

For $ \mathbb{R}^n$, and similarly all finite-dimensional vector spaces, it is easy to prove that all bases have the same length, and hence dimension is well-defined. Further, $ \mathbb{R}^n$ admits a very natural basis, often called the standard basis:

$ e_1 = (1,0, \dots, 0)$
$ e_2 = (0,1, \dots, 0)$
$ \vdots$
$ e_n = (0,0, \dots, 1)$

These are best visualized as the coordinate axes in $ \mathbb{R}^n$, and they match our intuition as to what a basis should be, because any vector in $ \mathbb{R}^n$ can be broken down uniquely into a sum of scalar multiples of these unit coordinates. Indeed, this is true of any basis (due to linear independence). Given a fixed basis for $ V$, every vector $ v \in V$ may be uniquely written as a linear combination of basis vectors.
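For example, in $ \mathbb{R}^3$ we have $ (5, -2, 7) = 5e_1 - 2e_2 + 7e_3$, and no other choice of coefficients produces this vector.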

Linear Transformations and their Matrix Representations

Moving quickly toward the heart of linear algebra, we may speak of linear transformations (interchangeably, linear maps) between two vector spaces:

Definition: A function $ f : V \to W$ is a linear map if it preserves the operations of addition and scalar multiplication. In other words, for all $ v, w \in V, c \in F, f(v+w) = f(v)+f(w)$ and $ f(cv) = cf(v)$.

Examples are bountiful; some geometrically inspired ones include rotations about the origin, shears, and scalings. These are the sorts of functions you’d see in an image manipulation program like Photoshop. From this definition we can prove a few basic facts, such as the fact that every linear map sends $ 0$ to $ 0$ and additive inverses to additive inverses (try it as an exercise).

One remarkable fact that helps us characterize linear maps is that every linear map is determined completely by what it does to a basis. Since every vector $ x \in V$ is a linear combination of basis elements, say $ x=a_1v_1 + \dots + a_nv_n$, we see that a linear map plays nicely:

$ f(x) = f(a_1v_1 + \dots + a_nv_n) = a_1f(v_1) + \dots + a_nf(v_n)$

In other words, if we know what $ f$ does to a basis, then we know everything about $ f$. In order to aid our computations, we write what $ f$ does to each basis vector in a tabular form. To elaborate on the vague word “does,” we need to also fix a basis of our target vector space $ W$, say $ (w_1, \dots , w_m)$, and describe each $ f(v_i)$ in terms of this basis. We write it in tabular form, as follows:

$ \begin{pmatrix} | & | & \mathbf{ } & | \\ f(v_1) & f(v_2) & \dots & f(v_n) \\ | & | & \mathbf{ } & | \end{pmatrix}$

The $ j$th column corresponds to $ f(v_j)$, and the $ i$th row corresponds to the $ i$th coefficient in the expansion of $ f(v_j)$ in terms of the basis for $ W$. Here the vertical bars indicate that each element is a column of scalars. We will do an extended example to make this clear.

Consider the map $ f$ on $ \mathbb{R}^3$ defined as $ (x,y,z) \mapsto (y,x,2z+y)$. It is easy to check this map is linear, and using the standard basis we see that

$ f(1,0,0) = (0,1,0)$,
$ f(0,1,0) = (1,0,1)$, and
$ f(0,0,1) = (0,0,2)$.

or,

$ f(e_1) = e_2$, $ f(e_2) = e_1 + e_3$, and $ f(e_3) = 2e_3$.

Hence, the matrix representation of $ f$ with respect to the standard basis is

$ A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 2 \end{pmatrix}$

Now we see that if we take a (column) vector $ x$, and multiply it on the left by our matrix $ A$, the resulting vector is precisely the coordinate representation of $ f(x)$ with respect to the basis for $ W$. In fact, the rules for matrix multiplication were constructed very particularly so that this would be the case. In this way, we may arbitrarily switch between viewing $ f$ as a transformation and a vector computation. Compositions of linear maps translate to multiplication of two matrices, and matrix inversion (if it exists) is precisely function inversion.
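As a quick check, multiplying $ A$ by a generic column vector recovers $ f$:

$ \displaystyle \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} y \\ x \\ y + 2z \end{pmatrix}$,

which is exactly $ f(x,y,z) = (y, x, 2z + y)$ written as a column.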

Of course, there are many different bases we could have chosen. Even though we are going from $ \mathbb{R}^3 \to \mathbb{R}^3$, the column basis could be different from the row basis. Fortunately for our purposes, we are not going to consider what basis is appropriate to choose. All that matters is that fixing a basis, the matrix representation of a linear map is unique, and so we may interchange the notation freely. Even so, the truly interesting things about matrices are those properties which are true no matter which basis we prefer to use.

Eigenvectors and Eigenvalues

Definition: A scalar $ \lambda \in F$ is an eigenvalue for the linear map $ A$ if there exists a non-zero vector $ v \in V$ with $ Av = \lambda v$. Any such vector $ v$ which satisfies this equation is said to be an eigenvector of $ A$ corresponding to $ \lambda$.

Eigenvectors and eigenvalues have a huge number of applications, including facial recognition software, geology, quantum mechanics, and web search. So being able to find them quickly is of great significance to researchers and engineers. What’s interesting is that while eigenvectors depend on a choice of basis, eigenvalues do not. We prove this now:

Proposition: If $ A$ and $ B$ are different representations of the same linear map, then any eigenvalue of $ B$ is an eigenvalue of $ A$.

Proof. It turns out that the process of “changing a basis” boils down to matrix multiplication. Specifically, if $ A$ and $ B$ are two different matrix representations of the same linear map, then there exists an invertible matrix $ P$ such that $ A = PBP^{-1}$, or equivalently $ AP = PB$. As a result, if $ v$ is an eigenvector for $ B$ corresponding to the eigenvalue $ \lambda$, then $ APv = PBv = P(\lambda v) = \lambda Pv$, and so $ A(Pv) = \lambda(Pv)$. Since $ P$ is invertible and $ v \neq 0$, we have $ Pv \neq 0$, so $ Pv$ is an eigenvector for $ A$ corresponding to $ \lambda$ as well. This proves that eigenvalues are invariant with respect to a change of basis, as desired. $ \square$

The point of this is that we can choose whatever basis we want to work with, and compute the eigenvalues where we’re most comfortable. For instance, if we choose a basis that gives the following diagonal representation,

$ A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}$

then we can just eyeball that the eigenvalues are 1, 2, and 3. In fact, there are some very deep theorems in linear algebra that concern the existence and uniqueness of certain matrix representations. For a more in-depth treatment, see Axler, Linear Algebra Done Right. We will cover all the necessary information in the relevant posts, but until then, we are absolutely pooped from typing. Until next time!

Google’s PageRank – Introduction

Importance on the Web

As a society living in the “Information Age,” it comes as no surprise that we are faced with the task of sorting through vast oceans of content. With the admission that most content is actually junk, we must wisely choose the objects of our analysis. The appropriately named site UselessJunk.com certainly doesn’t deserve the same attention as the BBC World News page, and yet within the monstrous heart of the internet, it requires the maturity of the human psyche to discriminate their relative worth.

However, as with many real-world systems, there is a method to the madness. We have the following two important observations to inspire our quest:

  • The internet is a human construction, and hence
  • The internet intrinsically reflects some level of human interest in its content.

If the second statement is true, that the internet somehow reflects what information people want to absorb (or as is often these days, games they want to play, videos they want to watch, restaurants they want to find), then all we need to do is extract this information without human aid. In other words, by first picking out information that is generally desirable, we can reduce the black hole of available (and mostly useless) information to a manageable size for human consumption.

Search Engines

From a much sharper perspective, information selection is precisely what Google does for us every day. If there’s something we’d like to know more about, we type in the relevant search keywords, and out comes a list of pages sorted (more or less) in order by relevance. Of course, “relevance” is not quite the right term; “popularity,” and hence “authority,” is closer, but we will explain why this characterization makes more sense later. For now, we’ll provide an overview of what a search engine does:

A search engine has four main parts:

  • A crawler
  • An indexer
  • A ranking algorithm, and
  • A query engine

A web crawler methodically copies pages from the web into a database, keeping track of hyperlinks between pages. An indexer then revisits these copies, determining relevant keywords to optimize their later retrieval (leaning on immense amounts of work in cognitive psychology, linguistics, and mathematics). The pages are then ranked according to a particular ranking algorithm. And finally the user is provided with a query engine (the search bar) to access these records, which are displayed in order according to the ranking algorithm.

While each part above is a fascinating problem in itself, we will focus primarily on the third: the ranking algorithm. In its infancy, Google’s novel approach rested in its ranking algorithm, affectionately named PageRank after co-founder Larry Page (the other co-founder being Sergey Brin). Though it is not presented here in a readable way, this is (a condensed version of) the original paper presenting the Google search engine. Of course, Larry and Sergey needed much more elbow grease to make Google as slick as it is today, and likewise the paper delves into details on local data storage, parsing, chunking, and all sorts of error-handling mechanisms that are beyond the scope of this series. We will stick to investigating and proving the validity of the mathematical model underlying PageRank itself.

And so, the predominant questions throughout this series will be:

  • Which websites on the internet are worth my time? and, more importantly,
  • How can we extract this from the structure of the internet?

The answer to the second question lies in a jungle of fantastically interconnected maths. First, we’ll represent the structure of the web as a directed graph. Then, we’ll compute the importance of every web page simultaneously by solving a (very large) system of linear equations. While this sounds straightforward enough, there will be a number of bumps along the way to a satisfactory solution, and we will derive each twist and turn to bask in the elegance of the process.

For those of you who are unfamiliar with graph theory and linear algebra, we plan to provide additional (terse) primers for the necessary definitions and basic notions. Otherwise, look forward to the next part in this series, where we’ll construct PageRank for an internet with some strong hypotheses restricting its form. Until then!

Page Rank Series
An Introduction
A First Attempt
The Final Product
Why It Doesn’t Work Anymore