# k-Means Clustering and Birth Rates

A common problem in machine learning is to take some kind of data and break it up into “clumps” that best reflect how the data is structured. A set of points which are all collectively close to each other should be in the same clump.

A simple picture will clarify any vagueness in this:

Here the data consists of points in the plane. There is an obvious clumping of the data into three pieces, and we want a way to automatically determine which points are in which clumps. The formal name for this problem is the clustering problem. That is, these clumps of points are called clusters, and there are various algorithms which find a “best” way to split the data into appropriate clusters.

The important applications of this are inherently similarity-based: if our data comes from, say, the shopping habits of users of some website, we may want to target a group of shoppers who buy similar products at similar times, and provide them with a coupon for a specific product which is valid during their usual shopping times. Determining exactly who is in that group of shoppers (and more generally, how many groups there are, and what the features the groups correspond to) if the main application of clustering.

This is something one can do quite easily as a human on small visualizable datasets, but the usual the digital representation (a list of numeric points with some number of dimensions) doesn’t yield any obvious insights. Moreover, as the data becomes more complicated (be it by dimension increase, data collection errors, or sheer volume) the “human method” can easily fail or become inconsistent. And so we turn to mathematics to formalize the question.

In this post we will derive one possible version of the clustering problem known as the k-means clustering or centroid clustering problem, see that it is a difficult problem to solve exactly, and implement a heuristic algorithm in place of an exact solution.

And as usual, all of the code used in this post is freely available on this blog’s Github page.

## Partitions and Squared Deviations

The process of clustering is really a process of choosing a good partition of the data. Let’s call our data set $S$, and formalize it as a list of points in space. To be completely transparent and mathematical, we let $S$ be a finite subset of a metric space $(X,d)$, where $d$ is our distance metric.

Definition: We call a partition of a set $S$ a choice of subsets $A_1, \dots, A_n$ of $S$ so that every element of $S$ is in exactly one of the $A_i$.

A couple of important things to note about partitions is that the union of all the $A_i$ is $S$, and that any two $A_i, A_j$ intersect trivially. These are immediate consequences of the definition, and together provide an equivalent, alternative definition for a partition. As a simple example, the even and odd integers form a partition of the whole set of integers.

There are many different kinds of clustering problems, but every clustering problem seeks to partition a data set in some way depending on the precise formalization of the goal of the problem. We should note that while this section does give one of many possible versions of this problem, it culminates in the fact that this formalization is too hard to solve exactly. An impatient reader can safely skip to the following section where we discuss the primary heuristic algorithm used in place of an exact solution.

In order to properly define the clustering problem, we need to specify the desired features of a cluster, or a desired feature of the set of all clusters combined. Intuitively, we think of a cluster as a bunch of points which are all close to each other. We can measure this explicitly as follows. Let $A$ be a fixed subset of the partition we’re interested in. Then we might want to optimize the sum of all of the distances of pairs of points within $A$ to be a measure of it’s “clusterity.” In symbols, this would be

$\displaystyle \sum_{x \neq y \in A} d(x, y)$

If this quantity is small, then it says that all of the points in the cluster $A$ are close to each other, and $A$ is a good cluster. Of course, we want all clusters to be “good” simultaneously, so we’d want to minimize the sum of these sums over all subsets in the partition.

Note that if there are $n$ points in $A$, then the above sum involves $\choose{n}{2} \sim n^2$ distance calculations, and so this could get quite inefficient with large data sets. One of the many alternatives is to pick a “center” for each of the clusters, and try to minimize the sum of the distances of each point in a cluster from its center. Using the same notation as above, this would be

$\displaystyle \sum_{x \in A} d(x, c)$

where $c$ denotes the center of the cluster $A$. This only involves $n$ distance calculations, and is perhaps a better measure of “clusterity.” Specifically, if we use the first option and one point in the cluster is far away from the others, we essentially record that single piece of information $n - 1$ times, whereas in the second we only record it once.

The method we will use to determine the center can be very general. We could use one of a variety of measures of center, like the arithmetic mean, or we could try to force one of the points in $A$ to be considered the “center.” Fortunately, the arithmetic mean has the property that it minimizes the above sum for all possible choices of $c$. So we’ll stick with that for now.

And so the clustering problem is formalized.

Definition: Let $(X,d)$ be a metric space with metric $d$, and let $S \subset (X,d)$ be a finite subset. The centroid clustering problem is the problem of finding for any positive integer $k$ a partition $\left \{ A_1 ,\dots A_k \right \}$ of $S$ so that the following quantity is minimized:

$\displaystyle \sum_{i=1}^k\sum_{x \in A_i} d(x, c(A_i))$

where $c(A_i)$ denotes the center of a cluster, defined as the arithmetic mean of the points in $A_i$:

$\displaystyle c(A) = \frac{1}{|A|} \sum_{x \in A} x$

Before we continue, we have a confession to make: the centroid clustering problem is prohibitively difficult. In particular, it falls into a class of problems known as NP-hard problems. For the working programmer, NP-hard means that there is unlikely to be an exact solution to the problem which is better than trying all possible partitions.

We’ll touch more on this after we see some code, but the salient fact is that a heuristic algorithm is our best bet. That is, all of this preparation with partitions and squared deviations really won’t come into the algorithm design at all. Formalizing this particular problem in terms of sets and a function we want to optimize only allows us to rigorously prove it is difficult to solve exactly. And so, of course, we will develop a naive and intuitive heuristic algorithm to substitute for an exact solution, observing its quality in practice.

Lloyd’s Algorithm

The most common heuristic for the centroid clustering problem is Lloyd’s algorithm, more commonly known as the k-means clustering algorithm. It was named after its inventor Stuart Lloyd, a University of Chicago graduate and member of the Manhattan project who designed the algorithm in 1957 during his time at Bell Labs.

Heuristics tend to be on the simpler side, and Lloyd’s algorithm is no exception. We start by fixing a number of clusters $k$ and choosing an arbitrary initial partition $A = \left \{ A_1, \dots, A_k \right \}$. The algorithm then proceeds as follows:

repeat:
compute the arithmetic mean c[i] of each A[i]
construct a new partition B:
each subset B[i] is given a center c[i] computed from A
x is assigned to the subset B[i] whose c[i] is closest
stop if B is equal to the old partition A, else set A = B

Intuitively, we imagine the centers of the partitions being pulled toward the center of mass of the points in its currently assigned cluster, and then the points deciding selectively who to pull towards them. (Indeed, precisely because of this the algorithm may not always give sensible results, but more on that later.)

One who is in tune with their inner pseudocode will readily understand the above algorithm. But perhaps the simplest way to think about this algorithm is functionally. That is, we are constructing this partition-updating function $f$ which accepts as input a partition of the data and produces as output a new partition as follows: first compute the mean of centers of the subsets in the old partition, and then create the new partition by gathering all the points closest to each center. These are the fourth and fifth lines of the pseudocode above.

Indeed, the rest of the pseudocode is merely pomp and scaffolding! What we are really after is a fixed point of the partition-updating function $f$. In other words, we want a partition $P$ such that $f(P) = P$. We go about finding one in this algorithm by applying $f$ to our initial partition $A$, and then recursively applying $f$ to its own output until we no longer see a change.

Perhaps we should break away from traditional pseudocode illegibility and rewrite the algorithm as follows:

define updatePartition(A):
let c[i] = center(A[i])
return a new partition B:
each B[i] is given the points which are closest to c[i]

compute a fixed point by recursively applying
updatePartition to any initial partition.

Of course, the difference between these pseudocode snippets is just the difference between functional and imperative programming. Neither is superior, but the perspective of both is valuable in its own right.

And so we might as well implement Lloyd’s algorithm in two such languages! The first, weighing in at a whopping four lines, is our Mathematica implementation:

closest[x_, means_] :=
means[[First[Ordering[Map[EuclideanDistance[x, #] &, means]]]]];

partition[points_, means_] := GatherBy[points, closest[#, means]&];
updatePartition[points_, old_] := partition[points, Map[Mean, old]];

kMeans[points_, initialMeans_] := FixedPoint[updatePartition[points, #]&, partition[points, initialMeans]];

While it’s a little bit messy (as nesting 5 function calls and currying by hand will inevitably be), the ideas are simple. The “closest” function computes the closest mean to a given point $x$. The “partition” function uses Mathematica’s built-in GatherBy function to partition the points by the closest mean; GatherBy[L, f] partitions its input list L by putting together all points which have the same value under f. The “updatePartition” function creates the new partition based on the centers of the old partition. And finally, the “kMeans” function uses Mathematica’s built-in FixedPoint function to iteratively apply updatePartition to the initial partition until there are no more changes in the output.

Indeed, this is as close as it gets to the “functional” pseudocode we had above. And applying it to some synthetic data (three randomly-sampled Gaussian clusters that are relatively far apart) gives a good clustering in a mere two iterations:

Indeed, we rarely see a large number of iterations, and we leave it as an exercise to the reader to test Lloyd’s algorithm on random noise to see just how bad it can get (remember, all of the code used in this post is available on this blog’s Github page). One will likely see convergence on the order of tens of iterations. On the other hand, there are pathologically complicated sets of points (even in the plane) for which Lloyd’s algorithm takes exponentially long to converge to a fixed point. And even then, the solution is never guaranteed to be optimal. Indeed, having the possibility for terrible run time and a lack of convergence is one of the common features of heuristic algorithms; it is the trade-off we must make to overcome the infeasibility of NP-hard problems.

Our second implementation was in Python, and compared to the Mathematica implementation it looks like the lovechild of MUMPS and C++. Sparing the reader too many unnecessary details, here is the main function which loops the partition updating, a la the imperative pseudocode:

def kMeans(points, k, initialMeans, d=euclideanDistance):
oldPartition = []
newPartition = partition(points, k, initialMeans, d)

while oldPartition != newPartition:
oldPartition = newPartition
newMeans = [mean(S) for S in oldPartition]
newPartition = partition(points, k, newMeans, d)

return newPartition


We added in the boilerplate functions for euclideanDistance, partition, and mean appropriately, and the reader is welcome to browse the source code for those.

## Birth and Death Rates Clustering

To test our algorithm, let’s apply it to a small data set of real-world data. This data will consist of one data point for each country consisting of two features: birth rate and death rate, measured in annual number of births/deaths per 1,000 people in the population. Since the population is constantly changing, it is measured at some time in the middle of the year to act as a reasonable estimate to the median of all population values throughout the year.

The raw data comes directly from the CIA’s World Factbook data estimate for 2012. Formally, we’re collecting the “crude birth rate” and “crude death rate” of each country with known values for both (some minor self-governing principalities had unknown rates). The “crude rate” simply means that the data does not account for anything except pure numbers; there is no compensation for the age distribution and fertility rates. Of course, there are many many issues affecting the birth rate and death rate, but we don’t have the background nor the stamina to investigate their implications here. Indeed, part of the point of studying learning methods is that we want to extract useful information from the data without too much human intervention (in the form of ambient knowledge).

Here is a plot of the data with some interesting values labeled (click to enlarge):

Specifically, we note that there is a distinct grouping of the data into two clusters (with a slanted line apparently separating the clusters). As a casual aside, it seems that the majority of the countries in the cluster on the right are countries with active conflict.

Applying Lloyd’s algorithm with $k=2$ to this data results in the following (not quite so good) partition:

Note how some of the points which we would expect to be in the “left” cluster are labeled as being in the right. This is unfortunate, but we’ve seen this issue before in our post on k-nearest-neighbors: the different axes are on different scales. That is, death rates just tend to vary more wildly than birth rates, and the two variables have different expected values.

Compensating for this is quite simple: we just need to standardize the data. That is, we need to replace each data point with its deviation from the mean (with respect to each coordinate) using the usual formula:

$\displaystyle z = \frac{x - \mu}{\sigma}$

where for a random variable $X$, its (sample) expected value is $\mu$ and its (sample) standard deviation is $\sigma$. Doing this in Mathematica is quite easy:

Transpose[Map[Standardize, Transpose[L]]]

where L is a list containing our data. Re-running Lloyd’s algorithm on the standardized data gives a much better picture:

Now the boundary separating one cluster from the other is in line with what our intuition dictates it should be.

## Heuristics… The Air Tastes Bitter

We should note at this point that we really haven’t solved the centroid clustering problem yet. There is one glaring omission: the choice of $k$. This question is central to the problem of finding a good partition; a bad choice can yield bunk insights at best. Below we’ve calculated Lloyd’s algorithm for varying values of $k$ again on the birth-rate data set.

Lloyd’s algorithm processed on the birth-rate/death-rate data set with varying values of k between 2 and 7 (click to enlarge).

The problem of finding $k$ has been addressed by many a researcher, and unfortunately the only methods to find a good value for $k$ are heuristic in nature as well. In fact, many believe that to determine the correct value of $k$ is a learning problem in of itself! We will try not to go into too much detail about parameter selection here, but needless to say it is an enormous topic.

And as we’ve already said, even if the correct choice of $k$ is known, there is no guarantee that Lloyd’s algorithm (or any algorithm attempting to solve the centroid clustering problem) will converge to a global optimum solution. In the same fashion as our posts on cryptoanalysis and deck-stacking in Texas Hold ‘Em, the process of finding a minimum can converge to a local minimum.

Here is an example with four clusters, where each frame is a step, and the algorithm progresses from left to right (click to enlarge):

One way to alleviate the issues of local minima is the same here as in our other posts: simply start the algorithm over again from a different randomly chosen starting point. That is, as in our implementations above, our “initial means” are chosen uniformly at random from among the data set points. Alternatively, one may randomly partition the data (without respect to any center; each data point is assigned to one of the $k$ clusters with probability $1/k$). We encourage the reader to try both starting conditions as an exercise, and implement the repeated algorithm to return that output which minimizes the objective function (as detailed in the “Partitions and Squared Deviations” section).

And even if the algorithm will converge to a global minimum, it might not be the case that it does so efficiently. As we already mentioned, solving the problem of centroid clustering (even for a fixed $k$) is NP-hard. And so (assuming $\textup{P} \neq \textup{NP}$) any algorithm which converges to a global minimum will take exponentially long on some pathological inputs. The interested reader will see this exponentially slow convergence even in the case of k=2 for points in the plane (that is as simple as it gets).

These kinds of reasons make Lloyd’s algorithm and the centroid clustering problem a bit of a poster child of machine learning. In theory it’s difficult to solve exactly, but it has an efficient and widely employed heuristic used in practice which is often good enough. Moreover, since the exact solution is more or less hopeless, much of the focus has shifted to finding randomized algorithms which on average give solutions that are within some constant-factor approximation of the true minimum.

## A Word on Expectation Maximization

This algorithm shares quite a bit of features with a very famous algorithm called the Expectation-Maximization algorithm. We plan to investigate this after we spend some more time on probability theory on this blog, but the (very rough) idea is that the algorithm operates in two steps. First, a measure of “center” is chosen for each of a number of statistical models based on given data. Then a maximization step occurs which chooses the optimal parameters for those statistical models, in the sense that the probability that the data was generated by statistical models with those parameters is maximized. These statistical models are then used as the “old” statistical models whose centers are computed in the next step.

Continuing the analogy with clustering, one feature of expectation-maximization that makes it nice is it allows the sizes of the “clusters” to have varying sizes, whereas Lloyd’s algorithm tends to make its clusters have equal size (as we saw with varying values of $k$ in our birth-rates example above).

And so the ideas involved in this post are readily generalizable, and the applications extend to a variety of fields like image reconstruction, natural language processing, and computer vision. The reader who is interested in the full mathematical details can see this tutorial.

Until next time!

# The Cellular Automaton Method for Cave Generation

Dear reader, this post has an interactive simulation! We encourage you to play with it as you read the article below.

In our series of posts on cellular automata, we explored Conway’s classic Game of Life and discovered some interesting patterns therein. And then in our primers on computing theory, we built up a theoretical foundation for similar kinds of machines, including a discussion of Turing machines and the various computational complexity classes surrounding them. But cellular automata served us pretty exclusively as a toy. It was a basic model of computation, which we were interested in only for its theoretical universality. One wouldn’t expect too many immediately practical (and efficient) applications of something which needs a ridiculous scale to perform basic logic. In fact, it’s amazing that there are as many as there are.

In this post we’ll look at one particular application of cellular automata to procedural level generation in games.

An example of a non-randomly generated cave level from Bethesda’s The Elder Scrolls series.

## The Need for More Caves

Level design in video games is a time-consuming and difficult task. It’s extremely difficult for humans to hand-craft areas that both look natural and are simultaneously fun to play in. This is particularly true of the multitude of contemporary role-playing games modeled after Dungeons and Dragons, in which players move through a series of areas defeating enemies, collecting items, and developing their character. With a high demand for such games and so many levels in each game, it would save an unfathomable amount of money to have computers generate the levels on the fly. Perhaps more importantly, a game with randomly generated levels inherently has a much higher replay value.

The idea of randomized content generation (often called procedural generation) is not particularly new. It has been around at least since the 1980′s. Back then, computers simply didn’t have enough space to store large, complex levels in memory. To circumvent this problem, video game designers simply generated the world as the player moved through it. This opened up an infinitude of possible worlds for the user to play in, and the seminal example of this is a game called Rogue, which has since inspired series such as Diablo, Dwarf Fortress, and many many others. The techniques used to design these levels have since been refined and expanded into a toolbox of techniques which have become ubiquitous in computer graphics and game development.

We’ll explore more of these techniques in the future, but for now we’ll see how a cellular automaton can be used to procedurally generate two-dimensional cave-like maps.

## A Quick Review of Cellular Automata

While the interested reader can read more about cellular automata on this blog, we will give a quick refresher here.

For our purposes here, a 2-dimensional cellular automaton is a grid of cells $G$, where each cell $c \in G$ is in one of a fixed number of states, and has a pre-determined and fixed set of neighbors. Then $G$ is updated by applying a fixed rule to each cell simultaneously, and the process is repeated until something interesting happens or boredom strikes the observer. The most common kind of cellular automaton, called a ‘Life-like automaton,’ has only two states, ‘dead’ and ‘alive’ (for us, 0 and 1), and the rule applied to each cell is given as conditions to be ‘born’ or ‘survive’ based on the number of adjacent live cells. This is often denoted “Bx/Sy” where x and y are lists of single digit numbers. Furthermore, the choice of neighborhood is the eight nearest cells (i.e., including the diagonally-adjacent ones). For instance, B3/S23 is the cellular automaton rule where a cell is born if it has three living neighbors, and it survives if it has either two or three living neighbors, and dies otherwise. Technically, these are called ‘Life-like automata,’ because they are modest generalizations of Conway’s original Game of Life. We give an example of a B3/S23 cellular automaton initialized by a finite grid of randomly populated cells below. Note that each of the black (live) cells in the resulting stationary objects satisfy the S23 part of the rule, but none of the neighboring white (dead) cells satisfy the B3 condition.

A cellular automaton should really be defined for an arbitrary graph (or more generally, an arbitrary state space). There is really nothing special about a grid other than that it’s easy to visualize. Indeed, some cellular automata are designed for hexagonal grids, others are embedded on a torus, and still others are one- or three-dimensional. Of course, nothing stops automata from existing in arbitrary dimension, or from operating with arbitrary (albeit deterministic) rules, but to avoid pedantry we won’t delve into a general definition here. It would take us into a discussion of discrete dynamical systems (of which there are many, often with interesting pictures).

## It All Boils Down to a Simple Rule

Now the particular cellular automaton we will use for cave generation is simply B678/S345678, applied to a random initial grid with a fixed live border. We interpret the live cells as walls, and the dead cells as open space. This rule should intuitively work: walls will stay walls even if more cells are born nearby, but isolated or near-isolated cells will often be removed. In other words, this cellular automaton should ‘smooth out’ a grid arrangement to some extent. Here is an example animation quickly sketched up in Mathematica to witness the automaton in action:

An example cave generated via the automaton rule B678/S345678. The black cells are alive, and the white cells are dead.

As usual, the code to generate this animation (which is only a slight alteration to the code used in our post on cellular automata) is available on this blog’s Github page.

This map is already pretty great! It has a number of large open caverns, and they are connected by relatively small passageways. With a bit of imagination, it looks absolutely cavelike!

We should immediately note that there is no guarantee that the resulting regions of whitespace will be connected. We got lucky with this animation, in that there are only two disconnected components, and one is quite small. But in fact one can be left with multiple large caves which have no connecting paths.

Furthermore, we should note the automaton’s rapid convergence to a stable state. Unlike Conway’s Game of Life, in practice this automaton almost always converges within 15 steps, and this author has yet to see any oscillatory patterns. Indeed, they are unlikely to exist because the survival rate is so high, and our initial grid has an even proportion of live and dead cells. There is no overpopulation that causes cells to die off, so once a cell is born it will always survive. The only cells that do not survive are those that begin isolated. In a sense, B678/S345678 is designed to prune the sparse areas of the grid, and fill in the dense areas by patching up holes.

We should also note that the initial proportion of cells which are alive has a strong effect on the density of the resulting picture.  For the animation we displayed above, we initially chose that 45% of the cells would be live. If we increase that a mere 5%, we get a picture like the following.

A cave generated with the initial proportion of live cells equal to 0.5

As expected, there are many more disconnected caverns. Some game designers prefer a denser grid combined with heuristic methods to connect the caverns. Since our goal is just to explore the mathematical ideas, we will leave this as a parameter in our final program.

## Javascript Implementation, and Greater Resolution

One important thing to note is that B678/S345678 doesn’t scale well to fine grid sizes. For instance, if we increase the grid size to 200×200, we get something resembling an awkward camouflage pattern.

A 200×200 grid cave generation. Click the image to enlarge it.

What we really want is a way to achieve the major features of the low-resolution image on a larger grid. Since cellular automata are inherently local manipulations, we should not expect any modification of B678/S345678 to do this for us. Instead, we will use B678/345678 to create a low-resolution image, increase its resolution manually, and smooth it out with — you guessed it — another cellular automaton! We’ll design this automaton specifically for the purpose of smoothing out corners.

To increase the resolution, we may simply divide the cells into four pieces. The picture doesn’t change, but the total number of cells increases fourfold. There are a few ways to do this programmatically, but the way we chose simply uses the smallest resolution possible, and simulates higher resolution by doing block computations. The interested programmer can view our Javascript program available on this blog’s Github page to see this directly (or view the page source of this post’s interactive simulator).

To design a “smoothing automaton,” we should investigate more closely what we need to improve on in the above examples. In particular, once we increase the resolution, we will have a lot of undesirable convex and concave corners. Since a “corner” is simply a block satisfying certain local properties, we can single those out to be removed by an automaton. It’s easy to see that convex corners have exactly 3 live neighbors, so we should not allow those cells to survive. Similarly, the white cell just outside a concave corner has 5 live neighbors, so we should allow that cell to be born. On the other hand, we still want the major properties of our old B678/S345678 to still apply, so we can simply add 5 to the B part and remove 3 from the S part. Lastly, for empirical reasons, we also decide to kill off cells with 4 live neighbors.

And so our final “smoothing automaton” is simply B5678/S5678.

We present this application as an interactive javascript program. Some basic instructions:

• The “Apply B678/S345678″ button does what you’d expect: it applies B678/S345678 to the currently displayed grid. It iterates the automaton 20 times in an animation.
• The “Apply B5678/S5678″ button applies the smoothing automaton, but it does so only once, allowing the user to control the degree of smoothing at the specific resolution level.
• The “Increase Resolution” button splits each cell into four, and may be applied until the cell size is down to a single pixel.
• The “Reset” button resets the entire application, creating a new random grid.

We used this program to generate a few interesting looking pictures by varying the order in which we pressed the various buttons (it sounds silly, but it’s an exploration!). First, a nice cave:

An example of a higher resolution cave created with our program. In order to achieve similar results, First apply B678/S345678, and then alternate increasing the resolution and applying B5678/S5678 1-3 times.

We note that this is not perfect. There are some obvious and awkward geometric artifacts lingering in this map, mostly in the form of awkwardly straight diagonal lines and awkwardly flawless circles. Perhaps one might imagine the circles are the bases of stalactites or stalagmites. But on the whole, in terms of keeping the major features of the original automaton present while smoothing out corners, this author thinks B5678/S5678 has done a phenomenal job. Further to the cellular automaton’s defense, when the local properties are applied uniformly across the entire grid, such regularities are bound to occur. That’s just another statement of the non-chaotic nature of B5678/S5678 (in stark contrast to Conway’s Game of Life).

There are various modifications one could perform (or choose not to, depending on the type of game) to make the result more accessible for the player. For instance, one could remove all regions which fit inside a sufficiently small circle, or add connections between the disconnected components at some level of resolution. This would require some sort of connected-component labeling, which is a nontrivial task; current research goes into optimizing connected-component algorithms for large-scale grids. We plan to cover such topics on this blog in the future.

Another example of a cool picture we created with this application might be considered a more “retro” style of cave.

Apply S678/B345678 once, and increase the resolution as much as possible before applying B5678/S5678 as many times as desired.

We encourage the reader to play around with the program to see what other sorts of creations one can make. As of the time of this writing, changing the initial proportion of live cells (50%) or changing the automaton rules cannot be done in the browser; it requires one to modify the source code. We may implement the ability to control these in the browser given popular demand, but (of course) it would be a wonderful exercise for the intermediate Javascript programmer.

## Caves in Three Dimensions

It’s clear that this same method can be extended to a three-dimensional model for generating caverns in a game like Minecraft. While we haven’t personally experimented with three-dimensional cellular automata here on this blog, it’s far from a new idea. Once we reach graphics programming on this blog (think: distant future) we plan to revisit the topic and see what we can do.

Until then!

# Numerical Integration

## Rectangles, Trapezoids, and Simpson’s

I just wrapped up a semester of calculus TA duties, and I thought it would be fun to revisit the problem of integration from a numerical standpoint. In other words, the goal of this article is to figure out how fast we can approximate the definite integral of a function $f:\mathbb{R} \to \mathbb{R}$. Intuitively, a definite integral is a segment of the area between a curve $f$ and the $x$-axis, where we allow area to be negative when $f(x) < 0$.

If any of my former students are reading this, you should note that we will touch on quite a number of important ideas from future calculus courses, but until you see them in class, the proofs may seem quite mystical.

As usual, the source code for this post is available on this blog’s Github page.

## The Baby Integral

Let’s quickly recall the definition of an integral $\int_a^b f(x) dx$:

Definition: Let $P = \left \{ [a_i, b_i] \right \}_{i = 1}^n$ be a partition of an interval $[a,b]$ into $n$ sub-intervals, let $\Delta x_i = b_i - a_i$ be the length of each interval, and let $\widehat{x_i}$ be a chosen point inside $[a_i,b_i]$ (which we call a tag). Then a Riemann sum of $f$ from $a$ to $b$ is a sum

$\displaystyle R(f, P) = \sum_{i = 1}^n f(\widehat{x_i}) \Delta x_i$.

Geometrically, a Riemann sum approximates the area under the curve $f$ by using sufficiently small rectangles whose heights are determined by the tagged points. The terms in the sum above correspond to the areas of the approximating rectangles.

We note that the intervals in question need not have the same lengths, and the points $\widehat{x_i}$ may be chosen in any haphazard way one wishes. Of course, as we come up with approximations, we will pick the partition and tags very deliberately.

Definition: The integral $\displaystyle \int_a^b f(x) dx$ is the limit of Riemann sums as the maximum length of any sub-interval in the partition goes to zero. In other words, for a fixed partition $P$ let $\delta_P = \max_i(\Delta x_i)$. Then

$\displaystyle \int_a^b f(x) dx = \lim_{\delta_P \to 0}R(f, P)$

Another way to put this definition is that if you have any sequence of partitions $P_k$ so that $\delta_{P_k} \to 0$ as $k \to \infty$, then the integral is just the limit of Riemann sums for this particular sequence of partitions.

Our first and most naive attempt at computing a definite integral is to interpret the definition quite literally. The official name for it is a left Riemann sum. We constrain our partitions $P$ so that each sub-interval has the same length, namely $\Delta x = (b-a)/n$. We choose our tags to be the leftmost points in each interval, so that if we name each interval $[a_i, b_i]$, we have $\widehat{x_i} = a_i$. Then we simply use a large enough value of $n$, and we have a good approximation of the integral.

For this post we used Mathematica (gotta love those animations!), but the code to implement this is quite short in any language:

LeftRiemannSum[f_, n_, a_, b_] :=
Module[{width = (b-a)/n},
N[width * Sum[f[a + i*width], {i, 0, n-1}]]
];

Note that we may factor the constant “width” out of the sum, since here it does not depend on the interval. The only other detail is that Mathematica leaves all expressions as exact numbers, unless they are wrapped within a call to N[ ], which stands for “numerical” output. In most general languages numerical approximations are the default.

The computational complexity should be relatively obvious, as we require “one” computation per interval, and hence $n$ computations for $n$ intervals. (Really, it depends on the cost of evaluating $f(x)$, but for the sake of complexity we can assume computing each term is constant.) And so this algorithm is $O(n)$.

However, we should note that our concern is not necessarily computational complexity, but how fast the sum converges. In other words, we want to know how large $n$ needs to be before we get a certain number of decimal places of accuracy.

For all of our tests and visualizations, we will use the following arbitrarily chosen, but sufficiently complicated function on $[0, \pi]$:

$\displaystyle f(x) = 5 \cos(x) \sin^10(x) + \frac{1}{5} \cos^9(x) \exp{\sqrt{x}}$

Our test function.

For n = 15, we have the following left Riemann sum:

A left Riemann Sum with n = 15

Here’s an animation of the sum as $n \to 100$:

An animation of a left Riemann sum where n goes from 2 to 100

Unfortunately, it seems that on the regions where $f$ is increasing, that portion of the Riemann sum is an underestimate, and where $f$ is decreasing, the Riemann sum is an overestimate. Eventually this error will get small enough, but we note that even for $n = 10,000$, the sum requires almost 7 seconds to compute, and only achieves 3 decimal places of accuracy! From a numerical standpoint, left Riemann sums converge slower than paint dries and are effectively useless. We can certainly do better.

## More Geometric Intuition

Continuing with the geometric ideas, we could conceivably pick a better tag in each sub-interval. Instead of picking the left (or right) endpoint, why not pick the midpoint of each sub-interval? Then the rectangles will be neither overestimates nor underestimates, and hence the sums will be inherently more accurate. The change from a left Riemann sum to a midpoint Riemann sum is trivial enough to be an exercise for the reader (remember, the source code for this post is available on this blog’s Github page). We leave it as such, and turn to more interesting methods.

Instead of finding the area of a rectangle under the curve, let’s use a trapezoid whose endpoints are both on the curve. (Recall the area of a trapezoid, if necessary) We call this a trapezoid sum, and a first attempt at the code is not much different from the left Riemann sum:

TrapezoidSum[f_, n_, a_, b_] :=
Module[{width = (b-a)/n},
N[width * Sum[1/2 (f[a + i*width] + f[a + (i+1)*width]),
{i, 0, n-1}]]
];

Here is a picture for $n = 15$:

A trapezoid sum for n = 15

And an animation as $n \to 100$:

An animation of the trapezoid sum as n goes to 100

The animation hints that this method converges much faster than left Riemann sums, and indeed we note that for $n = 100$, the sum requires a mere .16 seconds, yet achieves the three decimal places of accuracy for which the left Riemann sum required $n = 10,000$. This method appears to be a drastic improvement, and indeed plotting the accuracy of left Riemann sums against trapezoid sums gives a nice indication:

Errors of the left Riemann sums (blue, positive) and the trapezoid sums (red, negative) for increasing values of n.

Now that is quite an improvement!

Going back to the code, the computational complexity is again $O(n)$, but we note that at a micro-efficiency level, we are being a bit wasteful. We call the function $f$ twice for each trapezoid, even when adjacent trapezoids share edges and hence base heights. If a call to $f$ is relatively expensive (and it just so happens that calls to Sin, Cos, Exp are somewhat expensive), then this becomes a significant issue. We leave it as an exercise to the adventurous reader to optimize the above code, so that no superfluous calls to $f$ are made (hint: surround the function $f$ with a cache, so that you can reuse old computations).

Before we move on to our final method for this post, we will take a no-so-short aside to give a proof of how accurate the trapezoid rule is. In fact, we will give an upper bound on the error of trapezoid sums based solely on $n$ and easy properties of $f$ to compute. In it’s raw form, we have the following theorem:

Theorem: Supposing $f: [a,b] \to \mathbb{R}$ is thrice differentiable, let $h = (b-a)/n$, let $a_i$ be $a + (i-1)h$, and let $T_n(f)$ be a trapezoidal approximation of $f$ with $n$ trapezoids, as above. Then

$\displaystyle T_n(f) - \int_{a}^b f(x) dx = \frac{b-a}{n} h^2 f''(c)$ for some $c \in [a,b]$

Proof. Let $\varphi_i : [0,h] \to \mathbb{R}$ be defined by

$\displaystyle \varphi_i(t) = \frac{t}{2}(f(a_i) + f(a_i + t)) - \int_{a_i}^{a_i + t} f(x) dx$

We claim that $\sum \limits_{i=1}^n \varphi_i(h) = T_n(f) - \int_a^b f(x) dx$, and one can see this by simply expanding the sum according to the definition of $\varphi_i$. Now we turn to the question of bounding $\varphi_i$ on $[0,h]$.

We note $\varphi_i(0) = 0$, and by the fundamental theorem of calculus:

$\displaystyle \varphi_i'(t) = \frac{1}{2}(f(a_i) - f(a_i + t)) + \frac{t}{2}f'(a_i + t)$

Furthermore, $\varphi_i'(0) = 0$ as is evident by the above equation, and differentiating again gives us

$\displaystyle \varphi_i''(t) = \frac{t}{2}f''(a_i + t)$

With again $\varphi_i''(0) = 0$.

As $f''$ is continuous, the extreme-value theorem says there exist bounds for $f$ on $[a,b]$. We call the lower one $m$ and the upper $M$, so that

$\displaystyle \frac{1}{2}mt \leq \varphi_i''(t) \leq \frac{1}{2}Mt$

Taking definite integrals twice on $[0,t]$, we get

$\displaystyle \frac{1}{12}mt^3 \leq \varphi_i(t) \leq \frac{1}{12}Mt^3$

Then the sum of all the $\varphi_i(h)$ may be bounded by

$\displaystyle \frac{n}{12}mh^3 \leq \sum \limits_{i=1}^n \varphi_i(h) \leq \frac{n}{12}Mh^3$

The definition of $h$ and some simplification gives

$\displaystyle \frac{b-a}{12}mh^2 \leq \sum \limits_{i=1}^n \varphi_i(h) \leq \frac{(b-a)}{12}Mh^2$

And from here we note that by continuity, $f''(x)$ obtains every value between its bounds $m, M$, so that for some $c \in [a,b], f''(c)$ obtains the value needed to make the middle term equal to $\frac{b-a}{12}h^2 f''(c)$, as desired.

As a corollary, we can bound the magnitude of the error by using the larger of $|m|, |M|$, to obtain a fixed value $B$ such that

$\displaystyle \left | T_n(f) - \int_a^b f(x) dx \right | \leq \frac{(b-a)}{12}h^2 B$

$\square$

So what we’ve found is that the error of our trapezoidal approximation (in absolute value) is proportional to the function $1/n^2$. There is a similar theorem about the bounds of a left Riemann sum, and we leave it as an exercise to the reader to find and prove it (hint: use a similar argument, or look at the Taylor expansion of $f$ at the left endpoint).

## Interpolating Polynomials and Simpson’s Rule

One way to interpret the left Riemann sum is that we estimate the integral by integrating a step function which is close to the actual function. For the trapezoidal rule we estimate the function by piecewise lines, and integrate that. Naturally, our next best approximation would be estimating the function by piecewise quadratic polynomials. To pick the right ones, we should investigate the idea of an interpolating polynomial.

Following the same pattern, given one point there is a unique constant function (degree zero polynomial) passing through that point. Given two points there is a unique line (degree one) which contains both. We might expect that given $n+1$ points in the plane (in general position), there is a unique degree $n$ polynomial passing through them.

For three points, $(x_1, y_1), (x_2, y_2), (x_3, y_3)$ we may concoct a working curve as follows (remember $x$ is the variable here)

$\displaystyle \frac{(x-x_1)(x-x_2)}{(x_3-x_1)(x_3-x_2)}y_3 + \frac{(x-x_1)(x-x_3)}{(x_2-x_1)(x_2-x_3)}y_2 + \frac{(x-x_2)(x-x_3)}{(x_1-x_2)(x_1-x_3)}y_1$

Notice that each of the terms are 2nd degree, and plugging in any one of the three given $x$ values annihilates two of the terms, and gives us precisely the right $y$ value in the remaining term. We may extend this in the obvious way to establish the interpolating polynomial for a given set of points. A proof of uniqueness is quite short, as if $p, q$ are two such interpolating polynomials, then $p-q$ has $n+1$ roots, but is at most degree $n$. It follows from the fundamental theorem of algebra that $p-q$ must be the zero polynomial.

If we wish to approximate $f$ with a number of these quadratics, we can simply integrate the polynomials which interpolate $(a_i, f(a_i)), (a_{i+1}, f(a_{i+1})), (a_{i+2}, f(a_{i+2}))$, and do this for $i = 1, 3, \dots, n-2$. This is called Simpson’s Rule, and it gives the next level of accuracy for numerical integration.

With a bit of algebra, we may write the integrals of the interpolating polynomials in terms of the points themselves. Without loss of generality, assume the three points are centered at 0, i.e. the points are $(- \delta, y_1), (0, y_2), (\delta, y_3)$. This is fine because shifting the function left or right does not change the integral. Then the interpolating polynomial is (as above),

$\displaystyle p(x) = \frac{(x+\delta)(x)}{2 \delta^2}y_3 + \frac{(x+\delta)(x-\delta)}{-\delta^2}y_2 + \frac{(x)(x-\delta)}{2 \delta^2}y_1$.

Integrating this over $[-\delta, \delta]$ and simplifying gives the quantity

$\displaystyle \int_{-\delta}^{\delta} p(x) dx = \frac{\delta}{3}(y_1 + 4y_2 + y_3)$.

And so we may apply this to each pair of sub-intervals (or work with the midpoints of our $n$ sub-intervals), to get our quadratic approximation of the entire interval. Note, though that as we range over all sub-intervals, the two endpoints will be counted twice, so our entire approximation is

$f(a) + 2f(a+\delta) + 4f(a+ 2\delta) + 2f(a + 3\delta) + \dots + 2f(a + (2n-1) \delta) + f(b)$

Translating this into code is as straightforward as one could hope:

SimpsonsSum[f_, n_, a_, b_] :=
Module[{width = (b-a)/n, coefficient},
coefficient[i_?EvenQ] = 2;
coefficient[i_?OddQ] = 4;
N[width/3 * (f[a] + f[b] +
Sum[coefficient[i] * f[a + i*width], {i, 1, n-1}])]
];

As usual, here is a picture for $n = 8$:

Simpson’s Rule for n=8 (4 interpolated quadratics)

And an animation showing the convergence:

An animation showing the convergence of Simpson’s Rule

There is a similar bound for Simpson’s Rule as there was for the trapezoid sums. Here it is:

Theorem: Supposing $f$ is five times differentiable, and let $B$ be the maximum of the values $|f^{(4)}(c)|$ for $c$ in $[a,b]$. Then the magnitude of the difference between the true integral $\int_a^b f(x) dx$ and the Simpson’s Rule approximation with $n$ interpolated polynomials is at most

$\displaystyle \frac{(b-a)^5}{180 n^4}B$

The proof is similar in nature to the proof for trapezoid sums, but requires an annoying amount of detail. It’s quite difficult to find a complete proof of the error estimate for reference. This is probably because it is a special case of a family of Newton-Cotes formulas. We leave the proof as a test of gumption, and provide a reference to a slightly more advanced treatment by Louis A. Talman.

As an easier cop-out, we show a plot of the convergence of Simpson’s Rule versus the trapezoid sums:

The error convergence of Simpson’s Rule (red, above), versus the trapezoid sums (blue, below) for increasing values of n.

Judging by the graph, the improvement from trapezoid sums to Simpson’s rule is about as drastic as the improvement from Riemann sums to trapezoid sums.

## Final Thoughts, and Future Plans

There are a host of other integral approximations which fall in the same category as what we’ve done here. Each is increasingly accurate, but requires a bit more computation in each step, and the constants involved in the error bound are based on larger and larger derivatives. Unfortunately, in practice it may hard to bound large derivatives of an arbitrary function, so confidence in the error bounds for simpler methods might be worth the loss of efficiency for some cases.

Furthermore, we always assumed that the length of each sub-interval was uniform across the entire partition. It stands to reason that some functions are wildly misbehaved in small intervals, but well-behaved elsewhere. Consider, for example, $\sin(1/x)$ on $(0, \pi)$. It logically follows that we would not need small intervals for the sub-intervals close to $\pi$, but we would need increasingly small intervals close to 0. We will investigate such methods next time.

We will also investigate the trials and tribulations of multidimensional integrals. If we require 100 evenly spaced points to get a good approximation of a one-dimensional integral, then we would require $100^2$ evenly spaced points for a two-dimensional integral, and once we start working on interesting problems in 36,000-dimensional space, integrals will require $100^{36,000}$ evenly spaced points, which is far greater than the number of atoms in the universe (i.e., far exceeds the number of bits available in all computers, and hence cannot be computed). We will investigate alternative methods for evaluating higher-dimensional integrals, at least one of which will be based on random sampling.

Before we close, we note that even today the question of how to approximate integrals is considered important research. Within the last twenty years there have been papers generalizing these rules to arbitrary spaces, and significant (“clever”) applications to the biological sciences. Here are two examples: Trapezoidal rules in Sobolev spaces, and  Trapezoidal rule and Simpson’s rule for “rough” continuous functions. Of course, as we alluded to, when dimension goes up integrals become exponentially harder to compute. As our world is increasingly filled with high-dimensional data, rapid methods for approximating  integrals in arbitrary dimension is worth quite a bit of money and fame.

Until next time!

# Hunting Serial Killers

## “Tonight’s the Night”

A large volume of research goes into the psychological and behavioral analysis of criminals. In particular, serial criminals hold a special place in the imagination and nightmares of the general public (at least, American public). Those criminals with the opportunity to become serial criminals are logical, cool-tempered, methodical, and, of course, dangerous. They walk among us in crowded city streets, or drive slowly down an avenue looking for their next victim. They are sometimes neurotic sociopaths, and other times amicable, charming models of society and business. But most of all, they know their craft well. They work slowly enough to not make mistakes, but fast enough to get the job done and feel good about it. Their actions literally change lives.

In other words, they would be good programmers. If only they all hadn’t given up trying to learn C++!

In all seriousness, a serial killer’s rigid methodology sometimes admits itself nicely to mathematical analysis. For an ideal serial criminal (ideal in being analyzable), we have the following two axioms of criminal behavior:

1. A serial criminal will not commit crimes too close to his base of operation.
2. A serial criminal will not travel farther than necessary to find victims.

The first axiom is reasonable because a good serial criminal does not want to arouse suspicion from his neighbors. The second axiom roughly describes an effort/reward ratio that keeps serial offenders from travelling too far away from their homes.

These axioms have a large amount of criminological research behind them. While there is little unifying evidence (the real world is far too messy), there are many bits and pieces supporting these claims. For example, the frequency of burglaries peak about a block from the offender’s residence, while almost none occur closer than a block (Turner, 1969, “Delinquency and distance”). Further, many serial rapists commit subsequent rapes (or rather, abductions preceding rape) within a half mile from the previous (LeBeau, 1987, “Patterns of stranger and serial rape offending”). There are tons of examples of these axioms in action in criminology literature.

On the other hand, there are many types of methodical criminals who do not agree with these axioms. Some killers murder while traveling the country, while others pick victims with such specific characteristics that they must hunt in a single location. So we take the following models with a grain of salt, in that they only apply to a certain class of criminal behavior.

With these ideas in mind, if we knew a criminal’s base of operation, we could construct a mathematical model of his “buffer zone,” inside of which he does not commit crime. With high probability, most of his crimes will lie just outside the buffer zone. This in itself is not useful in the grand scheme of crime-fighting. If we know a criminal’s residence, we need not look any further. The key to this model’s usefulness is working in reverse: we want to extrapolate a criminal’s residence from the locations of his crimes. Then, after witnessing a number of crimes we believe to be committed by the same person, we may optimize a search for the offender’s residence. We will use the geographic locations of a criminal’s activity to accurately profile him, hence the name, geoprofiling.

## Murder, She Coded

Historically, the first geoprofiling model was crafted by a criminologist named Dr. Kim Rossmo. Initially, he overlaid the crime locations on a sufficiently fine $n \times m$ grid of cells. Then, he uses his model to calculate the probability of the criminal’s residence lying within each cell. Rossmo’s formula is displayed below, and explained subsequently.

$\displaystyle P(x) = \sum \limits_{\textup{crime locations } c} \frac{\varphi}{d(x,c)^f} + \frac{(1-\varphi)B^{g-f}}{(2B-d(x,c))^g}$,

where $\varphi = 1$ if $d(x,c) > B, 0$ otherwise.

Here, $x$ is an arbitrary cell in the grid, $d(x,c)$ is the distance from a cell to a crime location, with some fixed metric $d$. The variable $\varphi$ determines which of the two summands to nullify based on whether the cell in question is in the buffer zone. $B$ is the radius of the buffer zone, and $f,g$ are formal empirically tuned parameters. Variations in $f$ and $g$ change the steepness of the decay curve before and after the buffer radius. We admit to have no idea why they need to be related, and cannot find a good explanation in Rossmo’s novel of a dissertation. Instead, Rossmo claims both parameters should be equal. For the purposes of this blog we find their exact values irrelevant, and put them somewhere between a half and two thirds.

This model reflects the inherent symmetry in the problem. If we may say that an offender commits a crime outside a buffer of some radius $B$ surrounding his residence, then we may also say that the residence is likely outside a buffer of the same radius surrounding each crime! For a fixed location, we may compute the probability of the offender’s residence being there with respect to each individual crime, and just sum them up.

This equation, while complete, has a better description for programmers, which is decidedly easier to chew in small bites:

Let d = d(x,c)
if d > B:
P(x) += 1/d^f
else:
P(x) += B^(f-g)/(2B-d)^g

Then we may simply loop this routine over over all such $c$ for a fixed $x$, and get our probability. Here we see the ideas clearly, that outside the buffer zone of the crime the probability of residence decreases with a power-law, and within the buffer zone it increases approaching the buffer.

Now, note that these “probabilities” are not, strictly speaking, probabilities, because they are not normalized in the unit interval $[0,1]$. We may normalize them if we wish, but all we really care about are the relative cell values to guide our search for the perpetrator. So we abuse the term “high probability” to mean “relatively high value.”

Finally, the distance metric we actually use in the model is the so-called taxicab metric. Since this model is supposed to be relevant to urban serial criminals (indeed, where the majority of cases occur), the taxicab metric more accurately describes a person’s mental model of distance within a city, because it accounts for roadways. Note that in order for this to work as desired, the map used must be rotated so that its streets lie parallel to the $x,y$ axes. We will assume for the rest of this post that the maps are rotated appropriately, as this is a problem with implementation and not the model itself or our prototype.

Rossmo’s model is very easy to implement in any language, but probably easiest to view and animate in Mathematica. As usual, the entire program for the examples presented here is available on this blog’s Github page. The decay function is just a direct translation of the pseudocode:

rossmoDecay[p1_, p2_, bufferLength_, f_, g_, distance_] :=
With[{d = distance[p1, p2]},
If[d > bufferLength,
1/(d^f),
(bufferLength^(g - f))/(2 bufferLength - d)^g]];

We then construct a function which computes the decay from a fixed cell for each crime site:

makeRossmoFunction[sites_, buffer_, f_, g_] :=
Function[{x, y},
Apply[Plus,
Map[rossmoDecay[#,{x,y},buffer,f,g,ManhattanDistance] &,
sites]]];

Now we may construct a “Rossmo function,” (initializing the parameters of the model), and map the resulting function over each cell in our grid:

Array[makeRossmoFunction[sites, 14, 1/3, 2/3], {60, 50}];

Here the Array function accepts a function $f$, and a specification of the dimensions of the array. Then each array index tuple is fed to $f$, and the resulting number is stored in the $i,j$ entry of the array. Here $f: \mathbb{Z}_{60} \times \mathbb{Z}_{50} \to \mathbb{R}^+$. We use as a test the following three fake crime sites:

sites = {{20, 25}, {47, 10}, {55, 40}};

Upon plotting the resulting array, we have the following pretty picture:

A test of Rossmo’s geographic profiling model on three points.

Here, the crime locations are at the centers of each of the diamonds, and cells with more reddish colors have higher values. Specifically, the “hot spot” for the criminal’s residence is in the darkest red spot in the bottom center of the image.

As usual, in order to better visualize the varying parameters, we have the following two animations:

A variation of the “f” parameter from 0.1 and 1.25 in steps of 0.05. “g” is fixed at 2/3.

Variation in the “g” parameter from 0.1 to 1.25 in steps of 0.05. “f” is fixed at 1/2.

Variation in the $B$ parameter simply increases or decreases the size of the buffer zone. In both animations above we have it fixed at 14 units.

Despite the pretty pictures, a mathematical model is nothing without empirical evidence to support it. Now, we turn to an analysis of this model on real cases.

## “Excellent!” I cried. “Elementary mathematics,” said he.

Richard Chase

The first serial killer we investigate is Richard Chase, also known as the Vampire of Sacramento. One of the creepiest murderers in recent history, Richard Chase believed he had to drink the blood of his victims in order to live. In the month of January 1978, Chase killed five people, dumping their mutilated bodies in locations near his home.

Before we continue with the geographic locations of this particular case, we need to determine which locations are admissible. For instance, we could analyze abduction sites, body drop sites, locations of weapons caches or even where the perpetrator’s car was kept. Unfortunately, many of these locations are not known during an investigation. At best only approximate abduction sites can be used, and stash locations are usually uncovered after an offender is caught.

For the sake of the Chase case, and subsequent cases, we will stick to the most objective data points: the body drop sites. We found this particular data in Rossmo’s dissertation, page 272 of the pdf document. Overlaid on a 30 by 30 grid, they are:

richardChaseSites =
{{3, 17}, {15, 3}, {19, 27}, {21, 22}, {25, 18}};
richardChaseResidence = {19,17};

Then, computing the respective maps, we have the following probability map:

The Rossmo probability map for the Richard Chase body drop sites. Here B = 5, f = 1/2, g = 1

If we overlay the location of Chase’s residence in purple, we see that it is very close to the hottest cell, and well-within the hot zone. In addition, we compare this with another kind of geoprofile: the center of gravity of the five sites. We color the center of gravity in black, and see that it is farther from Chase’s residence than the hot zone. In addition, we make the crime sites easy to see by coloring them green.

Additional data points: center of gravity in black, Chase’s residence in purple, and crime sites in green.

Albert DeSalvo

This is a great result for the model! Let us see how it fares on another case: Albert DeSalvo, the Boston strangler.

With a total of 13 murders and being suspected of over 300 sexual assault charges, DeSalvo is a prime specimen for analysis. DeSalvo entered his victim’s homes with a repertoire of lies, including being a maintenance worker, the building plumber, or a motorist with a broken-down car. He then proceeded to tie his victims to a bed, sexually assault them, and then strangle them with articles of clothing. Sometimes he tied a bow to the cords he strangled his victims with.

We again use the body drop sites, which in this case are equivalent to encounter sites. They are:

deSalvoSites = {{10, 48}, {13, 8}, {15, 11}, {17, 8}, {18, 7},
{18, 9}, {19, 4}, {19, 8}, {20, 9}, {20, 10}, {20, 11},
{29, 23}, {33, 28}};
deSalvoResidence = {19,18};

Running Rossmo’s model again, including the same extra coloring as for the Chase murders, we get the following picture:

The Rossmo probability map for Albert DeSalvo’s murders. Here B=10, f= 1/2, g = 1.

Again, we win. DeSalvo’s residence falls right in the darker of our two main hot zones. With this information, the authorities would certainly apprehend him in a jiffy. On the other hand, the large frequency of murders in the left-hand side pulls the center of gravity too close. In this way we see that the center of gravity is not a good “measure of center” for murder cases. Indeed, it violates the buffer principle, which holds strong in these two cases.

Peter Sutcliffe

Finally, we investigate Peter Sutcliffe, more infamously known as the Yorkshire Ripper. Sutcliffe murdered 13 women and attacked at least 6 others between 1975 and 1980 in the county of West Yorkshire, UK. He often targeted prostitutes, hitting them over the head with a hammer and proceeding to sexually molest and mutilate their bodies. He was finally caught with a prostitute in his car, but was not initially thought to be the Yorkshire Ripper until after police returned to the scene of his arrest to look for additional evidence. They found his murder weapons, and promptly prosecuted him.

We list his crime locations below. Note that these include body drop sites and the attack sites for non-murders, which were later reported to the police.

sutcliffeSites = {{5, 1}, {8, 7}, {50, 99}, {53, 68}, {56, 72},
{59, 59}, {62, 57}, {63, 85}, {63, 87}, {64, 83}, {69, 82},
{73, 88}, {80, 88}, {81, 89}, {83, 88}, {83, 87}, {85, 85},
{85, 83}, {90, 90}};
sutcliffeResidences = {{60, 88}, {58, 81}};

Notice that over the course of his five-year spree, he lived in two residences. One of these he moved to with his wife of three years (he started murdering after marrying his wife). It is unclear whether this changed his choice of body drop locations.

Unfortunately, our attempts to pinpoint Sutcliffe’s residence with Rossmo’s model fail miserably. With one static image, guessing at the buffer radius, we have the following probability map:

Failure in the form of a probability map.

As we see, both the center of gravity and the hot zones are far from either of Sutcliffe’s residences. Indeed, even with a varying buffer radius, we are still led to search in unfruitful locations:

An animation of the buffer radius parameter B varying between 1 and 50. Clearly no buffer will give us the desired probability map. Poop.

Even with all of the axioms, all of the parameters, all of the gosh-darn work we went through! Our model is useless here. This raises the obvious question, exactly how applicable is Rossmo’s model?

## The Crippling Issues

The real world is admittedly more complex than we make it out to be. Whether the criminal is misclassified, bad data is attributed, or the killer has some special, perhaps deranged motivation, there are far too many opportunities for confounding variables to tamper with our results. Rossmo’s model even requires that the killer live in a more or less central urban location, for if he must travel in a specific direction to find victims, he may necessarily produce a skewed distribution of crime locations.

Indeed, we have to have some metric by which to judge the accuracy of Rossmo’s model. While one might propose the distance between the offender’s residence and the highest-probability area produced on the map, there are many others. In particular, since the point of geographic profiling is to prioritize the search for a criminal’s residence, the best metric is likely the area searched before finding the residence. We call this metric search area. In other words, search area is the amount of area on the map which has probability greater or equal to the cell containing the actual residence. Indeed Rossmo touts this metric as the only useful metric.

However, according to his own tests, the amount of area searched on the Sutcliffe case would be over a hundred square miles! In addition, Rossmo neither provides an idea of what amount of area is feasibly searchable, nor any global statistics on what percentage of cases in his study resulted in an area that was feasibly searchable. We postulate our own analysis here.

In a count of Rossmo’s data tables, out of the fifteen individual cases he studied, the average search area was 395 square kilometers, or 152.5 square miles, while the median was about 87 square kilometers, or 33.6 square miles. The maximum is 1829 square kilometers, while the min is 0.2 square kilometers. The complete table is contained in the Mathematica notebook on this blog’s Github page.

From the 1991 census data for Vancouver, we see that a low density neighborhood has an average population of 2,380 individuals per square kilometer, or about 6,000 per square mile. Applying this to our numbers from the previous paragraph, we have a mean of 940,000 people investigated before the criminal is found, a median of 200,000, a max of four million (!), and a min of 309.

Even basing our measurements on the median values, this method appears to be unfeasible as a sole means of search prioritization. Of course, real investigations go on a lot more data, including hunches, to focus search. At best this could be a useful tool for police, but on the median, we believe it would be marginally helpful to authorities prioritize their search efforts. For now, at least, good ol’ experience will likely prevail in hunting serial killers.

In addition, other researchers have tested human intuition at doing the same geographic profiling analysis, and they found that with a small bit of training (certainly no more than reading this blog post), humans showed no significant difference from computers at computing this model. (English, 2008) Of course, for the average human the “computing” process (via pencil and paper) was speedy and more variable, but for experienced professionals the margin of error would likely disappear. As interesting as this model may be, it seems the average case is more like Sutcliffe than Chase; Rossmo’s model is effectively a mathematical curiosity.

It appears, for now, that our friend Dexter Morgan is safe from the threat of discovery by computer search.

## Alternative Models

The idea of a decay function is not limited to Rossmo’s particular equation. Indeed, one might naturally first expect the decay function to be logarithmic, normal, or even exponential. Indeed, such models do exist, and they are all deemed to be roughly equivalent in accuracy for appropriately tuned parameters. (English, 2008) Furthermore, we include an implementation of a normal growth/decay function in the Mathematica notebook on this blog’s Github page. After reading that all of these models are roughly equivalent, we did not conduct an explicit analysis of the normal model. We leave that as an exercise to the reader, in order to become familiar with the code provided.

In addition, one could augment this model with other kinds of data. If the serial offender targets a specific demographic, then this model could be combined with demographic data to predict the sites of future attacks. It could be (and in some cases has been) weighted according to major roadways and freeways, which reduce a criminals mental model of distance to a hunting ground. In other words, we could use the Google Maps “shortest trip” metric between any two points as its distance metric. To our knowledge, this has not been implemented with any established mapping software. We imagine that such an implementation would be slow; but then again, a distributed network of computers computing the values for each cell in parallel would be quick.

## Other Uses for the Model

In addition to profiling serial murders, we have read of other uses for this sort of geographic profiling model.

First, there is an established paper on the use of geographic profiling to describe the hunting patterns of great white sharks. Briefly, we recognize that such a model would switch from a taxicab metric to a standard Euclidean metric, since the movement space of the ocean is locally homeomorphic to three-dimensional Euclidean space. Indeed, we might also require a three-dimensional probability map for shark predation, since sharks may swim up or down to find prey. Furthermore, shark swimming patterns are likely not uniformly random in any direction, so this model is weighted to consider that.

Finally, we haphazardly propose additional uses for this model: pinpointing the location of stationary artillery, locating terrorist base camps, finding the source of disease outbreaks, and profiling other minor serial-type criminals, like graffiti vandalists.

## Data! Data! My Kingdom for Some Data!

As recent as 2000, one researcher noted that the best source of geographic criminal data was newspaper archives. In the age of information, and given the recent popularity of geographic profiling research, this is a sad state of being. As far as we know, there are no publicly available indexes of geographic crime location data. As of the writing of this post, an inquiry to a group of machine learning specialists has produced no results. There doesn’t seem to be such a forum for criminology experts.

If any readers have information to crime series data that is publicly available on the internet (likely used in some professor’s research and posted on their website), please don’t hesitate to leave a comment with a link. It would be greatly appreciated.