Infinitely Many Primes (Using Topology)

Problem: Prove there are infinitely many prime numbers.

Solution: First recall that an arithmetic progression with difference d is a sequence of integers a_n \subset \mathbb{Z} so that for every pair a_k, a_{k+1} the difference a_{k+1} - a_k = d.

We proceed be defining a topology on the set of integers by defining a basis B of unbounded (in both directions) arithmetic progressions. That is, an open set in this topology is an arbitrary union of arithmetic progressions from -\infty to \infty. To verify this makes sense, we need only check that B is a basis. Indeed, B covers \mathbb{Z} since \mathbb{Z} is itself an arithmetic progression with difference 1. And any two arithmetic progressions a_n, b_n have intersection which is again an arithmetic progression: if the two progressions have equal difference then the intersection is empty or a_n = b_n, and if their differences are d_1 \neq d_2, then the intersection has difference \textup{lcm}(d_1, d_2). (We discussed this idea in a different context in our post on design with prime numbers.)

Now we notice that the basis sets are both open (by definition) and closed. The latter follows from the fact that there are only finitely many shifts of an arithmetic sequence. That is, if a_n is an arithmetic sequence with difference d, then its complement is the union of all a_n + k where k ranges from 1 to d-1 (and each of these sets are open, so the union is open as well).

Further we notice that no finite set can be open. Indeed, since the arithmetic sequences which form our basis are all infinite, and any open set is a union of sets in a basis, every open set in this topology must be infinite.

Now for a prime p, let U_p be the arithmetic progression with difference p containing zero, and let U be the union of all U_p. Supposing there were only finitely many primes, U is a finite union of closed sets and hence closed in this topology. But the only integers which are not multiples of some prime are \pm 1. So \mathbb{Z} - U = \{ 1, -1 \}. Since U is closed, its complement is open, but as we mentioned above no finite sets are open. This is a contradiction, and so there must be infinitely many primes. \square

Discussion: Even though the definition of a topology is basic knowledge for a mathematician, this proof shows that a topology alone can carry a lot of interesting structure. Even more so, the ability to pick a topology on a space to suit one’s needs is a powerful tool. It reminds this author of the more elementary idea of picking colorings on a chessboard. Indeed, the entire field of algebraic geometry lives in the setting of one particular kind of topology on real or projective space. As one might expect, the open sets in a custom topology must be related to the task at hand. In the case of algebraic geometry, they’re the solution sets of families of polynomials in finitely many variables. Considering how important polynomials are in mathematics, one can imagine how complex and rich the resulting theories are. We plan to cover both basic topology and algebraic geometry in the future of this blog.

About these ads

K-Nearest-Neighbors and Handwritten Digit Classification

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

The Axiom of Neighborliness: Let (X, d) be a metric space and let S \subset X be a finite set whose elements are classified by some function f : S \to \left \{ 1, 2, \dots, m \right \}. We say that S satisfies the axiom of neighborliness if for every point x \in S, if y is the closest point to x, then f(x) = f(y). That is, y shares the same class as x if y is the nearest neighbor to x.

For a more in-depth discussion of metrics, the reader should refer to this blog’s primer on the topic. For the purpose of this post and all foreseeable posts, X will always be \mathbb{R}^n for some n, while the metric d will vary.

This axiom is actually a very strong assumption which is certainly not true of every data set. In particular, it highly depends on the problem setup. Having the wrong kinds or the wrong number of features, doing an improper conversion, or using the wrong metric can all invalidate the assumption even if the problem inherently has the needed structure. Luckily, for real-world applications we only need the data to adhere to the axiom of neighborliness in approximation (indeed, in practice the axiom is only verifiable in approximation). Of course, what we mean by “approximation” also depends on the problem and the user’s tolerance for error. Such is the nature of applied mathematics.

Once we understand the axiom, the machine learning “algorithm” is essentially obvious. For training, store a number of data points whose classes are known and fix a metric. To determine the class of an unknown data point, simply use the most common class of its nearest neighbors. As one may vary (as a global parameter) the number of neighbors one considers, this method is intuitively called k-nearest-neighbors.

The Most Basic Way to Learn: Copy Your Neighbors

Let’s iron out the details with a program and test it on some dummy data. Let’s construct a set of points in \mathbb{R}^2 which manifestly satisfies the axiom of neighborliness. To do this, we’ll use Python’s random library to make a dataset sampled from two independent normal distributions.

import random

def gaussCluster(center, stdDev, count=50):
    return [(random.gauss(center[0], stdDev),
             random.gauss(center[1], stdDev)) for _ in range(count)]

def makeDummyData():
    return gaussCluster((-4,0), 1) + gaussCluster((4,0), 1)

The first function simply returns a cluster of points drawn from the specified normal distribution. For simplicity we equalize the covariance of the two random variables. The second function simply combines two clusters into a data set.

To give the dummy data class “labels,” we’ll simply have a second list that we keep alongside the data. The index of a data point in the first list corresponds to the index of its class label in the second. There are likely more elegant ways to organize this, but it suffices for now.

Implementing a metric is similarly straightforward. For now, we’ll use the standard Euclidean metric. That is, we simply take the sum of the squared differences of the coordinates of the given two points.

import math

def euclideanDistance(x,y):
    return math.sqrt(sum([(a-b)**2 for (a,b) in zip(x,y)]))

To actually implement the classifier, we create a function which itself returns a function.

import heapq

def makeKNNClassifier(data, labels, k, distance):
    def classify(x):
        closestPoints = heapq.nsmallest(k, enumerate(data),
                                        key=lambda y: distance(x, y[1]))
        closestLabels = [labels[i] for (i, pt) in closestPoints]
        return max(set(closestLabels), key=closestLabels.count)

    return classify

There are a few tricky things going on in this function that deserve discussion. First and foremost, we are defining a function within another function, and returning the created function. The important technical point here is that the created function retains all local variables which are in scope even after the function ends. Specifically, you can call “makeKNNClassifier” multiple times with different arguments, and the returned functions won’t interfere with each other. One is said to close over the values in the environment, and so this programming language feature is called a function closure or just a closure, for short. It allows us, for instance, to keep important data visible while hiding any low-level data it depends on, but which we don’t access directly. From a high level, the decision function entirely represents the logic of the program, and so this view is justified.

Second, we are using some relatively Pythonic constructions. The first line of “classify” uses of heapq to pick the k smallest elements of the data list, but in addition we use “enumerate” to preserve the index of the returned elements, and a custom key to have the judgement of “smallest” be determined by the custom distance function. Note that the indexed “y[1]” in the lambda function uses the point represented by “y” and not the saved index.

The second line simply extracts a list of the labels corresponding each of the closest points returned by the call to “nsmallest.” Finally, the third line returns the maximum of the given labels, where a label’s weight (determined by the poorly named “key” lambda) is its frequency in the “closestLabels” list.

Using these functions is quite simple:

trainingPoints = makeDummyData() # has 50 points from each class
trainingLabels = [1] * 50 + [2] * 50  # an arbitrary choice of labeling

f = makeKNNClassifier(trainingPoints, trainingLabels, 8, euclideanDistance)
print f((-3,0))

The reader may fiddle around with this example as desired, but we will not pursue it further. As usual, all code used in this post is available on this blog’s Google code page. Let’s move on to something more difficult.

Handwritten Digits

One of the most classic examples in the classification literature is in recognizing handwritten digits. This originally showed up (as the legend goes) in the context of the United States Postal Service for the purpose of automatically sorting mail by the zip code of the destination. Although this author has no quantitative proof, the successful implementation of a scheme would likely save an enormous amount of labor and money. According to the Postal Facts site, there are 31,509 postal offices in the U.S. and, assuming each one processes mail, there is at least one employee at each office who would spend some time sorting by zip code. Given that the USPS processes 23 million pieces of mail per hour, a conservative estimate puts each office spending two hours of labor per day on sorting mail by zip code (resulting in a very rapid pace of 146.52 pieces of mail sorted per minute per worker). At a lower bound of $18/hr this amounts to a cost of $1,134,324 per day, or over 400 million dollars per year. Put in perspective, in one year the amount of money saved equals the entire two-year tuition of Moraine Valley Community College for 68,000 students (twice the current enrollment).

In short, the problem of sorting mail (and of classifying handwritten digits) begs to be automated, and indeed it has been to some degree for about four decades. Let’s see how k-nearest-neighbors fares in this realm.

We obtain our data from the UCI machine learning repository, and with a few minor modifications, we present it on this blog’s Google Code page (along with the rest of the code used in this post). A single line of the data file represents a handwritten digit and its label. The digit is a 256-element vector obtained by flattening a 16×16 binary-valued image in row-major order; the label is an integer representing the number in the picture. The data file contains 1593 instances with about 160 instances per digit.

In other words, our metric space is \left \{ 0,1 \right \}^{256}, and we choose the Euclidean metric for simplicity. With the line wrapping to better display the “image,” one line from the data file looks like:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6

After reading in the data appropriately, we randomly split the data set into two pieces, train on one piece, and test on the other. The following function does this,  returning the success rate of the classification algorithm on the testing piece.

import knn
import random

def column(A, j):
   return [row[j] for row in A]

def test(data, k):
   random.shuffle(data)
   pts, labels = column(data, 0), column(data, 1)

   trainingData = pts[:800]
   trainingLabels = labels[:800]
   testData = pts[800:]
   testLabels = labels[800:]

   f = knn.makeKNNClassifier(trainingData, trainingLabels,
                             k, knn.euclideanDistance)
   correct = 0
   total = len(testLabels)

   for (point, label) in zip(testData, testLabels):
      if f(point) == label:
         correct += 1

   return float(correct) / total

A run with k=1 gives a surprisingly good 89% success rate. Varying k, we see this is about as good as it gets without any modifications to the algorithm or the metric. Indeed, the graph below shows that the handwritten digits data set agrees with the axiom of neighborliness to a fair approximation.

A graph of classification accuracy against k for values of k between 1 and 50. The graph clearly shows a downward trend as k increases, but all values k < 10 are comparably good.

Of course, there are many improvements we could make to this naive algorithm. But considering that it utilizes no domain knowledge and doesn’t manipulate the input data in any way, it’s not too shabby.

As a side note, it would be fun to get some tablet software and have it use this method to recognize numbers as one writes it. Alas, we have little time for these sorts of applications.

Advantages, Enhancements, and Problems

One reason k-nearest-neighbors is such a common and widely-known algorithm is its ease of implementation. Indeed, we implemented the core algorithm in a mere three lines of Python. On top of that, k-nearest-neighbors is pleasingly parallel, and inherently flexible. Unlike the Perceptron algorithm, which relies on linear separability, k-nearest-neighbors and the axiom of neighborliness allow for datasets with many different geometric structures. These lecture notes give a good example, as shown below, and the reader can surely conjure many more.

k-nearest-neighbors applied to a data set organized in concentric circles.

And of course, the flexibility is even greater by virtue of being able to use any metric for distance computations. One may, for instance, use the Manhattan metric if the points in question are locations in a city. Or if the data is sequential, one could use the dynamic time warping distance (which isn’t truly a metric, but is still useful). The possibilities are only limited by the discovery of new and useful metrics.

With such popularity, k-nearest-neighbors often comes with a number of modifications and enhancements. One enhancement is to heuristically remove certain points close to the decision boundary. This technique is called edited k-nearest-neighbors. Another is to weight certain features heavier in the distance computations, which requires one to programmatically determine which features help less with classification. This is getting close to the realm of a decision tree, and so we’ll leave this as an exercise to the reader.

The next improvement has to do with runtime. Given n training points and d features (d for dimension), one point requires O(nd) to classify. This is particularly expensive, because most of the distance computations performed are between points that are far away, and as k is usually small, they won’t influence in the classification.

One way to alleviate this is to store the data points in a data structure called a k-d tree. The k-d tree originated in computational geometry in the problem of point location. It partitions space into pieces based on the number of points in each resulting piece, and organizes the partitions into a tree. In other words, it will partition tightly where the points are dense, and loosely where the points are sparse. At each step of traversing the tree, one can check to see which sub-partition the unclassified point lies in, and descend appropriately. With certain guarantees, this reduces the computation to O(\log(n)d). Unfortunately, there are issues with large-dimensional spaces that are beyond the scope of this post. We plan to investigate k-d trees further in a future series on computational geometry.

The last issue we consider is in data scaling. Specifically, one needs to be very careful when converting the real world data into numerical data. We can think of each of the features as a random variable, and we want all of these random variables to have comparable variation. The reason is simply because we’re using spheres. One can describe k-nearest-neighbors as finding the smallest (filled-in) sphere centered at the unlabeled point containing k labeled data points, and using the most common of those labels to classify the new point. Of course, one can talk about “spheres” in any metric space; it’s just the set of all points within some fixed distance from the center (and this definition doesn’t depend on the dimension of the space). The important point is that a sphere has uniform length along every axis. If the data is scaled improperly, then the geometry of the sphere won’t mirror the geometry of the data, and the algorithm will flounder.

So now we’ve seen a smattering of topics about k-nearest-neighbors. We’d love to continue the discussion of modifications in the comments. Next time we’ll explore decision trees, and work with another data set. Until then!

The Cellular Automaton Method for Cave Generation

Dear reader, this post has an interactive simulation! We encourage you to play with it as you read the article below.

In our series of posts on cellular automata, we explored Conway’s classic Game of Life and discovered some interesting patterns therein. And then in our primers on computing theory, we built up a theoretical foundation for similar kinds of machines, including a discussion of Turing machines and the various computational complexity classes surrounding them. But cellular automata served us pretty exclusively as a toy. It was a basic model of computation, which we were interested in only for its theoretical universality. One wouldn’t expect too many immediately practical (and efficient) applications of something which needs a ridiculous scale to perform basic logic. In fact, it’s amazing that there are as many as there are.

In this post we’ll look at one particular application of cellular automata to procedural level generation in games.

An example of a non-randomly generated cave level from Bethesda’s The Elder Scrolls series.

The Need for More Caves

Level design in video games is a time-consuming and difficult task. It’s extremely difficult for humans to hand-craft areas that both look natural and are simultaneously fun to play in. This is particularly true of the multitude of contemporary role-playing games modeled after Dungeons and Dragons, in which players move through a series of areas defeating enemies, collecting items, and developing their character. With a high demand for such games and so many levels in each game, it would save an unfathomable amount of money to have computers generate the levels on the fly. Perhaps more importantly, a game with randomly generated levels inherently has a much higher replay value.

The idea of randomized content generation (often called procedural generation) is not particularly new. It has been around at least since the 1980′s. Back then, computers simply didn’t have enough space to store large, complex levels in memory. To circumvent this problem, video game designers simply generated the world as the player moved through it. This opened up an infinitude of possible worlds for the user to play in, and the seminal example of this is a game called Rogue, which has since inspired series such as Diablo, Dwarf Fortress, and many many others. The techniques used to design these levels have since been refined and expanded into a toolbox of techniques which have become ubiquitous in computer graphics and game development.

We’ll explore more of these techniques in the future, but for now we’ll see how a cellular automaton can be used to procedurally generate two-dimensional cave-like maps.

A Quick Review of Cellular Automata

While the interested reader can read more about cellular automata on this blog, we will give a quick refresher here.

For our purposes here, a 2-dimensional cellular automaton is a grid of cells G, where each cell c \in G is in one of a fixed number of states, and has a pre-determined and fixed set of neighbors. Then G is updated by applying a fixed rule to each cell simultaneously, and the process is repeated until something interesting happens or boredom strikes the observer. The most common kind of cellular automaton, called a ‘Life-like automaton,’ has only two states, ‘dead’ and ‘alive’ (for us, 0 and 1), and the rule applied to each cell is given as conditions to be ‘born’ or ‘survive’ based on the number of adjacent live cells. This is often denoted “Bx/Sy” where x and y are lists of single digit numbers. Furthermore, the choice of neighborhood is the eight nearest cells (i.e., including the diagonally-adjacent ones). For instance, B3/S23 is the cellular automaton rule where a cell is born if it has three living neighbors, and it survives if it has either two or three living neighbors, and dies otherwise. Technically, these are called ‘Life-like automata,’ because they are modest generalizations of Conway’s original Game of Life. We give an example of a B3/S23 cellular automaton initialized by a finite grid of randomly populated cells below. Note that each of the black (live) cells in the resulting stationary objects satisfy the S23 part of the rule, but none of the neighboring white (dead) cells satisfy the B3 condition.

A cellular automaton should really be defined for an arbitrary graph (or more generally, an arbitrary state space). There is really nothing special about a grid other than that it’s easy to visualize. Indeed, some cellular automata are designed for hexagonal grids, others are embedded on a torus, and still others are one- or three-dimensional. Of course, nothing stops automata from existing in arbitrary dimension, or from operating with arbitrary (albeit deterministic) rules, but to avoid pedantry we won’t delve into a general definition here. It would take us into a discussion of discrete dynamical systems (of which there are many, often with interesting pictures).

It All Boils Down to a Simple Rule

Now the particular cellular automaton we will use for cave generation is simply B678/S345678, applied to a random initial grid with a fixed live border. We interpret the live cells as walls, and the dead cells as open space. This rule should intuitively work: walls will stay walls even if more cells are born nearby, but isolated or near-isolated cells will often be removed. In other words, this cellular automaton should ‘smooth out’ a grid arrangement to some extent. Here is an example animation quickly sketched up in Mathematica to witness the automaton in action:

An example cave generated via the automaton rule B678/S345678. The black cells are alive, and the white cells are dead.

As usual, the code to generate this animation (which is only a slight alteration to the code used in our post on cellular automata) is available on this blog’s Google Code page.

This map is already pretty great! It has a number of large open caverns, and they are connected by relatively small passageways. With a bit of imagination, it looks absolutely cavelike!

We should immediately note that there is no guarantee that the resulting regions of whitespace will be connected. We got lucky with this animation, in that there are only two disconnected components, and one is quite small. But in fact one can be left with multiple large caves which have no connecting paths.

Furthermore, we should note the automaton’s rapid convergence to a stable state. Unlike Conway’s Game of Life, in practice this automaton almost always converges within 15 steps, and this author has yet to see any oscillatory patterns. Indeed, they are unlikely to exist because the survival rate is so high, and our initial grid has an even proportion of live and dead cells. There is no overpopulation that causes cells to die off, so once a cell is born it will always survive. The only cells that do not survive are those that begin isolated. In a sense, B678/S345678 is designed to prune the sparse areas of the grid, and fill in the dense areas by patching up holes.

We should also note that the initial proportion of cells which are alive has a strong effect on the density of the resulting picture.  For the animation we displayed above, we initially chose that 45% of the cells would be live. If we increase that a mere 5%, we get a picture like the following.

A cave generated with the initial proportion of live cells equal to 0.5

As expected, there are many more disconnected caverns. Some game designers prefer a denser grid combined with heuristic methods to connect the caverns. Since our goal is just to explore the mathematical ideas, we will leave this as a parameter in our final program.

Javascript Implementation, and Greater Resolution

One important thing to note is that B678/S345678 doesn’t scale well to fine grid sizes. For instance, if we increase the grid size to 200×200, we get something resembling an awkward camouflage pattern.

A 200×200 grid cave generation. Click the image to enlarge it.

What we really want is a way to achieve the major features of the low-resolution image on a larger grid. Since cellular automata are inherently local manipulations, we should not expect any modification of B678/S345678 to do this for us. Instead, we will use B678/345678 to create a low-resolution image, increase its resolution manually, and smooth it out with — you guessed it — another cellular automaton! We’ll design this automaton specifically for the purpose of smoothing out corners.

To increase the resolution, we may simply divide the cells into four pieces. The picture doesn’t change, but the total number of cells is squared. There are a few ways to do this programmatically, but the way we chose simply uses the smallest resolution possible, and simulates higher resolution by doing block computations. The interested programmer can view our Javascript program available on this blog’s Google code page to see this directly (or view the page source of this post’s interactive simulator).

To design a “smoothing automaton,” we should investigate more closely what we need to improve on in the above examples. In particular, once we increase the resolution, we will have a lot of undesirable convex and concave corners. Since a “corner” is simply a block satisfying certain local properties, we can single those out to be removed by an automaton. It’s easy to see that convex corners have exactly 3 live neighbors, so we should not allow those cells to survive. Similarly, the white cell just outside a concave corner has 5 live neighbors, so we should allow that cell to be born. On the other hand, we still want the major properties of our old B678/S345678 to still apply, so we can simply add 5 to the B part and remove 3 from the S part. Lastly, for empirical reasons, we also decide to kill off cells with 4 live neighbors.

And so our final “smoothing automaton” is simply B5678/S5678.

We present this application as an interactive javascript program. Some basic instructions:

  • The “Apply B678/S345678″ button does what you’d expect: it applies B678/S345678 to the currently displayed grid. It iterates the automaton 20 times in an animation.
  • The “Apply B5678/S5678″ button applies the smoothing automaton, but it does so only once, allowing the user to control the degree of smoothing at the specific resolution level.
  • The “Increase Resolution” button splits each cell into four, and may be applied until the cell size is down to a single pixel.
  • The “Reset” button resets the entire application, creating a new random grid.

We used this program to generate a few interesting looking pictures by varying the order in which we pressed the various buttons (it sounds silly, but it’s an exploration!). First, a nice cave:

An example of a higher resolution cave created with our program. In order to achieve similar results, First apply B678/S345678, and then alternate increasing the resolution and applying B5678/S5678 1-3 times.

We note that this is not perfect. There are some obvious and awkward geometric artifacts lingering in this map, mostly in the form of awkwardly straight diagonal lines and awkwardly flawless circles. Perhaps one might imagine the circles are the bases of stalactites or stalagmites. But on the whole, in terms of keeping the major features of the original automaton present while smoothing out corners, this author thinks B5678/S5678 has done a phenomenal job. Further to the cellular automaton’s defense, when the local properties are applied uniformly across the entire grid, such regularities are bound to occur. That’s just another statement of the non-chaotic nature of B5678/S5678 (in stark contrast to Conway’s Game of Life).

There are various modifications one could perform (or choose not to, depending on the type of game) to make the result more accessible for the player. For instance, one could remove all regions which fit inside a sufficiently small circle, or add connections between the disconnected components at some level of resolution. This would require some sort of connected-component labeling, which is a nontrivial task; current research goes into optimizing connected-component algorithms for large-scale grids. We plan to cover such topics on this blog in the future.

Another example of a cool picture we created with this application might be considered a more “retro” style of cave.

Apply S678/B345678 once, and increase the resolution as much as possible before applying B5678/S5678 as many times as desired.

We encourage the reader to play around with the program to see what other sorts of creations one can make. As of the time of this writing, changing the initial proportion of live cells (50%) or changing the automaton rules cannot be done in the browser; it requires one to modify the source code. We may implement the ability to control these in the browser given popular demand, but (of course) it would be a wonderful exercise for the intermediate Javascript programmer.

Caves in Three Dimensions

It’s clear that this same method can be extended to a three-dimensional model for generating caverns in a game like Minecraft. While we haven’t personally experimented with three-dimensional cellular automata here on this blog, it’s far from a new idea. Once we reach graphics programming on this blog (think: distant future) we plan to revisit the topic and see what we can do.

Until then!

Dynamic Time Warping for Sequence Comparison

Problem: Write a program that compares two sequences of differing lengths for similarity.

Solution: (In Python)

import math

def dynamicTimeWarp(seqA, seqB, d = lambda x,y: abs(x-y)):
    # create the cost matrix
    numRows, numCols = len(seqA), len(seqB)
    cost = [[0 for _ in range(numCols)] for _ in range(numRows)]

    # initialize the first row and column
    cost[0][0] = d(seqA[0], seqB[0])
    for i in xrange(1, numRows):
        cost[i][0] = cost[i-1][0] + d(seqA[i], seqB[0])

    for j in xrange(1, numCols):
        cost[0][j] = cost[0][j-1] + d(seqA[0], seqB[j])

    # fill in the rest of the matrix
    for i in xrange(1, numRows):
        for j in xrange(1, numCols):
            choices = cost[i-1][j], cost[i][j-1], cost[i-1][j-1]
            cost[i][j] = min(choices) + d(seqA[i], seqB[j])

    for row in cost:
       for entry in row:
          print "%03d" % entry,
       print ""
    return cost[-1][-1]

DiscussionComparing sequences of numbers can be tricky business. The simplest way to do so is simply component-wise. However, this will often disregard more abstract features of a sequence that we intuitively understand as “shape.”

For example, consider the following two sequences.

0 0 0 3 6 13 25 22 7 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 5 12 24 23 8 3 1 0 0 0 0 0

They both have the same characteristics — a bump of height roughly 25 and length 8 — but comparing the two sequences entrywise would imply they are not similar. According to the standard Euclidean norm, they are 52 units apart. For motivation, according to the dynamic time warping function above, they are a mere 7 units apart. Indeed, if the two bumps consisted of the same numbers, the dynamic time warp distance between the entire sequences would be zero.

These kinds of sequences show up in many applications. For instance, in speech recognition software one often has many samples of the a single person speaking, but there is a difference in the instant during the sample at which the person begins speaking. Similarly, the rate at which the person speaks may be slightly different. In either event, we want the computed “distance” between two such samples to be small. That is, we’re looking for a measurement which is time-insensitive both in scaling and in position. Historically, the dynamic time warping solution was designed in 1978 to solve this problem.

The idea is similar to the Levenshtein metric we implemented for “edit distance” measurements in our Metrics on Words post. Instead of inserting or deleting sequence elements, we may optionally “pause” our usual elementwise comparison on one sequence while continuing on the other. The trick is in deciding whether to “pause,” and when to switch the “pausing” from one sequence to the other; this will require a dynamic program (as did the Levenshtein metric).

In order to compare two sequences, one needs to have some notion of a local comparison. That is, we need to be able to compare any entry from one sequence to any entry in the other. While there are many such options that depend on the data type used in the sequence, we will use the following simple numeric metric:

\displaystyle d(x,y) = |x-y|

This is just the Euclidean metric in \mathbb{R}. More generally, this assumes that two numbers are similar when they are close together (and this is in fact an important assumption; not all number systems are like this).

Now given two sequences a_i, b_j, we can compare them by comparing their local distance for a specially chosen set of indices given by m_k for a_i and n_k for b_j. That is, the dynamic time warping distance will end up being the quantity:

\displaystyle C(a_i, b_j) = \sum_{k=0}^{M} d(a_{m_k}, b_{n_k})

Of course, we should constrain the indices m_k, n_k so that the result is reasonable. A good way to do that is to describe the conditions we want it to satisfy, and then figure out how to compute such indices. In particular, let us assume that a_i has length M, b_j has length N. Then we have the following definition.

Definition: A warping path for a_i, b_j is a pair of sequences (m_k, n_k), both of some length L, satisfying the following conditions:

  1. 1 \leq m_k \leq M and 1 \leq n_k \leq N for all k.
  2. The sequences have endpoints (m_1, n_1) = (1,1), (m_L, n_L) = (M, N)
  3. The seqences m_k, n_k are monotonically increasing.
  4. (m_k - m_{k-1}, n_k - n_{k-1}) must be one of (1,0), (0,1), (1,1).

The first condition is obvious, or else the m_k, n_k could not be indexing a_i, b_j. The second condition ensures that we use all of the information in both sequences in our comparison. The third condition implies that we cannot “step backward” in time as we compare local sequence entries. We wonder offhand if anything interesting could be salvaged from the mess that would ensue if one left this condition out. And the fourth condition allows the index of one sequence to be stopped while the other continues. This condition creates the “time warping” effect, that some parts of the sequence can be squeezed or stretched in comparison with the other sequence.

We also note as an aside that the fourth condition listed implies the third.

Of course there are many valid warping paths for any two sequences, but we need to pick one which has our desired feature. That is, it minimizes the sum of the local comparisons (the formula for C(a_i, b_j) above). We denote this optimal value as DTW(a_i, b_j).

The fourth condition in the list above should give it away that to compute the optimal path requires a dynamic program. Specifically, the optimal path can be computed by solving the three sub-problems of finding the optimal warping path for

\displaystyle (a_{1 \dots M-1}, b_{1 \dots N}), (a_{1 \dots M}, b_{1 \dots N-1}), \textup{ and } (a_{1 \dots M-1}, b_{1 \dots N-1})

The clueless reader should refer to this blog’s primer on Python and dynamic programming. In any event, the program implementing this dynamic program is given in the solution above.

A visualization of the dynamic time warp cost matrix for two sequences. The algorithm attempts to find the least expensive path from the bottom left to the top right, where the darker regions correspond to low local cost, and the lighter regions to high local cost. The arrows point in the forward direction along each sequence, showing the monotonicity requirement of an optimal warping path.

The applications of this technique certainly go beyond speech recognition. Dynamic time warping can essentially be used to compare any data which can be represented as one-dimensional sequences. This includes video, graphics, financial data, and plenty of others.

We may also play around with which metric is used in the algorithm. When the elements of the lists are themselves points in Euclidean space, you can swap out the standard Euclidean metric with metrics like the Manhattan metric, the maximum metric, the discrete metric, or your mother’s favorite L^p norm.

While we use a metric for elementwise comparisons in the algorithm above, the reader must note that the dynamic time warping distance is not a metricIn fact, it’s quite far from a metric. The casual reader can easily come up with an example of two non-identical sequences x, y for which DTW(x,y) = 0, hence denying positive-definiteness. The more intrepid reader will come up with three sequences which give a counterexample to satisfy the triangle inequality (hint: using the discrete metric as the local comparison metric makes things easier).

In fact, the failure to satisfy positive-definiteness and the triangle inequality means that the dynamic time warping “distance” is not even a semimetric. To be completely pedantic, it would fit in as a symmetric premetric. Unfortunately, this means that we don’t get the benefits of geometry or topology to analyze dynamic time warping as a characteristic feature of the space of all numeric sequences. In any event, it’s still proven to be useful in applications, so it belongs here in the program gallery.