Load Balancing and the Power of Hashing

Here’s a bit of folklore I often hear (and retell) that’s somewhere between a joke and deep wisdom: if you’re doing a software interview that involves some algorithms problem that seems hard, your best bet is to use hash tables.

More succinctly put: Google loves hash tables.

As someone with a passion for math and theoretical CS, it’s kind of silly and reductionist. But if you actually work with terabytes of data that can’t fit on a single machine, it also makes sense.

But to understand why hash tables are so applicable, you should have at least a fuzzy understanding of the math that goes into it, which is surprisingly unrelated to the actual act of hashing. Instead it’s the guarantees that a “random enough” hash provides that makes it so useful. The basic intuition is that if you have an algorithm that works well assuming the input data is completely random, then you can probably get a good guarantee by preprocessing the input by hashing.

In this post I’ll explain the details, and show the application to an important problem that one often faces in dealing with huge amounts of data: how to allocate resources efficiently (load balancing). As usual, all of the code used in the making of this post is available on Github.

Next week, I’ll follow this post up with another application of hashing to estimating the number of distinct items in a set that’s too large to store in memory.

Families of Hash Functions

To emphasize which specific properties of hash functions are important for a given application, we start by introducing an abstraction: a hash function is just some computable function that accepts strings as input and produces numbers between 1 and $ n$ as output. We call the set of allowed inputs $ U$ (for “Universe”). A family of hash functions is just a set of possible hash functions to choose from. We’ll use a scripty $ \mathscr{H}$ for our family, and so every hash function $ h$ in $ \mathscr{H}$ is a function $ h : U \to \{ 1, \dots, n \}$.

You can use a single hash function $ h$ to maintain an unordered set of objects in a computer. The reason this is a problem that needs solving is because if you were to store items sequentially in a list, and if you want to determine if a specific item is already in the list, you need to potentially check every item in the list (or do something fancier). In any event, without hashing you have to spend some non-negligible amount of time searching. With hashing, you can choose the location of an element $ x \in U$ based on the value of its hash $ h(x)$. If you pick your hash function well, then you’ll have very few collisions and can deal with them efficiently. The relevant section on Wikipedia has more about the various techniques to deal with collisions in hash tables specifically, but we want to move beyond that in this post.

Here we have a family of random hash functions. So what’s the use of having many hash functions? You can pick a hash randomly from a “good” family of hash functions. While this doesn’t seem so magical, it has the informal property that it makes arbitrary data “random enough,” so that an algorithm which you designed to work with truly random data will also work with the hashes of arbitrary data. Moreover, even if an adversary knows $ \mathscr{H}$ and knows that you’re picking a hash function at random, there’s no way for the adversary to manufacture problems by feeding bad data. With overwhelming probability the worst-case scenario will not occur. Our first example of this is in load-balancing.

Load balancing and 2-uniformity

You can imagine load balancing in two ways, concretely and mathematically. In the concrete version you have a public-facing server that accepts requests from users, and forwards them to a back-end server which processes them and sends a response to the user. When you have a billion users and a million servers, you want to forward the requests in such a way that no server gets too many requests, or else the users will experience delays. Moreover, you’re worried that the League of Tanzanian Hackers is trying to take down your website by sending you requests in a carefully chosen order so as to screw up your load balancing algorithm.

The mathematical version of this problem usually goes with the metaphor of balls and bins. You have some collection of $ m$ balls and $ n$ bins in which to put the balls, and you want to put the balls into the bins. But there’s a twist: an adversary is throwing balls at you, and you have to put them into the bins before the next ball comes, so you don’t have time to remember (or count) how many balls are in each bin already. You only have time to do a small bit of mental arithmetic, sending ball $ i$ to bin $ f(i)$ where $ f$ is some simple function. Moreover, whatever rule you pick for distributing the balls in the bins, the adversary knows it and will throw balls at you in the worst order possible.

silk-balls.jpg
A young man applying his knowledge of balls and bins. That’s totally what he’s doing.

There is one obvious approach: why not just pick a uniformly random bin for each ball? The problem here is that we need the choice to be persistent. That is, if the adversary throws the same ball at us a second time, we need to put it in the same bin as the first time, and it doesn’t count toward the overall load. This is where the ball/bin metaphor breaks down. In the request/server picture, there is data specific to each user stored on the back-end server between requests (a session), and you need to make sure that data is not lost for some reasonable period of time. And if we were to save a uniform random choice after each request, we’d need to store a number for every request, which is too much. In short, we need the mapping to be persistent, but we also want it to be “like random” in effect.

So what do you do? The idea is to take a “good” family of hash functions $ \mathscr{H}$, pick one $ h \in \mathscr{H}$ uniformly at random for the whole game, and when you get a request/ball $ x \in U$ send it to server/bin $ h(x)$. Note that in this case, the adversary knows your universal family $ \mathscr{H}$ ahead of time, and it knows your algorithm of committing to some single randomly chosen $ h \in \mathscr{H}$, but the adversary does not know which particular $ h$ you chose.

The property of a family of hash functions that makes this strategy work is called 2-universality.

Definition: A family of functions $ \mathscr{H}$ from some universe $ U \to \{ 1, \dots, n \}$. is called 2-universal if, for every two distinct $ x, y \in U$, the probability over the random choice of a hash function $ h$ from $ \mathscr{H}$ that $ h(x) = h(y)$ is at most $ 1/n$. In notation,

$ \displaystyle \Pr_{h \in \mathscr{H}}[h(x) = h(y)] \leq \frac{1}{n}$

I’ll give an example of such a family shortly, but let’s apply this to our load balancing problem. Our load-balancing algorithm would fail if, with even some modest probability, there is some server that receives many more than its fair share ($ m/n$) of the $ m$ requests. If $ \mathscr{H}$ is 2-universal, then we can compute an upper bound on the expected load of a given server, say server 1. Specifically, pick any element $ x$ which hashes to 1 under our randomly chosen $ h$. Then we can compute an upper bound on the expected number of other elements that hash to 1. In this computation we’ll only use the fact that expectation splits over sums, and the definition of 2-universal. Call $ \mathbf{1}_{h(y) = 1}$ the random variable which is zero when $ h(y) \neq 1$ and one when $ h(y) = 1$, and call $ X = \sum_{y \in U} \mathbf{1}_{h(y) = 1}$. In words, $ X$ simply represents the number of inputs that hash to 1. Then

exp-calc

So in expectation we can expect server 1 gets its fair share of requests. And clearly this doesn’t depend on the output hash being 1; it works for any server. There are two obvious questions.

  1. How do we measure the risk that, despite the expectation we computed above, some server is overloaded?
  2. If it seems like (1) is on track to happen, what can you do?

For 1 we’re asking to compute, for a given deviation $ t$, the probability that $ X – \mathbb{E}[X] > t$. This makes more sense if we jump to multiplicative factors, since it’s usually okay for a server to bear twice or three times its usual load, but not like $ \sqrt{n}$ times more than it’s usual load. (Industry experts, please correct me if I’m wrong! I’m far from an expert on the practical details of load balancing.)

So we want to know what is the probability that $ X – \mathbb{E}[X] > t \cdot \mathbb{E}[X]$ for some small number $ t$, and we want this to get small quickly as $ t$ grows. This is where the Chebyshev inequality becomes useful. For those who don’t want to click the link, for our sitauation Chebyshev’s inequality is the statement that, for any random variable $ X$

$ \displaystyle \Pr[|X – \mathbb{E}[X]| > t\mathbb{E}[X]] \leq \frac{\textup{Var}[X]}{t^2 \mathbb{E}^2[X]}.$

So all we need to do is compute the variance of the load of a server. It’s a bit of a hairy calculation to write down, but rest assured it doesn’t use anything fancier than the linearity of expectation and 2-universality. Let’s dive in. We start by writing the definition of variance as an expectation, and then we split $ X$ up into its parts, expand the product and group the parts.

$ \displaystyle \textup{Var}[X] = \mathbb{E}[(X – \mathbb{E}[X])^2] = \mathbb{E}[X^2] – (\mathbb{E}[X])^2$

The easy part is $ (\mathbb{E}[X])^2$, it’s just $ (1 + (m-1)/n)^2$, and the hard part is $ \mathbb{E}[X^2]$. So let’s compute that

esquared-calcluation

In order to continue (and get a reasonable bound) we need an additional property of our hash family which is not immediately spelled out by 2-universality. Specifically, we need that for every $ h$ and $ i$, $ \Pr_x[h(x) = i] = O(\frac{1}{n})$. In other words, each hash function should evenly split the inputs across servers.

The reason this helps is because we can split $ \Pr[h(x) = h(y) = 1]$  into $ \Pr[h(x) = h(y) \mid h(x) = 1] \cdot \Pr[h(x) = 1]$. Using 2-universality to bound the left term, this quantity is at most $ 1/n^2$, and since there are $ \binom{m}{2}$ total terms in the double sum above, the whole thing is at most $ O(m/n + m^2 / n^2) = O(m^2 / n^2)$. Note that in our big-O analysis we’re assuming $ m$ is much bigger than $ n$.

Sweeping some of the details inside the big-O, this means that our variance is $ O(m^2/n^2)$, and so our bound on the deviation of $ X$ from its expectation by a multiplicative factor of $ t$ is at most $ O(1/t^2)$.

Now we computed a bound on the probability that a single server is not overloaded, but if we want to extend that to the worst-case server, the typical probability technique is to take the union bound over all servers. This means we just add up all the individual bounds and ignore how they relate. So the probability that some server has a load more than a multiplicative factor of $ t$ is bounded from above $ O(n/t^2)$. This is only less than one when $ t = \Omega(\sqrt{n})$, so all we can say with this analysis is that (with some small constant probability) no server will have a load worse than $ \sqrt{n}$ times more than the expected load.

So we have this analysis that seems not so good. If we have a million servers then the worst load on one server could potentially be a thousand times higher than the expected load. This doesn’t scale, and the problem could be in any (or all) of three places:

  1. Our analysis is weak, and we should use tighter bounds because the true max load is actually much smaller.
  2. Our hash families don’t have strong enough properties, and we should beef those up to get tighter bounds.
  3. The whole algorithm sucks and needs to be improved.

It turns out all three are true. One heuristic solution is easy and avoids all math. Have some second server (which does not process requests) count hash collisions. When some server exceeds a factor of $ t$ more than the expected load, send a message to the load balancer to randomly pick a new hash function from $ \mathscr{H}$ and for any requests that don’t have existing sessions (this is included in the request data), use the new hash function. Once the old sessions expire, switch any new incoming requests from those IPs over to the new hash function.

But there are much better solutions out there. Unfortunately their analyses are too long for a blog post (they fill multiple research papers). Fortunately their descriptions and guarantees are easy to describe, and they’re easy to program. The basic idea goes by the name “the power of two choices,” which we explored on this blog in a completely different context of random graphs.

In more detail, the idea is that you start by picking two random hash functions $ h_1, h_2 \in \mathscr{H}$, and when you get a new request, you compute both hashes, inspect the load of the two servers indexed by those hashes, and send the request to the server with the smaller load.

This has the disadvantage of requiring bidirectional talk between the load balancer and the server, rather than obliviously forwarding requests. But the advantage is an exponential decrease in the worst-case maximum load. In particular, the following theorem holds for the case where the hashes are fully random.

Theorem: Suppose one places $ m$ balls into $ n$ bins in order according to the following procedure: for each ball pick two uniformly random and independent integers $ 1 \leq i,j \leq n$, and place the ball into the bin with the smallest current size. If there are ties pick the bin with the smaller index. Then with high probability the largest bin has no more than $ \Theta(m/n) + O(\log \log (n))$ balls.

This theorem appears to have been proved in a few different forms, with the best analysis being by Berenbrink et al. You can improve the constant on the $ \log \log n$ by computing more than 2 hashes. How does this relate to a good family of hash functions, which is not quite fully random? Let’s explore the answer by implementing the algorithm in python.

An example of universal hash functions, and the load balancing algorithm

In order to implement the load balancer, we need to have some good hash functions under our belt. We’ll go with the simplest example of a hash function that’s easy to prove nice properties for. Specifically each hash in our family just performs some arithmetic modulo a random prime.

Definition: Pick any prime $ p > m$, and for any $ 1 \leq a < p$ and $ 0 \leq b \leq n$ define $ h_{a,b}(x) = (ax + b \mod p) \mod m$. Let $ \mathscr{H} = \{ h_{a,b} \mid 0 \leq b < p, 1 \leq a < p \}$.

This family of hash functions is 2-universal.

Theorem: For every $ x \neq y \in \{0, \dots, p\}$,

$ \Pr_{h \in \mathscr{H}}[h(x) = h(y)] \leq 1/p$

Proof. To say that $ h(x) = h(y)$ is to say that $ ax+b = ay+b + i \cdot m \mod p$ for some integer $ i$. I.e., the two remainders of $ ax+b$ and $ ay+b$ are equivalent mod $ m$. The $ b$’s cancel and we can solve for $ a$

$ a = im (x-y)^{-1} \mod p$

Since $ a \neq 0$, there are $ p-1$ possible choices for $ a$. Moreover, there is no point to pick $ i$ bigger than $ p/m$ since we’re working modulo $ p$. So there are $ (p-1)/m$ possible values for the right hand side of the above equation. So if we chose them uniformly at random, (remember, $ x-y$ is fixed ahead of time, so the only choice is $ a, i$), then there is a $ (p-1)/m$ out of $ p-1$ chance that the equality holds, which is at most $ 1/m$. (To be exact you should account for taking a floor of $ (p-1)/m$ when $ m$ does not evenly divide $ p-1$, but it only decreases the overall probability.)

$ \square$

If $ m$ and $ p$ were equal then this would be even more trivial: it’s just the fact that there is a unique line passing through any two distinct points. While that’s obviously true from standard geometry, it is also true when you work with arithmetic modulo a prime. In fact, it works using arithmetic over any field.

Implementing these hash functions is easier than shooting fish in a barrel.

import random

def draw(p, m):
   a = random.randint(1, p-1)
   b = random.randint(0, p-1)

   return lambda x: ((a*x + b) % p) % m

To encapsulate the process a little bit we implemented a UniversalHashFamily class which computes a random probable prime to use as the modulus and stores $ m$. The interested reader can see the Github repository for more.

If we try to run this and feed in a large range of inputs, we can see how the outputs are distributed. In this example $ m$ is a hundred thousand and $ n$ is a hundred (it’s not two terabytes, but give me some slack it’s a demo and I’ve only got my desktop!). So the expected bin size for any 2-universal family is just about 1,000.

>>> m = 100000
>>> n = 100
>>> H = UniversalHashFamily(numBins=n, primeBounds=[n, 2*n])
>>> results = []
>>> for simulation in range(100):
...    bins = [0] * n
...    h = H.draw()
...    for i in range(m):
...       bins[h(i)] += 1
...    results.append(max(bins))
...
>>> max(bins) # a single run
1228
>>> min(bins)
613
>>> max(results) # the max bin size over all runs
1228
>>> min(results)
1227

Indeed, the max is very close to the expected value.

But this example is misleading, because the point of this was that some adversary would try to screw us over by picking a worst-case input. If the adversary knew exactly which $ h$ was chosen (which it doesn’t) then the worst case input would be the set of all inputs that have the given hash output value. Let’s see it happen live.

>>> h = H.draw()
>>> badInputs = [i for i in range(m) if h(i) == 9]
>>>; len(badInputs)
1227
>>> testInputs(n,m,badInputs,hashFunction=h)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1227, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The expected size of a bin is 12, but as expected this is 100 times worse (linearly worse in $ n$). But if we instead pick a random $ h$ after the bad inputs are chosen, the result is much better.

>>> testInputs(n,m,badInputs) # randomly picks a hash
[19, 20, 20, 19, 18, 18, 17, 16, 16, 16, 16, 17, 18, 18, 19, 20, 20, 19, 18, 17, 17, 16, 16, 16, 16, 17, 18, 18, 19, 20, 20, 19, 18, 17, 17, 16, 16, 16, 16, 8, 8, 9, 9, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 9, 8, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 9, 8, 8, 8, 8, 8, 8, 8, 9, 9, 10]

However, if you re-ran this test many times, you’d eventually get unlucky and draw the hash function for which this actually is the worst input, and get a single huge bin. Other times you can get a bad hash in which two or three bins have all the inputs.

An interesting question is, what is really the worst-case input for this algorithm? I suspect it’s characterized by some choice of hash output values, taking all inputs for the chosen outputs. If this is the case, then there’s a tradeoff between the number of inputs you pick and how egregious the worst bin is. As an exercise to the reader, empirically estimate this tradeoff and find the best worst-case input for the adversary. Also, for your choice of parameters, estimate by simulation the probability that the max bin is three times larger than the expected value.

Now that we’ve played around with the basic hashing algorithm and made a family of 2-universal hashes, let’s see the power of two choices. Recall, this algorithm picks two random hash functions and sends an input to the bin with the smallest size. This obviously generalizes to $ k$ choices, although the theoretical guarantee only improves by a constant factor, so let’s implement the more generic version.

class ChoiceHashFamily(object):
   def __init__(self, hashFamily, queryBinSize, numChoices=2):
      self.queryBinSize = queryBinSize
      self.hashFamily = hashFamily
      self.numChoices = numChoices

   def draw(self):
      hashes = [self.hashFamily.draw()
                   for _ in range(self.numChoices)]

      def h(x):
         indices = [h(x) for h in hashes]
         counts = [self.queryBinSize(i) for i in indices]
         count, index = min([(c,i) for (c,i) in zip(counts,indices)])
         return index

      return h

And if we test this with the bad inputs (as used previously, all the inputs that hash to 9), as a typical output we get

>>> bins
[15, 16, 15, 15, 16, 14, 16, 14, 16, 15, 16, 15, 15, 15, 17, 14, 16, 14, 16, 16, 15, 16, 15, 16, 15, 15, 17, 15, 16, 15, 15, 15, 15, 16, 15, 14, 16, 14, 16, 15, 15, 15, 14, 16, 15, 15, 15, 14, 17, 14, 15, 15, 14, 16, 13, 15, 14, 15, 15, 15, 14, 15, 13, 16, 14, 16, 15, 15, 15, 16, 15, 15, 13, 16, 14, 15, 15, 16, 14, 15, 15, 15, 11, 13, 11, 12, 13, 14, 13, 11, 11, 12, 14, 14, 13, 10, 16, 12, 14, 10]

And a typical list of bin maxima is

>>> results
[16, 16, 16, 18, 17, 365, 18, 16, 16, 365, 18, 17, 17, 17, 17, 16, 16, 17, 18, 16, 17, 18, 17, 16, 17, 17, 18, 16, 18, 17, 17, 17, 17, 18, 18, 17, 17, 16, 17, 365, 17, 18, 16, 16, 18, 17, 16, 18, 365, 16, 17, 17, 16, 16, 18, 17, 17, 17, 17, 17, 18, 16, 18, 16, 16, 18, 17, 17, 365, 16, 17, 17, 17, 17, 16, 17, 16, 17, 16, 16, 17, 17, 16, 365, 18, 16, 17, 17, 17, 17, 17, 18, 17, 17, 16, 18, 18, 17, 17, 17]

Those big bumps are the times when we picked an unlucky hash function, which is scarily large, although this bad event would be proportionally less likely as you scale up. But in the good case the load is clearly more even than the previous example, and the max load would get linearly smaller as you pick between a larger set of randomly chosen hashes (obviously).

Coupling this with the technique of switching hash functions when you start to observe a large deviation, and you have yourself an elegant solution.

In addition to load balancing, hashing has a ton of applications. Remember, the main key that you may want to use hashing is when you have an algorithm that works well when the input data is random. This comes up in streaming and sublinear algorithms, in data structure design and analysis, and many other places. We’ll be covering those applications in future posts on this blog.

Until then!

The Čech Complex and the Vietoris-Rips Complex

It’s about time we got back to computational topology. Previously in this series we endured a lightning tour of the fundamental group and homology, then we saw how to compute the homology of a simplicial complex using linear algebra.

What we really want to do is talk about the inherent shape of data. Homology allows us to compute some qualitative features of a given shape, i.e., find and count the number of connected components or a given shape, or the number of “2-dimensional holes” it has. This is great, but data doesn’t come in a form suitable for computing homology. Though they may have originated from some underlying process that follows nice rules, data points are just floating around in space with no obvious connection between them.

Here is a cool example of Thom Yorke, the lead singer of the band Radiohead, whose face was scanned with a laser scanner for their music video “House of Cards.”

radiohead-still

Radiohead’s Thom Yorke in the music video for House of Cards (click the image to watch the video).

Given a point cloud such as the one above, our long term goal (we’re just getting started in this post) is to algorithmically discover what the characteristic topological features are in the data. Since homology is pretty coarse, we might detect the fact that the point cloud above looks like a hollow sphere with some holes in it corresponding to nostrils, ears, and the like. The hope is that if the data set isn’t too corrupted by noise, then it’s a good approximation to the underlying space it is sampled from. By computing the topological features of a point cloud we can understand the process that generated it, and Science can proceed.

But it’s not always as simple as Thom Yorke’s face. It turns out the producers of the music video had to actually degrade the data to get what you see above, because their lasers were too precise and didn’t look artistic enough! But you can imagine that if your laser is mounted on a car on a bumpy road, or tracking some object in the sky, or your data comes from acoustic waves traveling through earth, you’re bound to get noise. Or more realistically, if your data comes from thousands of stock market prices then the process generating the data is super mysterious. It changes over time, it may not follow any discernible pattern (though speculators may hope it does), and you can’t hope to visualize the entire dataset in any useful way.

But with persistent homology, so the claim goes, you’d get a good qualitative understanding of the dataset. Your results would be resistant to noise inherent in the data. It also wouldn’t be sensitive to the details of your data cleaning process. And with a dash of ingenuity, you can come up with a reasonable mathematical model of the underlying generative process. You could use that model to design algorithms, make big bucks, discover new drugs, recognize pictures of cats, or whatever tickles your fancy.

But our first problem is to resolve the input data type error. We want to use homology to describe data, but our data is a point cloud and homology operates on simplicial complexes. In this post we’ll see two ways one can do this, and see how they’re related.

The Čech complex

Let’s start with the Čech complex. Given a point set $ X$ in some metric space and a number $ \varepsilon > 0$, the Čech complex $ C_\varepsilon$ is the simplicial complex whose simplices are formed as follows. For each subset $ S \subset X$ of points, form a $ (\varepsilon/2)$-ball around each point in $ S$, and include $ S$ as a simplex (of dimension $ |S|$) if there is a common point contained in all of the balls in $ S$. This structure obviously satisfies the definition of a simplicial complex: any sub-subset $ S’ \subset S$ of a simplex $ S$ will be also be a simplex. Here is an example of the epsilon balls.

Image credit: Robert Ghrist

An example of a point cloud (left) and a corresponding choice of (epsilon/2)-balls. To get the Cech complex, we add a k-simplex any time we see a subset of k points with common intersection.  [Image credit: Robert Ghrist]

Let me superscript the Čech complex to illustrate the pieces. Specifically, we’ll let $ C_\varepsilon^{j}$ denote all the simplices of dimension up to $ j$. In particular, $ C_\varepsilon^1$ is a graph where an edge is placed between $ x,y$ if $ d(x,y) < \varepsilon$, and $ C_{\varepsilon}^2$ places triangles (2-simplices) on triples of points whose balls have a three-way intersection.

A topologist will have a minor protest here: the simplicial complex is supposed to resemble the structure inherent in the underlying points, but how do we know that this abstract simplicial complex (which is really hard to visualize!) resembles the topological space we used to make it? That is, $ X$ was sitting in some metric space, and the union of these epsilon-balls forms some topological space $ X(\varepsilon)$ that is close in structure to $ X$. But is the Čech complex $ C_\varepsilon$ close to $ X(\varepsilon)$? Do they have the same topological structure? It’s not a trivial theorem to prove, but it turns out to be true.

The Nerve Theorem: The homotopy types of $ X(\varepsilon)$ and $ C_\varepsilon$ are the same.

We won’t remind the readers about homotopy theory, but suffice it to say that when two topological spaces have the same homotopy type, then homology can’t distinguish them. In other words, if homotopy type is too coarse for a discriminator for our dataset, then persistent homology will fail us for sure.

So this theorem is a good sanity check. If we want to learn about our point cloud, we can pick a $ \varepsilon$ and study the topology of the corresponding Čech complex $ C_\varepsilon$. The reason this is called the “Nerve Theorem” is because one can generalize it to an arbitrary family of convex sets. Given some family $ F$ of convex sets, the nerve is the complex obtained by adding simplices for mutually overlapping subfamilies in the same way. The nerve theorem is actually more general, it says that with sufficient conditions on the family $ F$ being “nice,” the resulting Čech complex has the same topological structure as $ F$.

The problem is that Čech complexes are tough to compute. To tell whether there are any 10-simplices (without additional knowledge) you have to inspect all subsets of size 10. In general computing the entire complex requires exponential time in the size of $ X$, which is extremely inefficient. So we need a different kind of complex, or at least a different representation to compensate.

The Vietoris-Rips complex

The Vietoris-Rips complex is essentially the same as the Čech complex, except instead of adding a $ d$-simplex when there is a common point of intersection of all the $ (\varepsilon/2)$-balls, we just do so when all the balls have pairwise intersections. We’ll denote the Vietoris-Rips complex with parameter $ \varepsilon$ as $ VR_{\varepsilon}$.

Here is an example to illustrate: if you give me three points that are the vertices of an equilateral triangle of side length 1, and I draw $ (1/2)$-balls around each point, then they will have all three pairwise intersections but no common point of intersection.

intersection

Three balls which intersect pairwise, but have no point of triple intersection. With appropriate parameters, the Cech and V-R complexes are different.

So in this example the Vietoris-Rips complex is a graph with a 2-simplex, while the Čech complex is just a graph.

One obvious question is: do we still get the benefits of the nerve theorem with Vietoris-Rips complexes? The answer is no, obviously, because the Vietoris-Rips complex and Čech complex in this triangle example have totally different topology! But everything’s not lost. What we can do instead is compare Vietoris-Rips and Čech complexes with related parameters.

Theorem: For all $ \varepsilon > 0$, the following inclusions hold

$ \displaystyle C_{\varepsilon} \subset VR_{\varepsilon} \subset C_{2\varepsilon}$

So if the Čech complexes for both $ \varepsilon$ and $ 2\varepsilon$ are good approximations of the underlying data, then so is the Vietoris-Rips complex. In fact, you can make this chain of inclusions slightly tighter, and if you’re interested you can see Theorem 2.5 in this recent paper of de Silva and Ghrist.

Now your first objection should be that computing a Vietoris-Rips complex still requires exponential time, because you have to scan all subsets for the possibility that they form a simplex. It’s true, but one nice thing about the Vietoris-Rips complex is that it can be represented implicitly as a graph. You just include an edge between two points if their corresponding balls overlap. Once we want to compute the actual simplices in the complex we have to scan for cliques in the graph, so that sucks. But it turns out that computing the graph is the first step in other more efficient methods for computing (or approximating) the VR complex.

Let’s go ahead and write a (trivial) program that computes the graph representation of the Vietoris-Rips complex of a given data set.

import numpy
def naiveVR(points, epsilon):
   points = [numpy.array(x) for x in points]   
   vrComplex = [(x,y) for (x,y) in combinations(points, 2) if norm(x - y) &lt; 2*epsilon]
   return numpy.array(vrComplex)

Let’s try running it on a modestly large example: the first frame of the Radiohead music video. It’s got about 12,000 points in $ \mathbb{R}^4$ (x,y,z,intensity), and sadly it takes about twenty minutes. There are a couple of ways to make it more efficient. One is to use specially-crafted data structures for computing threshold queries (i.e., find all points within $ \varepsilon$ of this point). But those are only useful for small thresholds, and we’re interested in sweeping over a range of thresholds. Another is to invoke approximations of the data structure which give rise to “approximate” Vietoris-Rips complexes.

Other stuff

In a future post we’ll implement a method for speeding up the computation of the Vietoris-Rips complex, since this is the primary bottleneck for topological data analysis. But for now the conceptual idea of how Čech complexes and Vietoris-Rips complexes can be used to turn point clouds into simplicial complexes in reasonable ways.

Before we close we should mention that there are other ways to do this. I’ve chosen the algebraic flavor of topological data analysis due to my familiarity with algebra and the work based on this approach. The other approaches have a more geometric flavor, and are based on the Delaunay triangulation, a hallmark of computational geometry algorithms. The two approaches I’ve heard of are called the alpha complex and the flow complex. The downside of these approaches is that, because they are based on the Delaunay triangulation, they have poor scaling in the dimension of the data. Because high dimensional data is crucial, many researchers have been spending their time figuring out how to speed up approximations of the V-R complex. See these slides of Afra Zomorodian for an example.

Until next time!

RealityMining, a Case Study in the Woes of Data Processing

This post is intended to be a tutorial on how to access the RealityMining dataset using Python (because who likes Matlab?), and a rant on how annoying the process was to figure out.

RealityMining is a dataset of smart-phone data logs from a group of about one hundred MIT students over the course of a year. The data includes communication and cell tower data, the latter being recorded every time a signal changes from one tower to the next. They also conducted a survey of perceived friendship, personal attributes, and a few other things. See the website for more information about what data is included. Note that the website lies: there is no lat/long location data. This is but the first sign that this is going to be a rough road.

In order to access the data yourself, you must fill out a form on the RealityMining website and wait for an email, but I’ve found their turnaround time to giving a download link is quite quick. The data comes in the form of a .mat file, “realitymining.mat”, which is a proprietary Matlab data format (warning sign number two, and largely the cause of all my problems).

Before we get started let me state my goals: to extract the person-to-person voice call data, actual proximity data (whether two subjects were connected to the same cell tower at the same time), and perceived friendship/proximity data. The resulting program is free to use and modify, and is posted on this blog’s Github page.

Loading realitymining.mat into Python

Once I found out that the Python scipy library has a function to load the Matlab data into Python, I was hopeful that things would be nice and user friendly. Boy was I wrong.

To load the matlab data into Python,

import scipy.io

matlab_filename = ...
matlab_obj = scipy.io.loadmat(matlab_filename)

Then you get a variable matlab_obj which, if you try to print it stalls the python process and you have to frantically hit the kill command to wrestle back control.

Going back a second time and being more cautious, we might inspect the type of the imported object:

&gt;&gt;&gt; type(matlab_obj)
&lt;class 'dict'&gt;

Okay so it’s a dictionary, probably why it tried to print out its entire 2GB contents when I printed it. What are it’s keys?

&gt;&gt;&gt; matlab_obj.keys()
dict_keys(['network', '__header__', '__version__', '__globals__', 's'])

So far so good (the only ones that matter are ‘network’ and ‘s’). So what is ‘s’?

&gt;&gt;&gt; type(matlab_obj['s'])
&lt;class 'numpy.ndarray'&gt;
&gt;&gt;&gt; len(matlab_obj['s'])
1

Okay it’s an array of length 1… What’s its first element?

&gt;&gt;&gt; type(matlab_obj['s'][0])
&lt;class 'numpy.ndarray'&gt;
&gt;&gt;&gt; len(matlab_obj['s'][0])
106

Okay so ‘s’ is an array of one element containing the array of actual data (I remember the study has 106 participants). Now let’s look at the first participant.

&gt;&gt;&gt; type(matlab_obj['s'][0][0])
&lt;class 'numpy.void'&gt;

Okay… what’s a numpy.void? Searching Google for “What’s a numpy.void” gives little, but combing through Stackoverflow answers gives the following:

From what I understand, void is sort of like a Python list since it can hold objects of different data types, which makes sense since the columns in a structured array can be different data types.

So basically, numpy arrays that aren’t homogeneous are ‘voids’ and you can’t tell any information about the values. Great. Well I suppose we could just try *printing* out the value, at the cost of potentially having to kill the process and restart.

&gt;&gt;&gt; matlab_obj['s'][0][0]
([], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [], [[]], [[0]], [], [[0]], [], [[0]], [], [], [], [], [], [], [], [], [], [], [], [], [[413785662906.0]], [], [], [[]], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [])

Okay… so the good news is we didn’t die, but the bad news is that there is almost no data, and there’s no way to tell what the data represents!

As a last straw let’s try printing out the second user.

&gt;&gt;&gt; matlab_obj['s'][0][1]
([[(array([[ 732338.86915509]]), array([[354]], dtype=uint16), array([[-1]], dtype=int16), array(['Packet Data'],
dtype='&amp;lt;U11'), array(['Outgoing'],
dtype='&amp;lt;U8'), array([[0]], dtype=uint8),
...
* a thousand or so lines later *
...
[2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [0, 0], [2, 0], [3, 0], [3, 0], [3, 0], [3, 0], [3, 0]])

So this one has data, but still we have the problem that we have no idea what the various pieces of data represent! We could guess and check, but come on we’re in the 21st century.

Hopelessness, and then Hope

At this point I gave up on a Python approach. As far as I could tell, the data was in a completely unusable format. My only recourse was to spend two weeks obtaining a copy of Matlab from my institution, spend an hour installing it, and then try to learn enough Matlab to parse out the subset of data I wanted into a universal (plaintext) format, so I could then read it back into Python and analyze it.

Two days after obtaining Matlab, I was referred to a blog post by Conrad Lee who provided the source code he used to parse the data in Python. His code was difficult to read and didn’t work out of the box, so I had to disassemble his solution and craft my own. This and a tutorial I later found, and enough elbow grease, were enough to help me figure things out.

The first big insight was that a numpy.void also acts like a dictionary! There’s no way to tell this programmatically, no keys() attribute on a numpy.void, so apparently it’s just folklore and bad documentation. To get the fields of a subject, we have to know its key (the RealityMining documentation tells you this, with enough typos to keep you guessing).

&gt;&gt;&gt; subjects = matlab_obj['s'][0]
&gt;&gt;&gt; subjects[77]['my_hashedNumber']
array([[78]], dtype=uint8)
&gt;&gt;&gt; subjects[76]['mac']
array([[ 6.19653599e+10]])
&gt;&gt;&gt; subjects[76]['mac'][0][0]
61965359909.0

The ‘mac’ and ‘my_hashedNumber’ are supposed to be identifiers for the subjects. It’s not clear why they have two such identifiers, but what’s worse is that not all participants in the study (not all subjects in the array of subjects, at least) have mac numbers or hashedNumbers. So in order to do anything I had to extract the subset of valid subjects.

def validSubjects(allSubjects):
   return [s for s in allSubjects if hasNumeric(s,'mac') and hasNumeric(s,'my_hashedNumber')]

I define the hasNumeric function in the next section. But before that I wanted to keep track of all the stupid ids and produce my own id (contiguous, starting from zero, for valid subjects only) that will replace me having to deal with the stupid matlab subject array. So I did.

def idDicts(subjects):
   return (dict((i, s) for (i,s) in enumerate(subjects)),
      dict((getNumeric(s,'mac'), (i, s)) for (i,s) in enumerate(subjects)),
      dict((getNumeric(s, 'my_hashedNumber'), (i, s)) for (i,s) in enumerate(subjects)))

Given a list of subjects (the valid ones), we create three dictionaries whose keys are the various identifiers, and whose values are the subject objects (and my own id). So we’ll pass around these dictionaries in place of the subject array.

Extracting Communication Events

The important data here is the set of person-to-person phone calls made over the course of the study. But before we get there we need to inspect how the data is stored more closely. Before I knew about what follows, I was getting a lot of confusing index errors.

Note that the numpy object that gets returned from scipy’s loadmat function stores single-type data differently from compound data, and what’s worse is that it represents a string (which in Python should be a single-type piece of data) as a compound piece of data!

So I wrote helper functions to keep track of checking/accessing these fields. I feel like such a dirty programmer writing code like what follows, but I can still blame everyone else.

Single-type data (usually numeric):

def hasNumeric(obj, field):
   try:
      obj[field][0][0]
      return True
   except:
      return False

def getNumeric(obj, field):
   return obj[field][0][0]

Compound-type data (usually Arrays, but also dicts and strings):

def hasArray(obj, field):
   try:
      obj[field][0]
      return True
   except:
      return False

def getArray(obj, field):
   return obj[field][0]

Note the difference is for single indexing (using [0]) or multiple (using [0][0]). It occurred to me much later that there is an option to loadmat that may get rid of some or all of this doubly-nested array crap, but by then I was too lazy to change everything; it would undoubtedly be the case that some things would break and other wouldn’t (and even if that’s not the case, you can’t blame me for assuming it). I will never understand why the scipy folks chose this representation as the default.

This issue crops up when inspecting a single comm event:

&gt;&gt;&gt; subjects[76]['comm'][0][0]
([[732238.1947106482]], [[0]], [[-1]], ['Voice call'], ['Outgoing'], [[16]], [[4]])

Again this record is a numpy.void with folklore keys, but notice how numeric values are doubly nested in lists while string values are not. Makes me want to puke.

Communication events can be “Incoming,” “Outgoing,” or “Missed,” although the last of these does not exist in the documentation. Not all fields exist in every record. Some are empty lists and some are double arrays containing nan, but you can check by hand that everyone that has a nonempty list of comm events has the needed fields as well:

&gt;&gt;&gt; len([x for x in subjects if len(x['comm']) &gt; 0 and len(x['comm'][0]) &gt; 0])
81
&gt;&gt;&gt; len([x for x in subjects if len(x['comm']) &gt; 0 and len(x['comm'][0]) &gt; 0 and
... all([hasArrayField(e, 'description') for e in x['comm'][0]])])
81

The exception is the duration field. If the duration field does not exist (if the field is an empty list), then it definitely corresponds to the “Missed” direction field.

It could otherwise be that the duration field is zero but the call was not labeled “Missed.” This might happen because they truncate the seconds, and the call lated less than one second (dubious) or because the call was also missed, or rejected without being missed. The documentation doesn’t say. Moreover, “Missed” doesn’t specify if the missed call was incoming or outgoing. So incorporating that data would ruin any chance of having a directed graph representation.

One possibility is that the outgoing calls with a duration of 0 are missed outgoing calls, and incoming calls always have positive duration, but actually there are examples of incoming calls of zero duration. Madness!

After sorting all this out (really, guessing), I was finally able to write a program that extracts the person-to-person call records in a specified date range.

First, to extract all communication events:

def allCommEvents(idDictionary):
   events = []
   for subjectId, subject in idDictionary.items():
      if hasArray(subject, 'comm'):
         events.extend([(subjectId, event) for event in getArray(subject, 'comm')])

   return events

Then to extract that subset of communication events that were actually phone calls between users within the study:

def callsWithinStudy(commEvents, hashNumDict):
   return [(subjectId, e) for (subjectId, e) in commEvents
           if (getArray(e, 'description') == &quot;Voice call&quot; and
              getNumeric(e, 'hashNum') in hashNumDict)]

Now we can convert the calls into a nicer format (a Python dictionary) and apply some business rules to deal with the ambiguities in the data.

def processCallEvents(callEvents, hashNumDict):
   processedCallEvents = []

   for subjectId, event in callEvents:
      direction = getArray(event, 'direction')
      duration = 0 if direction == 'Missed' else getNumeric(event, 'duration')
      date = convertDatetime(getNumeric(event, 'date'))
      hashNum = getNumeric(event, 'hashNum')
      otherPartyId = hashNumDict[hashNum][0]

      eventAsDict = {'subjectId': subjectId,
                      'direction': direction,
                      'duration': duration,
                      'otherPartyId': otherPartyId,
                      'date': date}
      processedCallEvents.append(eventAsDict)

   print(&quot;%d call event dictionaries&quot; % len(processedCallEvents))
   return processedCallEvents

All I needed to do was filter them by time now (implement the `convertDateTime` function used above)…

Converting datetimes from Matlab to Python

Matlab timestamps are floats like

834758.1231233

The integer part of the float represents the number of days since Jan 1, 0AD. The fractional part represents a fraction of a day, down to seconds I believe. To convert from the matlab Format to the Python format, use

def convertDatetime(dt):
   return datetime.fromordinal(int(dt)) + timedelta(days=dt%1) - timedelta(days=366) - timedelta(hours=5)

This is taken without permission from Conrad’s blog (I doubt he’ll care), and he documents his reasoning in a side-post.

This then allows one to filter a list of communication events by time:

def inRange(dateRange, timevalue):
   start, end = dateRange
   unixTime = int(time.mktime(timevalue.timetuple()))
   return start &lt;= unixTime &lt;= end

def filterByDate(dateRange, events):
   return [e for e in events if inRange(dateRange, e['date'])]

Phew! Glad I didn’t have to go through that hassle on my own.

Extracting Survey Data

The dataset includes a survey about friendship and proximity among the participants. That means we have data on

  1. Who considers each other in the same “close group of friends.”
  2. An estimate on how much time at work each person spends with each other person.
  3. An estimate on how much time outside work each person spends with each other person.

This information comes in the form of three lists. These lists sit inside a different numpy.void sitting in the original matlab_object from the beginning of the post.

matlab_obj = scipy.io.loadmat(matlab_filename)
matlab_obj['network'][0][0]

Now the access rules (getNumeric, getArray) we used for the matlab_obj[‘s’] matrix don’t necessarily apply to the values in the ‘network’ matrix. The friendship survey is accessed by

friends = network['friends']

and it’s an array, each element of which is a participant’s response to the survey (claimed to be stored as a 0-1 array). So, for example, to get the number corresponding to whether id1 considers id2 a close friend, you would use

friends = network['friends'][id1][id2]

This is all straightforward, but there are two complications (stupid trivialities, really). The first is that the matrix is not 0-1 valued, but (nan-1.0)-valued. That’s right, if the individual didn’t consider someone a friend, the record for that is numpy’s nan type, and if they did it’s the float 1.0. So you need to either replace all nan’s with zeros ahead of time, or take account for that in any computations you do with the data.

I don’t know whether this is caused by scipy’s bad Matlab data import or the people producing the data, but it’s a clear reason to not use proprietary file formats to distribute data.

The second problem is that the survey is stored as an array, where the index of the array corresponds to the person the respondent is evaluating for friendship. The problem with this is that not all people are in the survey (for whatever reason, people dropping out or just bad study design), so they include a separate array in the network object called “sub_sort” (a totally unhelpful name). Sub_sort contains in its $i$-th position the hashNum (of the user from the ‘s’ matrix) of “survey-user” $i$ (that is the user whose answer is represented in index $ i$ of the network array).

This is incredibly confusing, so let’s explain it with an example. Say that we wanted to parse user 4’s friendship considerations. We’d first access his friends list by

friends = network['friends'][4]

Then, to see whether he considers user 7 as a friend, we’d have to look up user 7 in the sub_sort array (which requries a search through said array) and then use the index of user 7 in that array to index the friends list appropriately. So

subsortArray = network['sub_sort'][0]
user7sIndex = subsortArray.indexof(7)
network['friends'][4][user7sIndex]

The reason this is such terrible design is that an array is the wrong data structure to use. They should have used a list of dictionaries or a dictionary with ordered pairs of hashNums for keys. Yet another reason that matlab (a language designed for numerical matrix manipulation) is an abysmal tool for the job.

The other two survey items are surveys on proximity data. Specifically, they ask the respondent to report how much time per day he or she spends within 10 feet of each other individual in the study.

inLabProximity = network['lab']
outLabProximity = network['outlab']

The values in these matrices are accessed in exactly the same way, with the values being in the set {nan, 1.0, 2.0, 3.0, 4.0, 5.0}. These numbers correspond to various time ranges (the documentation is a bit awkward about this too, using “at least” in place of complete time ranges), but fine. It’s no different from the friends matrix, so at least it’s consistent.

Cell Tower Data

Having learned all of the gotchas I did, I processed the cell tower data with little fuss. The only bothersome bit was that they provide two “parallel” arrays, one called “locs” containing pairs of (date, cell tower id), and a second called “loc_ids” containing numbers corresponding to a “second” set of simplified cell tower ids. These records correspond to times when a user entered into a region covered by a specific tower, and so it gives very (coarse location) data. It’s not the lat/long values promised by the website, and even though the ids for the cell towers are governed by a national standard, the lat/long values for the towers are not publicly or freely available in database form.

Why they need two sets of cell tower ids is beyond me, but the bothersome part is that they didn’t include the dates in the second array. So you must assume the two arrays are in sync (that the events correspond elementwise by date), and if you want to use the simplified ids you still have to access the “locs” array for the dates (because what use would the tower ids be without dates?).

One can access these fields by ignoring our established rules for accessing subject fields (so inconsistent!). And so to access the parallel arrays in a reasonable way, I used the following:

towerEvents = list(zip(subject['locs'], subject['loc_ids']))

Again the datetimes are in matlab format, so the date conversion function above is used to convert them to python datetimes.

Update: As I’ve continued to work with this data set, I’ve found out that, once again contrary to the documentation, the tower id 1 corresponds to being out of signal. The documentation claims that tower id 0 implies no signal, and this is true of the “locs” array, but for whatever reason they decided to change it to a 1 in the “loc_ids” array. What’s confusing is that there’s no mention of the connection. You have to guess it and check that the lengths of the corresponding lists of events are equal (which is by no means proof that the towers ids are the same, but good evidence).

The documentation is also unclear as to whether the (datestamp, tower id) pair represents the time when the user started using that tower, or the time when the user finished using that tower. As such, I must guess that it’s the former at the risk of producing bad data (and hence bad research).

Conclusions

I love free data sets. I really do, and I appreciate all the work that researchers put into making and providing their data. But all the time I encounter the most horrible, terrible, no good, very bad data sets! RealityMining is only one small example, and it was actually nice because I was able to figure things out eventually. There are other datasets that have frustrated me much more with inconsistencies and complete absence of documentation.

Is it really that hard to make data useable? Here are the reasons why this was such an annoying and frustrating task:

  1. The data was stored in a proprietary format. Please please please release your data in plaintext files!
  2. The documentation contradicted the data it was documenting, and was altogether unclear. (One contradiction is too often. This was egregious.)
  3. There was even inconsistency within the data. Sometimes missing values were empty lists, sometimes numpy.nan, and sometimes simply missing.
  4. Scipy has awful defaults for the loadmat function.
  5. Numpy data structures are apparently designed for opacity unless you know the folklore and spend hours poring over documentation.

It seems that most of this could be avoided by fixing 1 and 2, since then it’s a problem that an automated tool like Google Refine can fix (though I admit to never having tried Refine, because it crashed on the only dataset I’ve ever given it). I know scientists have more important things to worry about, and I could just blame it on the 2004 origin of the data set, but in the age of reproducibility and big data (and seeing that this data set is widely used in social network research), it’s just as important to release the data as it is to make the data fit for consumption. RealityMining is like a math paper with complicated terms defined only by example, and theorems stated without proof that are actually false. It doesn’t uphold the quality that science must, it makes me question the choices made by other researchers who have had to deal with the dataset, and it wastes everyone else’s time.

In this respect, I’ve decided to post my Python code in a github repository, as is usual on this blog. You’ll have to get the actual RealityMining dataset yourself if you want to use it, but once you have it you can run my program (it depends only on the numpy/scipy stack and the standard library; tested in Python 3.3.3) to get the data I’m interested in or modify it for your own purposes.

Moral of the story: For the sake of science, please don’t distribute a database as a Matlab matrix! Or anything else for that matter!

Categories as Types

In this post we’ll get a quick look at two ways to define a category as a type in ML. The first way will be completely trivial: we’ll just write it as a tuple of functions. The second will involve the terribly-named “functor” expression in ML, which allows one to give a bit more structure on data types.

The reader unfamiliar with the ML programming language should consult our earlier primer.

What Do We Want?

The first question to ask is “what do we want out of a category?” Recall from last time that a category is a class of objects, where each pair of objects is endowed with a set of morphisms. The morphisms were subject to a handful of conditions, which we paraphrase below:

  • Composition has to make sense when it’s defined.
  • Composition is associative.
  • Every object has an identity morphism.

What is the computational heart of such a beast? Obviously, we can’t explicitly represent the class of objects itself. Computers are finite (or at least countable) beings, after all. And in general we can’t even represent the set of all morphisms explicitly. Think of the category $ \mathbf{Set}$: if $ A, B$ are infinite, then there are infinitely many morphisms (set-functions) $ A \to B$.

A subtler and more important point is that everything in our representation of a category needs to be a type in ML (for indeed, how can a program run without being able to understand the types involved). This nudges us in an interesting direction: if an object is a type, and a morphism is a type, then we might reasonably expect a composition function:

compose: 'arrow * 'arrow -> 'arrow

We might allow this function to raise an exception if the two morphisms are uncomposable. But then, of course, we need a way to determine if morphisms are composable, and this requires us to know what the source and target functions are.

source: 'arrow -> 'object
target: 'arrow -> 'object

We can’t readily enforce composition to be associative at this stage (or any), because morphisms don’t have any other manipulable properties. We can’t reasonably expect to be able to compare morphisms for equality, as we aren’t able to do this in $ \mathbf{Set}$.

But we can enforce the final axiom: that every object has an identity morphism. This would come in the form of a function which accepts as input an object and produces a morphism:

identity: 'object -> 'arrow

But again, we can’t necessarily enforce that identity behaves as its supposed to.

Even so, this should be enough to define a category. That is, in ML, we have the datatype

datatype ('object, 'arrow)Category =
 category of ('arrow -> 'object) *
             ('arrow -> 'object) *
             ('object -> 'arrow) *
             ('arrow * 'arrow -> 'arrow)

Where we understand the first two functions to be source and target, and the third and fourth to be identity and compose, respectively.

Categories as Values

In order to see this type in action, we defined (and included in the source code archive for this post) a type for homogeneous sets. Since ML doesn’t support homogeneous lists natively, we’ll have to settle for a particular subcategory of $ \mathbf{Set}$. We’ll call it $ \mathbf{Set}_a$, the category whose objects are finite sets whose elements have type $ a$, and whose morphisms are again set-functions. Hence, we can define an object in this category as the ML type

'a Set

and an arrow as a datatype

datatype 'a SetMap = setMap of ('a Set) * ('a -> 'a) * ('a Set)

where the first object is the source and the second is the target (mirroring the notation $ A \to B$).

Now we must define the functions required for the data constructor of a Category. Source and target are trivial.

fun setSource(setMap(a, f, b)) = a
fun setTarget(setMap(a, f, b)) = b

And identity is equally natural.

fun setIdentity(aSet) = setMap(aSet, (fn x => x), aSet)

Finally, compose requires us to check for set equality of the relevant source and target, and compose the bundled set-maps. A few notational comments: the “o” operator does function composition in ML, and we defined “setEq” and the “uncomposable” exception elsewhere.

fun setCompose(setMap(b', g, c), setMap(a, f, b)) =
      if setEq(b, b') then
         setMap(a, (g o f), c)
      else
         raise uncomposable

Note that the second map in the composition order is the first argument to the function, as is standard in mathematical notation.

And now this category of finite sets is just the instantiation

val FiniteSets = category(setSource, setTarget, setIdentity, setCompose)

Oddly enough, this definition does not depend on the choice of a type for our sets! None of the functions we defined care about what the elements of sets are, so long as they are comparable for equality (in ML, such a type is denoted with two leading apostrophes: ”a).

This example is not very interesting, but we can build on it in two interesting ways. The first is to define the poset category of sets. Recall that the poset category of a set $ X$ has as objects subsets of $ X$ and there is a unique morphism from $ A \to B$ if and only if $ A \subset B$. There are two ways to model this in ML. The first is that we can assume the person constructing the morphisms is giving correct data (that is, only those objects which satisfy the subset relation). This would amount to a morphism just representing “nothing” except the pair $ (A,B)$. On the other hand, we may also require a function which verifies that the relation holds. That is, we could describe a morphism in this (or any) poset category as a datatype

datatype 'a PosetArrow = ltArrow of 'a * ('a -> 'a -> bool) * 'a

We still do need to assume the user is not creating multiple arrows with different functions (or we must tacitly call them all equal regardless of the function). We can do a check that the function actually returns true by providing a function for creating arrows in this particular poset category of a set with subsets.

exception invalidArrow
fun setPosetArrow(x)(y) = if subset(x)(y) then ltArrow(x, subset, y)
                          else raise invalidArrow

Then the functions defining a poset category are very similar to those of a set. The only real difference is in the identity and composition functions, and it is minor at that.

fun posetSource(ltArrow(x, f, y)) = x
fun posetTarget(ltArrow(x, f, y)) = y
fun posetIdentity(x) = ltArrow(x, (fn z => fn w => true), x)
fun posetCompose(ltArrow(b', g, c), ltArrow(a, f, b)) =
      if b = b' then
         ltArrow(a, (fn x => fn y => f(x)(b) andalso g(b)(y)), c)
      else
         raise uncomposable

We know that a set is always a subset of itself, so we can provide the constant true function in posetIdentity. In posetCompose, we assume things are transitive and provide the logical conjunction of the two verification functions for the pieces. Then the category as a value is

val Posets = category(posetSource, posetTarget, posetIdentity, posetCompose)

One will notice again that the homogeneous type of our sets is irrelevant. The poset category is parametric. To be completely clear, the type of the Poset category defined above is

('a Set, ('a Set)PosetArrow)Category

and so we can define a shortcut for this type.

type 'a PosetCategory = ('a Set, ('a Set)PosetArrow)Category

The only way we could make this more general is to pass the constructor for creating a particular kind of posetArrow (in our case, it just means fixing the choice of verification function, subset, for the more generic type constructor). We leave this as an exercise to the reader.

Our last example will return again to our abstract “diagram category.” Recall that if we fix an object $ A$ in a category $ \mathbf{C}$, the category $ \mathbf{C}_A$ has as objects the morphisms with fixed source $ A$,

vert-arrow

and the arrows are commutative diagrams

diagram-morphismLets fix $ A$ to be a set, and define a function to instantiate objects in this category:

fun makeArrow(fixedSet)(f)(target) = setMap(fixedSet, f, target)

That is, if I have a fixed set $ A$, I can call the function like so

val makeObject = makeArrow(fixedSet)

And use this function to create all of my objects. A morphism is then just a pair of set-maps with a connecting function. Note how similar this snippet of code looks structurally to the other morphism datatypes we’ve defined. This should reinforce the point that morphisms really aren’t any different in this abstract category!

datatype 'a TriangleDiagram = 
   triangle of ('a SetMap) * ('a -> 'a) * ('a SetMap)

And the corresponding functions and category instantiation are

fun triangleSource(triangle(x, f, y)) = x
fun triangleTarget(triangle(x, f, y)) = y
fun triangleIdentity(x) = triangle(x, (fn z => z), x)
fun triangleCompose(triangle(b', g, c), triangle(a, f, b)) =
      if setEq(setTarget(b'))(setTarget(b)) then
         triangle(a, (g o f), c)
      else
         raise uncomposable

val SetDiagrams =
   category(triangleSource, triangleTarget, triangleIdentity, triangleCompose)

Notice how we cannot compare the objects in this category for equality! Indeed, two functions cannot be compared in ML, and the best we can do is compare the targets of these functions to ensure that the connecting morphisms can be composed. A malicious programmer might then be able to compose uncomposable morphisms, a devious and startling proposition.

Notice further how we can’t avoid using set-specific functions with our current definition of a category. What we could do instead is require a category to have a function which checks for composability. Then compose could be written in a more homogeneous (but not quite completely parametric fashion). The reader is welcome to try this on their own, but we’ll soon have a more organized way to do this later, when we package up a category into an ML structure.

Categories as Signatures

Let’s look at a more advanced feature of ML and see how we can apply it to representing categories in code. The idea is familiar to most Java and C++ programmers, and that is of an interface: the programmer specifies an abstract type which has specific functions attached to it, but does not implement those functions. Then some (perhaps different) programmer implements the interface by defining those functions for a concrete type.

For the reader who isn’t so interested in these features of ML, feel free to skip this section. We won’t be introducing any new examples of categories, just rephrasing what we’ve done above in a different way. It is also highly debatable whether this new way is actually any better (it’s certainly longer).

The corresponding concept in ML is called a signature (interface) and a structure (concrete instance). The only difference is that a structure is removed one step further from ML’s type system. This is in a sense necessary: a structure should be able to implement the requirements for many different signatures, and a signature should be able to have many different structures implement it. So there is no strict hierarchy of types to lean on (it’s just a many-to-many relationship).

On the other hand, not having any way to think of a structure as a type at all would be completely useless. The whole point is to be able to write function which manipulate a structure (concrete instance) by abstractly accessing the signature’s (interface’s) functions regardless of the particular implementation details. And so ML adds in another layer of indirection: the functor (don’t confuse this with categorical functors, which we haven’t talked abou yet). A functor in ML is a procedure which accepts as input a structure and produces as output another structure.

Let’s see this on a simple example. We can define a structure for “things that have magnitudes,” which as a signature would be

signature MAG_OBJ =
sig
   type object
   val mag : object -> int
end

This introduced a few new language forms: the “signature” keyword is like “fun” or “val,” it just declares the purpose of the statement as a whole. The “sig … end” expression is what actually creates the signature, and the contents are a list of type definitions, datatype definitions, and value/function type definitions. The reader should think of this as a named “type” for structures, analogous to a C struct.

To implement a signature in ML is to “ascribe” it. That is, we can create a structure that defines the functions laid out in a signature (and perhaps more). Here is an example of a structure for a mag object of integers:

structure IntMag : MAG_OBJ = 
struct
   type object = int
   fun mag(x) = if x >= 0 then x else ~x
end

The colon is a type ascription, and at the compiler-level it goes through the types and functions defined in this structure and attempts to match them with the required ones from the MAG_OBJ signature. If it fails, it raises a compiler error. A more detailed description of this process can be found here. We can then use the structure as follows:

local open IntMag in
   val y = ~7
   val z = mag(y)
end

The open keyword binds all of the functions and types to the current environment (and so it’s naturally a bad idea to open structures in the global environment). The “local” construction is made to work specifically with open.

In any case, the important part about signatures and structures is that one can write functions which accept as input structures with a given signature and produce structures with other signatures. The name for this is a “functor,” but we will usually call it an “ML functor” to clarify that it is not a categorical functor (if you don’t know what a categorical functor is, don’t worry yet).

For example, here is a signature for a “distance object”

signature DIST_OBJ =
sig
   type object
   val dist: object * object -> int
end

And a functor which accepts as input a MAG_OBJ and creates a DIST_OBJ in a contrived and silly way.

functor MakeDistObj(structure MAG: MAG_OBJ) =
struct
   type object = MAG.object
   val dist = fn (x, y) =>
               let
                  val v = MAG.mag(x) - MAG.mag(y)
               in
                  if v > 0 then v else ~v
               end
end

Here the argument to the functor is a structure called MAG, and we need to specify which type is to be ascribed to it; in this case, it’s a MAG_OBJ. Only things within the specified signature can be used within the struct expression that follows (if one needs a functor to accept as input something which has multiple type ascriptions, one can create a new signature that “inherits” from both). Then the right hand side of the functor is just a new structure definition, where the innards are allowed to refer to the properties of the given MAG_OBJ.

We could in turn define a functor which accepts a DIST_OBJ as input and produces as output another structure, and compose the two functors. This is started to sound a bit category-theoretical, but we want to remind the reader that a “functor” in category theory is quite different from a “functor” in ML (in particular, we have yet to mention the former, and the latter is just a mechanism to parameterize and modularize code).

In any case, we can do the same thing with a category. A general category will be represented by a signature, and a particular instance of a category is a structure with the implemented pieces.

signature CATEGORY =
sig
   type object
   type arrow
   exception uncomposable

   val source: arrow -> object
   val target: arrow -> object
   val identity: object -> arrow
   val composable: arrow * arrow -> bool
   val compose: arrow * arrow -> arrow
end

As it turns out, ML has an implementation of “sets” already, and it uses signatures. So instead of using our rinky-dink implementation of sets from earlier in this post, we can implement an instance of the ORD_KEY signature, and then call one of the many set-creating functors available to us. We’ll use an implementation based on red-black trees for now.

First the key:

structure OrderedInt : ORD_KEY =
struct
   type ord_key = int
   val compare = Int.compare
end

then the functor:

functor MakeSetCategory(structure ELT: ORD_KEY) =
struct
   structure Set:ORD_SET = RedBlackSetFn(ELT)
   type elt = ELT.ord_key
   type object = Set.set
   datatype arrow = function of object * (elt -> elt) * object

   exception uncomposable

   fun source(function(a, _, _)) = a
   fun target(function(_, _, b)) = b
   fun identity(a) = function(a, fn x => x, a)
   fun composable(a,b) = Set.equal(a,b)
   fun compose(function(b2, g, c), function(a, f, b1)) =
      if composable(b1, b2) then
         function(a, (g o f), c)
      else
         raise uncomposable
   fun apply(function(a, f, b), x) = f(x)
end

The first few lines are the most confusing; the rest is exactly what we’ve seen from our first go-round of defining the category of sets. In particular, we call the RedBlackSetFn functor given the ELT structure, and it produces an implementation of sets which we name Set (and superfluously ascribe the type ORD_SET for clarity).

Then we define the “elt” type which is used to describe an arrow, the object type as the main type of an ORD_SET (see in this documentation that this is the only new type available in that signature), and the arrow type as we did earlier.


We can then define the category of sets with a given type as follows

structure IntSets = MakeSetCategory(structure ELT = OrderedInt)

The main drawback of this approach is that there’s so much upkeep! Every time we want to make a new kind of category, we need to define a new functor, and every time we want to make a new kind of category of sets, we need to implement a new kind of ORD_KEY. All of this signature and structure business can be quite confusing; it often seems like the compiler doesn’t want to cooperate when it should.

Nothing Too Shocking So Far

To be honest, we haven’t really done anything very interesting so far. We’ve seen a single definition (of a category), looked at a number of examples to gain intuition, and given a flavor of how these things will turn into code.

Before we finish, let’s review the pros and cons of a computational representation of category-theoretical concepts.

Pros:

  • We can prove results by explicit construction (more on this next time).
  • Different-looking categories are structurally similar in code.
  • We can faithfully represent the idea of using (objects and morphisms of) categories as parameters to construct other categories.
  • Writing things in code gives a fuller understanding of how they work.

Cons:

  • All computations are finite, requiring us to think much harder than a mathematician about the definition of an object.
  • The type system is too weak. we can’t enforce the axioms of a category directly or even ensure the functions act sensibly at all. As such, the programmer becomes responsible for any failure that occur from bad definitions.
  • The type system is too strong. when we work concretely we are forced to work within homogeneous categories (e.g. of ‘a Set).
  • We cannot ensure the ability to check equality on objects. This showed up in our example of diagram categories.
  • The functions used in defining morphisms, e.g. in the category of sets, are somewhat removed from real set-functions. For example, nothing about category theory requires functions to be computable. Moreover, nothing about our implementation requires the functions to have any outputs at all (they may loop infinitely!). Moreover, it is not possible to ensure that any given function terminates on a given input set (this is the Halting problem).

This list looks quite imbalanced, but one might argue that the cons are relatively minor compared to the pros. In particular (and this is what this author hopes to be the case), being able to explicitly construct proofs to theorems in category theory will give one a much deeper understanding both of category theory and of programming. This is the ultimate prize.

Next time we will begin our investigation of universal properties (where the real fun starts), and we’ll use the programmatic constructions we laid out in this post to give constructive proofs that various objects are universal.

Until then!