# RealityMining, a Case Study in the Woes of Data Processing

This post is intended to be a tutorial on how to access the RealityMining dataset using Python (because who likes Matlab?), and a rant on how annoying the process was to figure out.

RealityMining is a dataset of smart-phone data logs from a group of about one hundred MIT students over the course of a year. The data includes communication and cell tower data, the latter being recorded every time a signal changes from one tower to the next. They also conducted a survey of perceived friendship, personal attributes, and a few other things. See the website for more information about what data is included. Note that the website lies: there is no lat/long location data. This is but the first sign that this is going to be a rough road.

In order to access the data yourself, you must fill out a form on the RealityMining website and wait for an email, but I’ve found their turnaround time to giving a download link is quite quick. The data comes in the form of a .mat file, “realitymining.mat”, which is a proprietary Matlab data format (warning sign number two, and largely the cause of all my problems).

Before we get started let me state my goals: to extract the person-to-person voice call data, actual proximity data (whether two subjects were connected to the same cell tower at the same time), and perceived friendship/proximity data. The resulting program is free to use and modify, and is posted on this blog’s Github page.

Once I found out that the Python scipy library has a function to load the Matlab data into Python, I was hopeful that things would be nice and user friendly. Boy was I wrong.

To load the matlab data into Python,

import scipy.io

matlab_filename = ...


Then you get a variable matlab_obj which, if you try to print it stalls the python process and you have to frantically hit the kill command to wrestle back control.

Going back a second time and being more cautious, we might inspect the type of the imported object:

>>> type(matlab_obj)
<class 'dict'>


Okay so it’s a dictionary, probably why it tried to print out its entire 2GB contents when I printed it. What are it’s keys?

>>> matlab_obj.keys()
dict_keys(['network', '__header__', '__version__', '__globals__', 's'])


So far so good (the only ones that matter are ‘network’ and ‘s’). So what is ‘s’?

>>> type(matlab_obj['s'])
<class 'numpy.ndarray'>
>>> len(matlab_obj['s'])
1


Okay it’s an array of length 1… What’s its first element?

>>> type(matlab_obj['s'][0])
<class 'numpy.ndarray'>
>>> len(matlab_obj['s'][0])
106


Okay so ‘s’ is an array of one element containing the array of actual data (I remember the study has 106 participants). Now let’s look at the first participant.

>>> type(matlab_obj['s'][0][0])
<class 'numpy.void'>


Okay… what’s a numpy.void? Searching Google for “What’s a numpy.void” gives little, but combing through Stackoverflow answers gives the following:

From what I understand, void is sort of like a Python list since it can hold objects of different data types, which makes sense since the columns in a structured array can be different data types.

So basically, numpy arrays that aren’t homogeneous are ‘voids’ and you can’t tell any information about the values. Great. Well I suppose we could just try *printing* out the value, at the cost of potentially having to kill the process and restart.

>>> matlab_obj['s'][0][0]
([], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [], [[]], [[0]], [], [[0]], [], [[0]], [], [], [], [], [], [], [], [], [], [], [], [], [[413785662906.0]], [], [], [[]], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [])


Okay… so the good news is we didn’t die, but the bad news is that there is almost no data, and there’s no way to tell what the data represents!

As a last straw let’s try printing out the second user.

>>> matlab_obj['s'][0][1]
([[(array([[ 732338.86915509]]), array([[354]], dtype=uint16), array([[-1]], dtype=int16), array(['Packet Data'],
dtype='&lt;U11'), array(['Outgoing'],
dtype='&lt;U8'), array([[0]], dtype=uint8),
...
* a thousand or so lines later *
...
[2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [0, 0], [2, 0], [3, 0], [3, 0], [3, 0], [3, 0], [3, 0]])


So this one has data, but still we have the problem that we have no idea what the various pieces of data represent! We could guess and check, but come on we’re in the 21st century.

## Hopelessness, and then Hope

At this point I gave up on a Python approach. As far as I could tell, the data was in a completely unusable format. My only recourse was to spend two weeks obtaining a copy of Matlab from my institution, spend an hour installing it, and then try to learn enough Matlab to parse out the subset of data I wanted into a universal (plaintext) format, so I could then read it back into Python and analyze it.

Two days after obtaining Matlab, I was referred to a blog post by Conrad Lee who provided the source code he used to parse the data in Python. His code was difficult to read and didn’t work out of the box, so I had to disassemble his solution and craft my own. This and a tutorial I later found, and enough elbow grease, were enough to help me figure things out.

The first big insight was that a numpy.void also acts like a dictionary! There’s no way to tell this programmatically, no keys() attribute on a numpy.void, so apparently it’s just folklore and bad documentation. To get the fields of a subject, we have to know its key (the RealityMining documentation tells you this, with enough typos to keep you guessing).

>>> subjects = matlab_obj['s'][0]
>>> subjects[77]['my_hashedNumber']
array([[78]], dtype=uint8)
>>> subjects[76]['mac']
array([[ 6.19653599e+10]])
>>> subjects[76]['mac'][0][0]
61965359909.0


The ‘mac’ and ‘my_hashedNumber’ are supposed to be identifiers for the subjects. It’s not clear why they have two such identifiers, but what’s worse is that not all participants in the study (not all subjects in the array of subjects, at least) have mac numbers or hashedNumbers. So in order to do anything I had to extract the subset of valid subjects.

def validSubjects(allSubjects):
return [s for s in allSubjects if hasNumeric(s,'mac') and hasNumeric(s,'my_hashedNumber')]


I define the hasNumeric function in the next section. But before that I wanted to keep track of all the stupid ids and produce my own id (contiguous, starting from zero, for valid subjects only) that will replace me having to deal with the stupid matlab subject array. So I did.

def idDicts(subjects):
return (dict((i, s) for (i,s) in enumerate(subjects)),
dict((getNumeric(s,'mac'), (i, s)) for (i,s) in enumerate(subjects)),
dict((getNumeric(s, 'my_hashedNumber'), (i, s)) for (i,s) in enumerate(subjects)))


Given a list of subjects (the valid ones), we create three dictionaries whose keys are the various identifiers, and whose values are the subject objects (and my own id). So we’ll pass around these dictionaries in place of the subject array.

## Extracting Communication Events

The important data here is the set of person-to-person phone calls made over the course of the study. But before we get there we need to inspect how the data is stored more closely. Before I knew about what follows, I was getting a lot of confusing index errors.

Note that the numpy object that gets returned from scipy’s loadmat function stores single-type data differently from compound data, and what’s worse is that it represents a string (which in Python should be a single-type piece of data) as a compound piece of data!

So I wrote helper functions to keep track of checking/accessing these fields. I feel like such a dirty programmer writing code like what follows, but I can still blame everyone else.

Single-type data (usually numeric):

def hasNumeric(obj, field):
try:
obj[field][0][0]
return True
except:
return False

def getNumeric(obj, field):
return obj[field][0][0]


Compound-type data (usually Arrays, but also dicts and strings):

def hasArray(obj, field):
try:
obj[field][0]
return True
except:
return False

def getArray(obj, field):
return obj[field][0]


Note the difference is for single indexing (using [0]) or multiple (using [0][0]). It occurred to me much later that there is an option to loadmat that may get rid of some or all of this doubly-nested array crap, but by then I was too lazy to change everything; it would undoubtedly be the case that some things would break and other wouldn’t (and even if that’s not the case, you can’t blame me for assuming it). I will never understand why the scipy folks chose this representation as the default.

This issue crops up when inspecting a single comm event:

>>> subjects[76]['comm'][0][0]
([[732238.1947106482]], [[0]], [[-1]], ['Voice call'], ['Outgoing'], [[16]], [[4]])


Again this record is a numpy.void with folklore keys, but notice how numeric values are doubly nested in lists while string values are not. Makes me want to puke.

Communication events can be “Incoming,” “Outgoing,” or “Missed,” although the last of these does not exist in the documentation. Not all fields exist in every record. Some are empty lists and some are double arrays containing nan, but you can check by hand that everyone that has a nonempty list of comm events has the needed fields as well:

>>> len([x for x in subjects if len(x['comm']) > 0 and len(x['comm'][0]) > 0])
81
>>> len([x for x in subjects if len(x['comm']) > 0 and len(x['comm'][0]) > 0 and
... all([hasArrayField(e, 'description') for e in x['comm'][0]])])
81


The exception is the duration field. If the duration field does not exist (if the field is an empty list), then it definitely corresponds to the “Missed” direction field.

It could otherwise be that the duration field is zero but the call was not labeled “Missed.” This might happen because they truncate the seconds, and the call lated less than one second (dubious) or because the call was also missed, or rejected without being missed. The documentation doesn’t say. Moreover, “Missed” doesn’t specify if the missed call was incoming or outgoing. So incorporating that data would ruin any chance of having a directed graph representation.

One possibility is that the outgoing calls with a duration of 0 are missed outgoing calls, and incoming calls always have positive duration, but actually there are examples of incoming calls of zero duration. Madness!

After sorting all this out (really, guessing), I was finally able to write a program that extracts the person-to-person call records in a specified date range.

First, to extract all communication events:

def allCommEvents(idDictionary):
events = []
for subjectId, subject in idDictionary.items():
if hasArray(subject, 'comm'):
events.extend([(subjectId, event) for event in getArray(subject, 'comm')])

return events


Then to extract that subset of communication events that were actually phone calls between users within the study:

def callsWithinStudy(commEvents, hashNumDict):
return [(subjectId, e) for (subjectId, e) in commEvents
if (getArray(e, 'description') == "Voice call" and
getNumeric(e, 'hashNum') in hashNumDict)]


Now we can convert the calls into a nicer format (a Python dictionary) and apply some business rules to deal with the ambiguities in the data.

def processCallEvents(callEvents, hashNumDict):
processedCallEvents = []

for subjectId, event in callEvents:
direction = getArray(event, 'direction')
duration = 0 if direction == 'Missed' else getNumeric(event, 'duration')
date = convertDatetime(getNumeric(event, 'date'))
hashNum = getNumeric(event, 'hashNum')
otherPartyId = hashNumDict[hashNum][0]

eventAsDict = {'subjectId': subjectId,
'direction': direction,
'duration': duration,
'otherPartyId': otherPartyId,
'date': date}
processedCallEvents.append(eventAsDict)

print("%d call event dictionaries" % len(processedCallEvents))
return processedCallEvents


All I needed to do was filter them by time now (implement the convertDateTime function used above)…

## Converting datetimes from Matlab to Python

Matlab timestamps are floats like

834758.1231233

The integer part of the float represents the number of days since Jan 1, 0AD. The fractional part represents a fraction of a day, down to seconds I believe. To convert from the matlab Format to the Python format, use

def convertDatetime(dt):
return datetime.fromordinal(int(dt)) + timedelta(days=dt%1) - timedelta(days=366) - timedelta(hours=5)


This is taken without permission from Conrad’s blog (I doubt he’ll care), and he documents his reasoning in a side-post.

This then allows one to filter a list of communication events by time:

def inRange(dateRange, timevalue):
start, end = dateRange
unixTime = int(time.mktime(timevalue.timetuple()))
return start <= unixTime <= end

def filterByDate(dateRange, events):
return [e for e in events if inRange(dateRange, e['date'])]


Phew! Glad I didn’t have to go through that hassle on my own.

## Extracting Survey Data

The dataset includes a survey about friendship and proximity among the participants. That means we have data on

1. Who considers each other in the same “close group of friends.”
2. An estimate on how much time at work each person spends with each other person.
3. An estimate on how much time outside work each person spends with each other person.

This information comes in the form of three lists. These lists sit inside a different numpy.void sitting in the original matlab_object from the beginning of the post.

matlab_obj = scipy.io.loadmat(matlab_filename)
matlab_obj['network'][0][0]


Now the access rules (getNumeric, getArray) we used for the matlab_obj['s'] matrix don’t necessarily apply to the values in the ‘network’ matrix. The friendship survey is accessed by

friends = network['friends']


and it’s an array, each element of which is a participant’s response to the survey (claimed to be stored as a 0-1 array). So, for example, to get the number corresponding to whether id1 considers id2 a close friend, you would use

friends = network['friends'][id1][id2]


This is all straightforward, but there are two complications (stupid trivialities, really). The first is that the matrix is not 0-1 valued, but (nan-1.0)-valued. That’s right, if the individual didn’t consider someone a friend, the record for that is numpy’s nan type, and if they did it’s the float 1.0. So you need to either replace all nan’s with zeros ahead of time, or take account for that in any computations you do with the data.

I don’t know whether this is caused by scipy’s bad Matlab data import or the people producing the data, but it’s a clear reason to not use proprietary file formats to distribute data.

The second problem is that the survey is stored as an array, where the index of the array corresponds to the person the respondent is evaluating for friendship. The problem with this is that not all people are in the survey (for whatever reason, people dropping out or just bad study design), so they include a separate array in the network object called “sub_sort” (a totally unhelpful name). Sub_sort contains in its $i$-th position the hashNum (of the user from the ‘s’ matrix) of “survey-user” $i$ (that is the user whose answer is represented in index $i$ of the network array).

This is incredibly confusing, so let’s explain it with an example. Say that we wanted to parse user 4’s friendship considerations. We’d first access his friends list by

friends = network['friends'][4]


Then, to see whether he considers user 7 as a friend, we’d have to look up user 7 in the sub_sort array (which requries a search through said array) and then use the index of user 7 in that array to index the friends list appropriately. So

subsortArray = network['sub_sort'][0]
user7sIndex = subsortArray.indexof(7)
network['friends'][4][user7sIndex]


The reason this is such terrible design is that an array is the wrong data structure to use. They should have used a list of dictionaries or a dictionary with ordered pairs of hashNums for keys. Yet another reason that matlab (a language designed for numerical matrix manipulation) is an abysmal tool for the job.

The other two survey items are surveys on proximity data. Specifically, they ask the respondent to report how much time per day he or she spends within 10 feet of each other individual in the study.

inLabProximity = network['lab']
outLabProximity = network['outlab']


The values in these matrices are accessed in exactly the same way, with the values being in the set {nan, 1.0, 2.0, 3.0, 4.0, 5.0}. These numbers correspond to various time ranges (the documentation is a bit awkward about this too, using “at least” in place of complete time ranges), but fine. It’s no different from the friends matrix, so at least it’s consistent.

## Cell Tower Data

Having learned all of the gotchas I did, I processed the cell tower data with little fuss. The only bothersome bit was that they provide two “parallel” arrays, one called “locs” containing pairs of (date, cell tower id), and a second called “loc_ids” containing numbers corresponding to a “second” set of simplified cell tower ids. These records correspond to times when a user entered into a region covered by a specific tower, and so it gives very (coarse location) data. It’s not the lat/long values promised by the website, and even though the ids for the cell towers are governed by a national standard, the lat/long values for the towers are not publicly or freely available in database form.

Why they need two sets of cell tower ids is beyond me, but the bothersome part is that they didn’t include the dates in the second array. So you must assume the two arrays are in sync (that the events correspond elementwise by date), and if you want to use the simplified ids you still have to access the “locs” array for the dates (because what use would the tower ids be without dates?).

One can access these fields by ignoring our established rules for accessing subject fields (so inconsistent!). And so to access the parallel arrays in a reasonable way, I used the following:

towerEvents = list(zip(subject['locs'], subject['loc_ids']))


Again the datetimes are in matlab format, so the date conversion function above is used to convert them to python datetimes.

Update: As I’ve continued to work with this data set, I’ve found out that, once again contrary to the documentation, the tower id 1 corresponds to being out of signal. The documentation claims that tower id 0 implies no signal, and this is true of the “locs” array, but for whatever reason they decided to change it to a 1 in the “loc_ids” array. What’s confusing is that there’s no mention of the connection. You have to guess it and check that the lengths of the corresponding lists of events are equal (which is by no means proof that the towers ids are the same, but good evidence).

The documentation is also unclear as to whether the (datestamp, tower id) pair represents the time when the user started using that tower, or the time when the user finished using that tower. As such, I must guess that it’s the former at the risk of producing bad data (and hence bad research).

## Conclusions

I love free data sets. I really do, and I appreciate all the work that researchers put into making and providing their data. But all the time I encounter the most horrible, terrible, no good, very bad data sets! RealityMining is only one small example, and it was actually nice because I was able to figure things out eventually. There are other datasets that have frustrated me much more with inconsistencies and complete absence of documentation.

Is it really that hard to make data useable? Here are the reasons why this was such an annoying and frustrating task:

1. The data was stored in a proprietary format. Please please please release your data in plaintext files!
2. The documentation contradicted the data it was documenting, and was altogether unclear. (One contradiction is too often. This was egregious.)
3. There was even inconsistency within the data. Sometimes missing values were empty lists, sometimes numpy.nan, and sometimes simply missing.
4. Scipy has awful defaults for the loadmat function.
5. Numpy data structures are apparently designed for opacity unless you know the folklore and spend hours poring over documentation.

It seems that most of this could be avoided by fixing 1 and 2, since then it’s a problem that an automated tool like Google Refine can fix (though I admit to never having tried Refine, because it crashed on the only dataset I’ve ever given it). I know scientists have more important things to worry about, and I could just blame it on the 2004 origin of the data set, but in the age of reproducibility and big data (and seeing that this data set is widely used in social network research), it’s just as important to release the data as it is to make the data fit for consumption. RealityMining is like a math paper with complicated terms defined only by example, and theorems stated without proof that are actually false. It doesn’t uphold the quality that science must, it makes me question the choices made by other researchers who have had to deal with the dataset, and it wastes everyone else’s time.

In this respect, I’ve decided to post my Python code in a github repository, as is usual on this blog. You’ll have to get the actual RealityMining dataset yourself if you want to use it, but once you have it you can run my program (it depends only on the numpy/scipy stack and the standard library; tested in Python 3.3.3) to get the data I’m interested in or modify it for your own purposes.

Moral of the story: For the sake of science, please don’t distribute a database as a Matlab matrix! Or anything else for that matter!

# Categories as Types

In this post we’ll get a quick look at two ways to define a category as a type in ML. The first way will be completely trivial: we’ll just write it as a tuple of functions. The second will involve the terribly-named “functor” expression in ML, which allows one to give a bit more structure on data types.

The reader unfamiliar with the ML programming language should consult our earlier primer.

## What Do We Want?

The first question to ask is “what do we want out of a category?” Recall from last time that a category is a class of objects, where each pair of objects is endowed with a set of morphisms. The morphisms were subject to a handful of conditions, which we paraphrase below:

• Composition has to make sense when it’s defined.
• Composition is associative.
• Every object has an identity morphism.

What is the computational heart of such a beast? Obviously, we can’t explicitly represent the class of objects itself. Computers are finite (or at least countable) beings, after all. And in general we can’t even represent the set of all morphisms explicitly. Think of the category $\mathbf{Set}$: if $A, B$ are infinite, then there are infinitely many morphisms (set-functions) $A \to B$.

A subtler and more important point is that everything in our representation of a category needs to be a type in ML (for indeed, how can a program run without being able to understand the types involved). This nudges us in an interesting direction: if an object is a type, and a morphism is a type, then we might reasonably expect a composition function:

compose: 'arrow * 'arrow -> 'arrow

We might allow this function to raise an exception if the two morphisms are uncomposable. But then, of course, we need a way to determine if morphisms are composable, and this requires us to know what the source and target functions are.

source: 'arrow -> 'object
target: 'arrow -> 'object

We can’t readily enforce composition to be associative at this stage (or any), because morphisms don’t have any other manipulable properties. We can’t reasonably expect to be able to compare morphisms for equality, as we aren’t able to do this in $\mathbf{Set}$.

But we can enforce the final axiom: that every object has an identity morphism. This would come in the form of a function which accepts as input an object and produces a morphism:

identity: 'object -> 'arrow

But again, we can’t necessarily enforce that identity behaves as its supposed to.

Even so, this should be enough to define a category. That is, in ML, we have the datatype

datatype ('object, 'arrow)Category =
category of ('arrow -> 'object) *
('arrow -> 'object) *
('object -> 'arrow) *
('arrow * 'arrow -> 'arrow)

Where we understand the first two functions to be source and target, and the third and fourth to be identity and compose, respectively.

## Categories as Values

In order to see this type in action, we defined (and included in the source code archive for this post) a type for homogeneous sets. Since ML doesn’t support homogeneous lists natively, we’ll have to settle for a particular subcategory of $\mathbf{Set}$. We’ll call it $\mathbf{Set}_a$, the category whose objects are finite sets whose elements have type $a$, and whose morphisms are again set-functions. Hence, we can define an object in this category as the ML type

'a Set

and an arrow as a datatype

datatype 'a SetMap = setMap of ('a Set) * ('a -> 'a) * ('a Set)

where the first object is the source and the second is the target (mirroring the notation $A \to B$).

Now we must define the functions required for the data constructor of a Category. Source and target are trivial.

fun setSource(setMap(a, f, b)) = a
fun setTarget(setMap(a, f, b)) = b

And identity is equally natural.

fun setIdentity(aSet) = setMap(aSet, (fn x => x), aSet)

Finally, compose requires us to check for set equality of the relevant source and target, and compose the bundled set-maps. A few notational comments: the “o” operator does function composition in ML, and we defined “setEq” and the “uncomposable” exception elsewhere.

fun setCompose(setMap(b', g, c), setMap(a, f, b)) =
if setEq(b, b') then
setMap(a, (g o f), c)
else
raise uncomposable

Note that the second map in the composition order is the first argument to the function, as is standard in mathematical notation.

And now this category of finite sets is just the instantiation

val FiniteSets = category(setSource, setTarget, setIdentity, setCompose)

Oddly enough, this definition does not depend on the choice of a type for our sets! None of the functions we defined care about what the elements of sets are, so long as they are comparable for equality (in ML, such a type is denoted with two leading apostrophes: ”a).

This example is not very interesting, but we can build on it in two interesting ways. The first is to define the poset category of sets. Recall that the poset category of a set $X$ has as objects subsets of $X$ and there is a unique morphism from $A \to B$ if and only if $A \subset B$. There are two ways to model this in ML. The first is that we can assume the person constructing the morphisms is giving correct data (that is, only those objects which satisfy the subset relation). This would amount to a morphism just representing “nothing” except the pair $(A,B)$. On the other hand, we may also require a function which verifies that the relation holds. That is, we could describe a morphism in this (or any) poset category as a datatype

datatype 'a PosetArrow = ltArrow of 'a * ('a -> 'a -> bool) * 'a

We still do need to assume the user is not creating multiple arrows with different functions (or we must tacitly call them all equal regardless of the function). We can do a check that the function actually returns true by providing a function for creating arrows in this particular poset category of a set with subsets.

exception invalidArrow
fun setPosetArrow(x)(y) = if subset(x)(y) then ltArrow(x, subset, y)
else raise invalidArrow

Then the functions defining a poset category are very similar to those of a set. The only real difference is in the identity and composition functions, and it is minor at that.

fun posetSource(ltArrow(x, f, y)) = x
fun posetTarget(ltArrow(x, f, y)) = y
fun posetIdentity(x) = ltArrow(x, (fn z => fn w => true), x)
fun posetCompose(ltArrow(b', g, c), ltArrow(a, f, b)) =
if b = b' then
ltArrow(a, (fn x => fn y => f(x)(b) andalso g(b)(y)), c)
else
raise uncomposable

We know that a set is always a subset of itself, so we can provide the constant true function in posetIdentity. In posetCompose, we assume things are transitive and provide the logical conjunction of the two verification functions for the pieces. Then the category as a value is

val Posets = category(posetSource, posetTarget, posetIdentity, posetCompose)

One will notice again that the homogeneous type of our sets is irrelevant. The poset category is parametric. To be completely clear, the type of the Poset category defined above is

('a Set, ('a Set)PosetArrow)Category

and so we can define a shortcut for this type.

type 'a PosetCategory = ('a Set, ('a Set)PosetArrow)Category

The only way we could make this more general is to pass the constructor for creating a particular kind of posetArrow (in our case, it just means fixing the choice of verification function, subset, for the more generic type constructor). We leave this as an exercise to the reader.

Our last example will return again to our abstract “diagram category.” Recall that if we fix an object $A$ in a category $\mathbf{C}$, the category $\mathbf{C}_A$ has as objects the morphisms with fixed source $A$,

and the arrows are commutative diagrams

Lets fix $A$ to be a set, and define a function to instantiate objects in this category:

fun makeArrow(fixedSet)(f)(target) = setMap(fixedSet, f, target)

That is, if I have a fixed set $A$, I can call the function like so

val makeObject = makeArrow(fixedSet)

And use this function to create all of my objects. A morphism is then just a pair of set-maps with a connecting function. Note how similar this snippet of code looks structurally to the other morphism datatypes we’ve defined. This should reinforce the point that morphisms really aren’t any different in this abstract category!

datatype 'a TriangleDiagram =
triangle of ('a SetMap) * ('a -> 'a) * ('a SetMap)

And the corresponding functions and category instantiation are

fun triangleSource(triangle(x, f, y)) = x
fun triangleTarget(triangle(x, f, y)) = y
fun triangleIdentity(x) = triangle(x, (fn z => z), x)
fun triangleCompose(triangle(b', g, c), triangle(a, f, b)) =
if setEq(setTarget(b'))(setTarget(b)) then
triangle(a, (g o f), c)
else
raise uncomposable

val SetDiagrams =
category(triangleSource, triangleTarget, triangleIdentity, triangleCompose)

Notice how we cannot compare the objects in this category for equality! Indeed, two functions cannot be compared in ML, and the best we can do is compare the targets of these functions to ensure that the connecting morphisms can be composed. A malicious programmer might then be able to compose uncomposable morphisms, a devious and startling proposition.

Notice further how we can’t avoid using set-specific functions with our current definition of a category. What we could do instead is require a category to have a function which checks for composability. Then compose could be written in a more homogeneous (but not quite completely parametric fashion). The reader is welcome to try this on their own, but we’ll soon have a more organized way to do this later, when we package up a category into an ML structure.

## Categories as Signatures

Let’s look at a more advanced feature of ML and see how we can apply it to representing categories in code. The idea is familiar to most Java and C++ programmers, and that is of an interface: the programmer specifies an abstract type which has specific functions attached to it, but does not implement those functions. Then some (perhaps different) programmer implements the interface by defining those functions for a concrete type.

For the reader who isn’t so interested in these features of ML, feel free to skip this section. We won’t be introducing any new examples of categories, just rephrasing what we’ve done above in a different way. It is also highly debatable whether this new way is actually any better (it’s certainly longer).

The corresponding concept in ML is called a signature (interface) and a structure (concrete instance). The only difference is that a structure is removed one step further from ML’s type system. This is in a sense necessary: a structure should be able to implement the requirements for many different signatures, and a signature should be able to have many different structures implement it. So there is no strict hierarchy of types to lean on (it’s just a many-to-many relationship).

On the other hand, not having any way to think of a structure as a type at all would be completely useless. The whole point is to be able to write function which manipulate a structure (concrete instance) by abstractly accessing the signature’s (interface’s) functions regardless of the particular implementation details. And so ML adds in another layer of indirection: the functor (don’t confuse this with categorical functors, which we haven’t talked abou yet). A functor in ML is a procedure which accepts as input a structure and produces as output another structure.

Let’s see this on a simple example. We can define a structure for “things that have magnitudes,” which as a signature would be

signature MAG_OBJ =
sig
type object
val mag : object -> int
end

This introduced a few new language forms: the “signature” keyword is like “fun” or “val,” it just declares the purpose of the statement as a whole. The “sig … end” expression is what actually creates the signature, and the contents are a list of type definitions, datatype definitions, and value/function type definitions. The reader should think of this as a named “type” for structures, analogous to a C struct.

To implement a signature in ML is to “ascribe” it. That is, we can create a structure that defines the functions laid out in a signature (and perhaps more). Here is an example of a structure for a mag object of integers:

structure IntMag : MAG_OBJ =
struct
type object = int
fun mag(x) = if x >= 0 then x else ~x
end

The colon is a type ascription, and at the compiler-level it goes through the types and functions defined in this structure and attempts to match them with the required ones from the MAG_OBJ signature. If it fails, it raises a compiler error. A more detailed description of this process can be found here. We can then use the structure as follows:

local open IntMag in
val y = ~7
val z = mag(y)
end

The open keyword binds all of the functions and types to the current environment (and so it’s naturally a bad idea to open structures in the global environment). The “local” construction is made to work specifically with open.

In any case, the important part about signatures and structures is that one can write functions which accept as input structures with a given signature and produce structures with other signatures. The name for this is a “functor,” but we will usually call it an “ML functor” to clarify that it is not a categorical functor (if you don’t know what a categorical functor is, don’t worry yet).

For example, here is a signature for a “distance object”

signature DIST_OBJ =
sig
type object
val dist: object * object -> int
end

And a functor which accepts as input a MAG_OBJ and creates a DIST_OBJ in a contrived and silly way.

functor MakeDistObj(structure MAG: MAG_OBJ) =
struct
type object = MAG.object
val dist = fn (x, y) =>
let
val v = MAG.mag(x) - MAG.mag(y)
in
if v > 0 then v else ~v
end
end

Here the argument to the functor is a structure called MAG, and we need to specify which type is to be ascribed to it; in this case, it’s a MAG_OBJ. Only things within the specified signature can be used within the struct expression that follows (if one needs a functor to accept as input something which has multiple type ascriptions, one can create a new signature that “inherits” from both). Then the right hand side of the functor is just a new structure definition, where the innards are allowed to refer to the properties of the given MAG_OBJ.

We could in turn define a functor which accepts a DIST_OBJ as input and produces as output another structure, and compose the two functors. This is started to sound a bit category-theoretical, but we want to remind the reader that a “functor” in category theory is quite different from a “functor” in ML (in particular, we have yet to mention the former, and the latter is just a mechanism to parameterize and modularize code).

In any case, we can do the same thing with a category. A general category will be represented by a signature, and a particular instance of a category is a structure with the implemented pieces.

signature CATEGORY =
sig
type object
type arrow
exception uncomposable

val source: arrow -> object
val target: arrow -> object
val identity: object -> arrow
val composable: arrow * arrow -> bool
val compose: arrow * arrow -> arrow
end

As it turns out, ML has an implementation of “sets” already, and it uses signatures. So instead of using our rinky-dink implementation of sets from earlier in this post, we can implement an instance of the ORD_KEY signature, and then call one of the many set-creating functors available to us. We’ll use an implementation based on red-black trees for now.

First the key:

structure OrderedInt : ORD_KEY =
struct
type ord_key = int
val compare = Int.compare
end

then the functor:

functor MakeSetCategory(structure ELT: ORD_KEY) =
struct
structure Set:ORD_SET = RedBlackSetFn(ELT)
type elt = ELT.ord_key
type object = Set.set
datatype arrow = function of object * (elt -> elt) * object

exception uncomposable

fun source(function(a, _, _)) = a
fun target(function(_, _, b)) = b
fun identity(a) = function(a, fn x => x, a)
fun composable(a,b) = Set.equal(a,b)
fun compose(function(b2, g, c), function(a, f, b1)) =
if composable(b1, b2) then
function(a, (g o f), c)
else
raise uncomposable
fun apply(function(a, f, b), x) = f(x)
end

The first few lines are the most confusing; the rest is exactly what we’ve seen from our first go-round of defining the category of sets. In particular, we call the RedBlackSetFn functor given the ELT structure, and it produces an implementation of sets which we name Set (and superfluously ascribe the type ORD_SET for clarity).

Then we define the “elt” type which is used to describe an arrow, the object type as the main type of an ORD_SET (see in this documentation that this is the only new type available in that signature), and the arrow type as we did earlier.

We can then define the category of sets with a given type as follows

structure IntSets = MakeSetCategory(structure ELT = OrderedInt)

The main drawback of this approach is that there’s so much upkeep! Every time we want to make a new kind of category, we need to define a new functor, and every time we want to make a new kind of category of sets, we need to implement a new kind of ORD_KEY. All of this signature and structure business can be quite confusing; it often seems like the compiler doesn’t want to cooperate when it should.

## Nothing Too Shocking So Far

To be honest, we haven’t really done anything very interesting so far. We’ve seen a single definition (of a category), looked at a number of examples to gain intuition, and given a flavor of how these things will turn into code.

Before we finish, let’s review the pros and cons of a computational representation of category-theoretical concepts.

Pros:

• We can prove results by explicit construction (more on this next time).
• Different-looking categories are structurally similar in code.
• We can faithfully represent the idea of using (objects and morphisms of) categories as parameters to construct other categories.
• Writing things in code gives a fuller understanding of how they work.

Cons:

• All computations are finite, requiring us to think much harder than a mathematician about the definition of an object.
• The type system is too weak. we can’t enforce the axioms of a category directly or even ensure the functions act sensibly at all. As such, the programmer becomes responsible for any failure that occur from bad definitions.
• The type system is too strong. when we work concretely we are forced to work within homogeneous categories (e.g. of ‘a Set).
• We cannot ensure the ability to check equality on objects. This showed up in our example of diagram categories.
• The functions used in defining morphisms, e.g. in the category of sets, are somewhat removed from real set-functions. For example, nothing about category theory requires functions to be computable. Moreover, nothing about our implementation requires the functions to have any outputs at all (they may loop infinitely!). Moreover, it is not possible to ensure that any given function terminates on a given input set (this is the Halting problem).

This list looks quite imbalanced, but one might argue that the cons are relatively minor compared to the pros. In particular (and this is what this author hopes to be the case), being able to explicitly construct proofs to theorems in category theory will give one a much deeper understanding both of category theory and of programming. This is the ultimate prize.

Next time we will begin our investigation of universal properties (where the real fun starts), and we’ll use the programmatic constructions we laid out in this post to give constructive proofs that various objects are universal.

Until then!

# Depth- and Breadth-First Search

The graph is among the most common data structures in computer science, and it’s unsurprising that a staggeringly large amount of time has been dedicated to developing algorithms on graphs. Indeed, many problems in areas ranging from sociology, linguistics, to chemistry and artificial intelligence can be translated into questions about graphs. It’s no stretch to say that graphs are truly ubiquitous. Even more, common problems often concern the existence and optimality of paths from one vertex to another with certain properties.

Of course, in order to find paths with certain properties one must first be able to search through graphs in a structured way. And so we will start our investigation of graph search algorithms with the most basic kind of graph search algorithms: the depth-first and breadth-first search. These kinds of algorithms existed in mathematics long before computers were around. The former was ostensibly invented by a man named Pierre Tremaux, who lived around the same time as the world’s first formal algorithm designer Ada Lovelace. The latter was formally discovered much later by Edward F. Moore in the 50’s. Both were discovered in the context of solving mazes.

These two algorithms nudge gently into the realm of artificial intelligence, because at any given step they will decide which path to inspect next, and with minor modifications we can “intelligently” decide which to inspect next.

Of course, this primer will expect the reader is familiar with the basic definitions of graph theory, but as usual we provide introductory primers on this blog. In addition, the content of this post will be very similar to our primer on trees, so the familiar reader may benefit from reading that post as well. As usual, all of the code used in this post is available on this blog’s Github page.

## The Basic Graph Data Structure

We present the basic data structure in both mathematical terms and explicitly in Python.

Definition: A directed graph is a triple $G = (V, E, \varphi)$ where $V$ is a set of vertices, $E$ is a set of edges, and $\varphi: E \to V \times V$ is the adjacency function, which specifies which edges connect which vertices (here edges are ordered pairs, and hence directed).

We will often draw pictures of graphs instead of explicitly specifying their adjacency functions:

That is, the vertex set is $\left \{ A,B,C,D,E,F \right \}$, the edge set (which is unlabeled in the picture) has size 6, and the adjacency function just formalizes which edges connect which vertices.

An undirected graph is a graph in which the edges are (for lack of a better word) undirected. There are two ways to realize this rigorously: one can view the codomain of the adjacency function $\varphi$ as the set of subsets of size 2 of $V$ as opposed to ordered pairs, or one can enforce that whenever $\varphi(e) = (v_1, v_2)$ is a directed edge then so is $(v_2, v_1)$. In our implementations we will stick to directed graphs, and our data structure will extend nicely to use the second definition if undirectedness is needed.

For the purpose of finding paths and simplicity in our derivations, we will impose two further conditions. First, the graphs must be simple. That is, no graph may have more than one edge between two vertices or self-adjacent vertices (that is, $(v,v)$ can not be an edge for any vertex $v$). Second, the graphs must be connected. That is, all pairs of vertices must have a path connecting them.

Our implementation of these algorithms will use a variation of the following Python data structure. This code should be relatively self-explanatory, but the beginning reader should consult our primers on the Python programming language for a more thorough explanation of classes and lists.

class Node:
def __init__(self, value):
self.value = value
self.adjacentNodes = []

The entire graph will be accessible by having a reference to one vertex (this is guaranteed by connectivity). The vertex class is called Node, and it will contain as its data a value of arbitrary type and a list of neighboring vertices. For now the edges are just implicitly defined (there is an edge from $v$ to $w$ if the latter shows up in the “adjacentNodes” list of the former), but once we need edges with associated values we will have to improve this data structure to one similar to what we used in our post on neural networks.

That is, one must update the adjacentNodes attribute of each Node by hand to add edges to the graph. There are other data structures for graphs that allow one to refer to any vertex at any time, but for our purposes the constraint of not being able to do that is more enlightening. The algorithms we investigate will use no more and no less than what we have. At each stage we will inspect the vertex we currently have access to, and pick some action based on that.

Enough talk. Let’s jump right in!

## Depth-First Search

For the remainder of this section, the goal is to determine if there is a vertex in the graph with the associated value being the integer 6. This can also be phrased more precisely as the question: “is there a path from the given node to a node with value 6?” (For connected. undirected graphs the two questions are equivalent.)

Our first algorithm will solve this problem quite nicely, and is called the depth-first search. Depth-first search is inherently a recursion:

1. Start at a vertex.
2. Pick any unvisited vertex adjacent to the current vertex, and check to see if this is the goal.
3. If not, recursively apply the depth-first search to that vertex, ignoring any vertices that have already been visited.
4. Repeat until all adjacent vertices have been visited.

This is probably the most natural algorithm in terms of descriptive simplicity. Indeed, in the case that our graph is a tree, this algorithm is precisely the preorder traversal.

Aside from keeping track of which nodes have already been visited, the algorithm is equally simple:

def depthFirst(node, soughtValue):
if node.value == soughtValue:
return True



Of course, supposing 6 is not found, any graph which contains a cycle (or any nontrivial, connected, undirected graph) will cause this algorithm to loop infinitely. In addition, and this is a more subtle engineering problem, graphs with a large number of vertices will cause this function to crash by exceeding the maximum number of allowed nested function calls.

To avoid the first problem we can add an extra parameter to the function: a Python set type which contains the set of Nodes which have already been visited. Python sets are the computational analogue of mathematical sets, meaning that their contents are unordered and have no duplicates. And functionally Python sets have fast checks for inclusion and addition operations, so this fits the bill quite nicely.

The updated code is straightforward:

def depthFirst(node, soughtValue, visitedNodes):
if node.value == soughtValue:
return True

if adjNode not in visitedNodes:
if depthFirst(adjNode, soughtValue, visitedNodes):
return True

return False


There are a few tricky things going on in this code snippet. First, after checking the current Node for the sought value, we add the current Node to the set of visitedNodes. While subsequently iterating over the adjacent Nodes we check to make sure the Node has not been visited before recursing. Since Python passes these sets by reference, changes made to visitedNodes deep in the recursion persist after the recursive call ends. That is, much the same as lists in Python, these updates mutate the object.

Second, this algorithm is not crystal clear on how the return values operate. Each recursive call returns True or False, but because there are arbitrarily many recursive calls made at each vertex, we can’t simply return the result of a recursive call. Instead, we can only know that we’re done entirely when a recursive call specifically returns True (hence the test for it in the if statement). Finally, after all recursive calls have terminated (and they’re all False), the end of the function defaults to returning False; in this case the sought vertex was never found.

Let’s try running our fixed code with some simple examples. In the following example, we have stored in the variable G the graph given in the picture in the previous section.

>>> depthFirst(G, "A")
True
>>> depthFirst(G, "E")
True
>>> depthFirst(G, "F")
True
>>> depthFirst(G, "X")
False

Of course, this still doesn’t fix the problem with too many recursive calls; a graph which is too large will cause an error because Python limits the number of recursive calls allowed. The next and final step is a standard method of turning a recursive algorithm into a non-recursive one, and it requires the knowledge of a particular data structure called a stack.

In the abstract, one might want a structure which stores a collection of items and has the following operations:

• You can quickly add something to the structure.
• You can quickly remove the most recently added item.

Such a data structure is called a stack. By “quickly,” we really mean that these operations are required to run in constant time with respect to the size of the list (it shouldn’t take longer to add an item to a long list than to a short one). One imagines a stack of pancakes, where you can either add a pancake to the top or remove one from the top; no matter how many times we remove pancakes, the one on top is always the one that was most recently added to the stack. These two operations are called push (to add) and pop (to remove). As a completely irrelevant aside, this is not the only algorithmic or mathematical concept based on pancakes.

In other languages, one might have to implement such a data structure by hand. Luckily for us, Python’s lists double as stacks (although in the future we plan some primers on data structure design). Specifically, the append function of a list is the push operation, and Python lists have a special operation uncoincidentally called pop which removes the last element from a list and returns it to the caller.

Here is some example code showing this in action:

>>> L = [1,2,3]
>>> L.append(9)
>>> L
[1,2,3,9]
>>> L.pop()
9
>>> L
[1,2,3]

Note that pop modifies the list and returns a value, while push/append only modifies the list.

It turns out that the order in which we visit the vertices in the recursive version of the depth-first search is the same as if we had done the following. At each vertex, push the adjacent vertices onto a stack in the reverse order that you iterate through the list. To choose which vertex to process next, simple pop whatever is on top of the stack, and process it (taking the stack with you as you go). Once again, we have to worry about which vertices have already been visited, and that part of the algorithm remains unchanged.

For example, say we have the following graph:

Starting at vertex 1, which is adjacent to vertices 2, 3 and 4, we push 4 onto the stack, then 3, then 2. Next, we pop vertex 2 and iteratively process 2. At this point the picture looks like this:

Since 2 is connected to 4 and also 5, we push 5 and then 4 onto the stack. Note that 4 is in the stack twice (this is okay, since we are maintaining a set of visited vertices). Now the important part is that since we added vertex 5 and 4 after adding vertex 3, those will be processed before vertex 3. That is, the neighbors of more recently visited vertices (vertex 2) have a preference in being processed over the remaining neighbors of earlier ones (vertex 1). This is precisely the idea of recursion: we don’t finish the recursive call until all of the neighbors are processed, and that in turn requires the processing of all of the neighbors’ neighbors, and so on.

As a quick side note: it should be clear by now that the order in which we visit adjacent nodes is completely arbitrary in both versions of this algorithm. There is no inherent ordering on the edges of a vertex in a graph, and so adding them in reverse order is simply a way for us to mentally convince ourselves that the same preference rules apply with the stack as with recursion. That is, whatever order we visit them in the recursive version, we must push them onto the stack in the opposite order to get an identical algorithm. But in isolation neither algorithm requires a particular order. So henceforth, we will stop adding things in “reverse” order in the stack version.

Now the important part is that once we have converted the recursive algorithm into one based on a stack, we can remove the need for recursion entirely. Instead, we use a loop that terminates when the stack is empty:

def depthFirst(startingNode, soughtValue):
visitedNodes = set()
stack = [startingNode]

while len(stack) > 0:
node = stack.pop()
if node in visitedNodes:
continue

if node.value == soughtValue:
return True

for n in node.adjacentNodes:
if n not in visitedNodes:
stack.append(n)
return False


This author particularly hates the use of “continue” in while loops, but its use here is better than any alternative this author can think of. For those unfamiliar: whenever a Python program encounters the continue statement in a loop, it skips the remainder of the body of the loop and begins the next iteration. One can also combine the last three lines of code into one using the lists’s extend function in combination with a list comprehension. This should be an easy exercise for the reader.

Moreover, note that this version of the algorithm removes the issue with the return values. It is quite easy to tell when we’ve found the required node or determined it is not in the graph: if the loop terminates naturally (that is, without hitting a return statement), then the sought value doesn’t exist.

The reliance of this algorithm on a data structure is not an uncommon thing. In fact, the next algorithm we will see cannot be easily represented as a recursive phenomenon; the order of traversal is simply too different. Instead, it will be almost identical to the stack-form of the depth-first search, but substituting a queue for a stack.

As the name suggests, the breadth-first search operates in the “opposite” way from the depth-first search. Intuitively the breadth-first search prefers to visit the neighbors of earlier visited nodes before the neighbors of more recently visited ones. Let us reexamine the example we used in the depth-first search to see this change in action.

Starting again with vertex 1, we add 4, 3, and 2 (in that order) to our data structure, but now we prefer the first thing added to our data structure instead of the last. That is, in the next step we visit vertex 4 instead of vertex 2. Since vertex 4 is adjacent to nobody, the recursion ends and we continue with vertex 3.

Now vertex 3 is adjacent to 5, so we add 5 to the data structure. At this point the state of the algorithm can be displayed like this:

The “Data Structure” has the most recently added items on top. A red “x” denotes a vertex which has already been visited by the algorithm at this stage.

That is, and this is the important bit, we process vertex 2 before we process vertex 5. Notice the pattern here: after processing vertex 1, we processed all of the neighbors of vertex 1 before processing any vertices not immediately adjacent to vertex one. This is where the “breadth” part distinguishes this algorithm from the “depth” part. Metaphorically, a breadth-first search algorithm will look all around a vertex before continuing on into the depths of a graph, while the depth-first search will dive straight to the bottom of the ocean before looking at where it is. Perhaps one way to characterize these algorithms is to call breadth-first cautious, and depth-first hasty. Indeed, there are more formal ways to make these words even more fitting that we will discuss in the future.

The way that we’ll make these rules rigorous is in the data-structure version of the algorithm: instead of using a stack we’ll use a queue. Again in the abstract, a queue is a data structure for which we’d like the following properties:

• We can quickly add items to the queue.
• We can quickly remove the least recently added item.

The operations on a queue are usually called enqueue (for additions) and dequeue (for removals).

Again, Python’s lists have operations that functionally make them queues, but the analogue of the enqueue operation is not efficient (specifically, it costs $O(n)$ for a list of size $n$). So instead we will use Python’s special deque class (pronounced “deck”). Deques are nice because they allow fast addition and removal from both “ends” of the structure. That is, deques specify a “left” end and a “right” end, and there are constant-time operations to add and remove from both the left and right ends.

Hence the enqueue operation we will use for a deque is called “appendleft,” and the dequeue operation is (unfortunately) called “pop.”

>>> from collections import deque
>>> queue = deque()
>>> queue.appendleft(7)
>>> queue.appendleft(4)
>>> queue
[4,7]
>>> queue.pop()
7
>>> queue
[4]

Note that a deque can also operate as a stack (it also has an append function with functions as the push operation). So in the following code for the breadth-first search, the only modification required to make it a depth-first search is to change the word “appendleft” to “append” (and to update the variable names from “queue” to “stack”).

And so the code for the breadth-first search algorithm is essentially identical:

from collections import deque

visitedNodes = set()
queue = deque([startingNode])

while len(queue) > 0:
node = queue.pop()
if node in visitedNodes:
continue

if node.value == soughtValue:
return True

for n in node.adjacentNodes:
if n not in visitedNodes:
queue.appendleft(n)
return False


As in the depth-first search, one can combine the last three lines into one using the deque’s extendleft function.

We leave it to the reader to try some examples of running this algorithm (we repeated the example for the depth-first search in our code, but omit it for brevity).

## Generalizing

After all of this exploration, it is clear that the depth-first search and the breadth-first search are truly the same algorithm. Indeed, the only difference is in the data structure, and this can be abstracted out of the entire procedure. Say that we have some data structure that has three operations: add, remove, and len (the Pythonic function for “query the size”). Then we can make a search algorithm that uses this structure without knowing how it works on the inside. Since words like stack, queue, and heap are already taken for specific data structures, we’ll call this arbitrary data structure a pile. The algorithm might look like the following in Python:

def search(startingNode, soughtValue, pile):
visitedNodes = set()
nodePile = pile()

while len(nodePile) > 0:
node = nodePile.remove()
if node in visitedNodes:
continue

if node.value == soughtValue:
return True

for n in node.adjacentNodes:
if n not in visitedNodes:
return False


Note that the argument “pile” passed to this function is the constructor for the data type, and one of the first things we do is call it to create a new instance of the data structure for use in the rest of the function.

And now, if we wanted, we could recreate the depth-first search and breadth-first search as special cases of this algorithm. Unfortunately this would require us to add new methods to a deque, which is protected from such devious modifications by the Python runtime system. Instead, we can create a wrapper class as follows:

from collections import deque

class MyStack(deque):