# On Coloring Resilient Graphs

I’m pleased to announce that another paper of mine is finished. This one just got accepted to MFCS 2014, which is being held in Budapest this year (this whole research thing is exciting!). This is joint work with my advisor, Lev Reyzin. As with my first paper, I’d like to explain things here on my blog a bit more informally than a scholarly article allows.

## A Recent History of Graph Coloring

One of the first important things you learn when you study graphs is that coloring graphs is hard. Remember that coloring a graph with $k$ colors means that you assign each vertex a color (a number in $\left \{ 1, 2, \dots, k \right \}$) so that no vertex is adjacent to a vertex of the same color (no edge is monochromatic). In fact, even deciding whether a graph can be colored with just $3$ colors (not to mention finding such a coloring) has no known polynomial time algorithm. It’s what’s called NP-hard, which means that almost everyone believes it’s hopeless to solve efficiently in the worst case.

One might think that there’s some sort of gradient to this problem, that as the graphs get more “complicated” it becomes algorithmically harder to figure out how colorable they are. There are some notions of “simplicity” and “complexity” for graphs, but they hardly fall on a gradient. Just to give the reader an idea, here are some ways to make graph coloring easy:

• Make sure your graph is planar. Then deciding 4-colorability is easy because the answer is always yes.
• Make sure your graph is triangle-free and planar. Then finding a 3-coloring is easy.
• Make sure your graph is perfect (which again requires knowledge about how colorable it is).
• Make sure your graph has tree-width or clique-width bounded by a constant.
• Make sure your graph doesn’t have a certain kind of induced subgraph (such as having no induced paths of length 4 or 5).

Let me emphasize that these results are very difficult and tricky to compare. The properties are inherently discrete (either perfect or imperfect, planar or not planar). The fact that the world has not yet agreed upon a universal measure of complexity for graphs (or at least one that makes graph coloring easy to understand) is not a criticism of the chef but a testament to the challenge and intrigue of the dish.

Coloring general graphs is much bleaker, where the focus has turned to approximations. You can’t “approximate” the answer to whether a graph is colorable, so now the key here is that we are actually trying to find an approximate coloring. In particular, if you’re given some graph $G$ and you don’t know the minimum number of colors needed to color it (say it’s $\chi(G)$, this is called the chromatic number), can you easily color it with what turns out to be, say, $2 \chi(G)$ colors?

Garey and Johnson (the gods of NP-hardness) proved this problem is again hard. In fact, they proved that you can’t do better than twice the number of colors. This might not seem so bad in practice, but the story gets worse. This lower bound was improved by Zuckerman, building on the work of Håstad, to depend on the size of the graph! That is, unless $P=NP$, all efficient algorithms will use asymptotically more than $\chi(G) n^{1 - \varepsilon}$ colors for any $\varepsilon > 0$ in the worst case, where $n$ is the number of vertices of $G$. So the best you can hope for is being off by something like a multiplicative factor of $n / \log n$. You can actually achieve this (it’s nontrivial and takes a lot of work), but it carries that aura of pity for the hopeful graph colorer.

The next avenue is to assume you know the chromatic number of your graph, and see how well you can do then. For example: if you are given the promise that a graph $G$ is 3-colorable, can you efficiently find a coloring with 8 colors? The best would be if you could find a coloring with 4 colors, but this is already known to be NP-hard.

The best upper bounds, algorithms to find approximate colorings of 3-colorable graphs, also pitifully depend on the size of the graph. Remember I say pitiful not to insult the researchers! This decades-long line of work was extremely difficult and deserves the highest praise. It’s just frustrating that the best known algorithm to color a 3-colorable graph requires as many as $n^{0.2}$ colors. At least it bypasses the barrier of $n^{1 - \varepsilon}$ mentioned above, so we know that knowing the chromatic number actually does help.

The lower bounds are a bit more hopeful; it’s known to be NP-hard to color a $k$-colorable graph using $2^{\sqrt[3]{k}}$ colors if $k$ is sufficiently large. There are a handful of other linear lower bounds that work for all $k \geq 3$, but to my knowledge this is the best asymptotic result. The big open problem (which I doubt many people have their eye on considering how hard it seems) is to find an upper bound depending only on $k$. I wonder offhand whether a ridiculous bound like $k^{k^k}$ colors would be considered progress, and I bet it would.

## Our Idea: Resilience

So without big breakthroughs on the front of approximate graph coloring, we propose a new front for investigation. The idea is that we consider graphs which are not only colorable, but remain colorable under the adversarial operation of adding a few new edges. More formally,

Definition: A graph $G = (V,E)$ is called $r$-resiliently $k$-colorable if two properties hold

1. $G$ is $k$-colorable.
2. For any set $E'$ of $r$ edges disjoint from $E$, the graph $G' = (V, E \cup E')$ is $k$-colorable.

The simplest nontrivial example of this is 1-resiliently 3-colorable graphs. That is a graph that is 3-colorable and remains 3-colorable no matter which new edge you add. And the question we ask of this example: is there a polynomial time algorithm to 3-color a 1-resiliently 3-colorable graph? We prove in our paper that this is actually NP-hard, but it’s not a trivial thing to see.

The chief benefit of thinking about resiliently colorable graphs is that it provides a clear gradient of complexity from general graphs (zero-resilient) to the empty graph (which is $(\binom{k+1}{2} - 1)$-resiliently $k$-colorable). We know that the most complex case is NP-hard, and maximally resilient graphs are trivially colorable. So finding the boundary where resilience makes things easy can shed new light on graph coloring.

Indeed, we argue in the paper that lots of important graphs have stronger resilience properties than one might expect. For example, here are the resilience properties of some famous graphs.

From left to right: the Petersen graph, 2-resiliently 3-colorable; the Dürer graph, 4-resiliently 4-colorable; the Grötzsch graph, 4-resiliently 4-colorable; and the Chvátal graph, 3-resiliently 4-colorable. These are all maximally resilient (no graph is more resilient than stated) and chromatic (no graph is colorable with fewer colors)

If I were of a mind to do applied graph theory, I would love to know about the resilience properties of graphs that occur in the wild. For example, the reader probably knows the problem of register allocation is a natural graph coloring problem. I would love to know the resilience properties of such graphs, with the dream that they might be resilient enough on average to admit efficient coloring algorithms.

Unfortunately the only way that I know how to compute resilience properties is via brute-force search, and of course this only works for small graphs and small $k$. If readers are interested I could post such a program (I wrote it in vanilla python), but for now I’ll just post a table I computed on the proportion of small graphs that have various levels of resilience (note this includes graphs that vacuously satisfy the definition).

Percentage of k-colorable graphs on 6 vertices which are n-resilient
k\n       1       2       3       4
----------------------------------------
3       58.0    22.7     5.9     1.7
4       93.3    79.3    58.0    35.3
5       99.4    98.1    94.8    89.0
6      100.0   100.0   100.0   100.0

Percentage of k-colorable graphs on 7 vertices which are n-resilient
k\n       1       2       3       4
----------------------------------------
3       38.1     8.2     1.2     0.3
4       86.7    62.6    35.0    14.9
5       98.7    95.6    88.5    76.2
6       99.9    99.7    99.2    98.3

Percentage of k-colorable graphs on 8 vertices which are n-resilient
k\n       1       2       3       4
----------------------------------------
3       21.3     2.1     0.2     0.0
4       77.6    44.2    17.0     4.5

The idea is this: if this trend continues, that only some small fraction of all 3-colorable graphs are, say, 2-resiliently 3-colorable graphs, then it should be easy to color them. Why? Because resilience imposes structure on the graphs, and that structure can hopefully be realized in a way that allows us to color easily. We don’t know how to characterize that structure yet, but we can give some structural implications for sufficiently resilient graphs.

For example, a 7-resiliently 5-colorable graph can’t have any subgraphs on 6 vertices with $\binom{6}{2} - 7$ edges, or else we can add enough edges to get a 6-clique which isn’t 5-colorable. This gives an obvious general property about the sizes of subgraphs in resilient graphs, but as a more concrete instance let’s think about 2-resilient 3-colorable graphs $G$. This property says that no set of 4 vertices may have more than $4 = \binom{4}{2} - 2$ edges in $G$. This rules out 4-cycles and non-isolated triangles, but is it enough to make 3-coloring easy? We can say that $G$ is a triangle-free graph and a bunch of disjoint triangles, but it’s known 3-colorable non-planar triangle-free graphs can have arbitrarily large chromatic number, and so the coloring problem is hard. Moreover, 2-resilience isn’t enough to make $G$ planar. It’s not hard to construct a non-planar counterexample, but proving it’s 2-resilient is a tedious task I relegated to my computer.

Speaking of which, the problem of how to determine whether a $k$-colorable graph is $r$-resiliently $k$-colorable is open. Is this problem even in NP? It certainly seems not to be, but if it had a nice characterization or even stronger necessary conditions than above, we might be able to use them to find efficient coloring algorithms.

In our paper we begin to fill in a table whose completion would characterize the NP-hardness of coloring resilient graphs

The known complexity of k-coloring r-resiliently k-colorable graphs

Ignoring the technical notion of 2-to-1 hardness, the paper accomplishes this as follows. First, we prove some relationships between cells. In particular, if a cell is NP-hard then so are all the cells to the left and below it. So our Theorem 1, that 3-coloring 1-resiliently 3-colorable graphs is NP-hard, gives us the entire black region, though more trivial arguments give all except the (3,1) cell. Also, if a cell is in P (it’s easy to $k$-color graphs with that resilience), then so are all cells above and to its right. We prove that $k$-coloring $\binom{k}{2}$-resiliently $k$-colorable graphs is easy. This is trivial: no vertex may have degree greater than $k-1$, and the greedy algorithm can color such graphs with $k$ colors. So that gives us the entire light gray region.

There is one additional lower bound comes from the fact that it’s NP-hard to $2^{\sqrt[3]{k}}$-color a $k$-colorable graph. In particular, we prove that if you have any function $f(k)$ that makes it NP-hard to $f(k)$-color a $k$-colorable graph, then it is NP-hard to $f(k)$-color an $(f(k) - k)$-resiliently $f(k)$-colorable graph. The exponential lower bound hence gives us a nice linear lower bound, and so we have the following “sufficiently zoomed out” picture of the table

The zoomed out version of the classification table above.

The paper contains the details of how these observations are proved, in addition to the NP-hardness proof for 1-resiliently 3-colorable graphs. This leaves the following open problems:

• Get an unconditional, concrete linear resilience lower bound for hardness.
• Find an algorithm that colors graphs that are less resilient than $O(k^2)$. Even determining specific cells like (4,5) or (5,9) would likely give enough insight for this.
• Classify the tantalizing (3,2) cell (determine if it’s hard or easy to 3-color a 2-resiliently 3-colorable graph) or even better the (4,2) cell.
• Find a way to relate resilient coloring back to general coloring. For example, if such and such cell is hard, then you can’t approximate k-coloring to within so many colors.

## But Wait, There’s More!

Though this paper focuses on graph coloring, our idea of resilience doesn’t stop there (and this is one reason I like it so much!). One can imagine a notion of resilience for almost any combinatorial problem. If you’re trying to satisfy boolean formulas, you can define resilience to mean that you fix the truth value of some variable (we do this in the paper to build up to our main NP-hardness result of 3-coloring 1-resiliently 3-colorable graphs). You can define resilient set cover to allow the removal of some sets. And any other sort of graph-based problem (Traveling salesman, max cut, etc) can be resiliencified by adding or removing edges, whichever makes the problem more constrained.

So this resilience notion is quite general, though it’s hard to define precisely in a general fashion. There is a general framework called Constraint Satisfaction Problems (CSPs), but resilience here seem too general. A CSP is literally just a bunch of objects which can be assigned some set of values, and a set of constraints (k-ary 0-1-valued functions) that need to all be true for the problem to succeed. If we were to define resilience by “adding any constraint” to a given CSP, then there’s nothing to stop us from adding the negation of an existing constraint (or even the tautologically unsatisfiable constraint!). This kind of resilience would be a vacuous definition, and even if we try to rule out these edge cases, I can imagine plenty of weird things that might happen in their stead. That doesn’t mean there isn’t a nice way to generalize resilience to CSPs, but it would probably involve some sort of “constraint class” of acceptable constraints, and I don’t know a reasonable property to impose on the constraint class to make things work.

So there’s lots of room for future work here. It’s exciting to think where it will take me.

Until then!

# RealityMining, a Case Study in the Woes of Data Processing

This post is intended to be a tutorial on how to access the RealityMining dataset using Python (because who likes Matlab?), and a rant on how annoying the process was to figure out.

RealityMining is a dataset of smart-phone data logs from a group of about one hundred MIT students over the course of a year. The data includes communication and cell tower data, the latter being recorded every time a signal changes from one tower to the next. They also conducted a survey of perceived friendship, personal attributes, and a few other things. See the website for more information about what data is included. Note that the website lies: there is no lat/long location data. This is but the first sign that this is going to be a rough road.

In order to access the data yourself, you must fill out a form on the RealityMining website and wait for an email, but I’ve found their turnaround time to giving a download link is quite quick. The data comes in the form of a .mat file, “realitymining.mat”, which is a proprietary Matlab data format (warning sign number two, and largely the cause of all my problems).

Before we get started let me state my goals: to extract the person-to-person voice call data, actual proximity data (whether two subjects were connected to the same cell tower at the same time), and perceived friendship/proximity data. The resulting program is free to use and modify, and is posted on this blog’s Github page.

Once I found out that the Python scipy library has a function to load the Matlab data into Python, I was hopeful that things would be nice and user friendly. Boy was I wrong.

To load the matlab data into Python,

import scipy.io

matlab_filename = ...


Then you get a variable matlab_obj which, if you try to print it stalls the python process and you have to frantically hit the kill command to wrestle back control.

Going back a second time and being more cautious, we might inspect the type of the imported object:

>>> type(matlab_obj)
<class 'dict'>


Okay so it’s a dictionary, probably why it tried to print out its entire 2GB contents when I printed it. What are it’s keys?

>>> matlab_obj.keys()
dict_keys(['network', '__header__', '__version__', '__globals__', 's'])


So far so good (the only ones that matter are ‘network’ and ‘s’). So what is ‘s’?

>>> type(matlab_obj['s'])
<class 'numpy.ndarray'>
>>> len(matlab_obj['s'])
1


Okay it’s an array of length 1… What’s its first element?

>>> type(matlab_obj['s'][0])
<class 'numpy.ndarray'>
>>> len(matlab_obj['s'][0])
106


Okay so ‘s’ is an array of one element containing the array of actual data (I remember the study has 106 participants). Now let’s look at the first participant.

>>> type(matlab_obj['s'][0][0])
<class 'numpy.void'>


Okay… what’s a numpy.void? Searching Google for “What’s a numpy.void” gives little, but combing through Stackoverflow answers gives the following:

From what I understand, void is sort of like a Python list since it can hold objects of different data types, which makes sense since the columns in a structured array can be different data types.

So basically, numpy arrays that aren’t homogeneous are ‘voids’ and you can’t tell any information about the values. Great. Well I suppose we could just try *printing* out the value, at the cost of potentially having to kill the process and restart.

>>> matlab_obj['s'][0][0]
([], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [], [[]], [[0]], [], [[0]], [], [[0]], [], [], [], [], [], [], [], [], [], [], [], [], [[413785662906.0]], [], [], [[]], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [])


Okay… so the good news is we didn’t die, but the bad news is that there is almost no data, and there’s no way to tell what the data represents!

As a last straw let’s try printing out the second user.

>>> matlab_obj['s'][0][1]
([[(array([[ 732338.86915509]]), array([[354]], dtype=uint16), array([[-1]], dtype=int16), array(['Packet Data'],
dtype='&lt;U11'), array(['Outgoing'],
dtype='&lt;U8'), array([[0]], dtype=uint8),
...
* a thousand or so lines later *
...
[2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [0, 0], [2, 0], [3, 0], [3, 0], [3, 0], [3, 0], [3, 0]])


So this one has data, but still we have the problem that we have no idea what the various pieces of data represent! We could guess and check, but come on we’re in the 21st century.

## Hopelessness, and then Hope

At this point I gave up on a Python approach. As far as I could tell, the data was in a completely unusable format. My only recourse was to spend two weeks obtaining a copy of Matlab from my institution, spend an hour installing it, and then try to learn enough Matlab to parse out the subset of data I wanted into a universal (plaintext) format, so I could then read it back into Python and analyze it.

Two days after obtaining Matlab, I was referred to a blog post by Conrad Lee who provided the source code he used to parse the data in Python. His code was difficult to read and didn’t work out of the box, so I had to disassemble his solution and craft my own. This and a tutorial I later found, and enough elbow grease, were enough to help me figure things out.

The first big insight was that a numpy.void also acts like a dictionary! There’s no way to tell this programmatically, no keys() attribute on a numpy.void, so apparently it’s just folklore and bad documentation. To get the fields of a subject, we have to know its key (the RealityMining documentation tells you this, with enough typos to keep you guessing).

>>> subjects = matlab_obj['s'][0]
>>> subjects[77]['my_hashedNumber']
array([[78]], dtype=uint8)
>>> subjects[76]['mac']
array([[ 6.19653599e+10]])
>>> subjects[76]['mac'][0][0]
61965359909.0


The ‘mac’ and ‘my_hashedNumber’ are supposed to be identifiers for the subjects. It’s not clear why they have two such identifiers, but what’s worse is that not all participants in the study (not all subjects in the array of subjects, at least) have mac numbers or hashedNumbers. So in order to do anything I had to extract the subset of valid subjects.

def validSubjects(allSubjects):
return [s for s in allSubjects if hasNumeric(s,'mac') and hasNumeric(s,'my_hashedNumber')]


I define the hasNumeric function in the next section. But before that I wanted to keep track of all the stupid ids and produce my own id (contiguous, starting from zero, for valid subjects only) that will replace me having to deal with the stupid matlab subject array. So I did.

def idDicts(subjects):
return (dict((i, s) for (i,s) in enumerate(subjects)),
dict((getNumeric(s,'mac'), (i, s)) for (i,s) in enumerate(subjects)),
dict((getNumeric(s, 'my_hashedNumber'), (i, s)) for (i,s) in enumerate(subjects)))


Given a list of subjects (the valid ones), we create three dictionaries whose keys are the various identifiers, and whose values are the subject objects (and my own id). So we’ll pass around these dictionaries in place of the subject array.

## Extracting Communication Events

The important data here is the set of person-to-person phone calls made over the course of the study. But before we get there we need to inspect how the data is stored more closely. Before I knew about what follows, I was getting a lot of confusing index errors.

Note that the numpy object that gets returned from scipy’s loadmat function stores single-type data differently from compound data, and what’s worse is that it represents a string (which in Python should be a single-type piece of data) as a compound piece of data!

So I wrote helper functions to keep track of checking/accessing these fields. I feel like such a dirty programmer writing code like what follows, but I can still blame everyone else.

Single-type data (usually numeric):

def hasNumeric(obj, field):
try:
obj[field][0][0]
return True
except:
return False

def getNumeric(obj, field):
return obj[field][0][0]


Compound-type data (usually Arrays, but also dicts and strings):

def hasArray(obj, field):
try:
obj[field][0]
return True
except:
return False

def getArray(obj, field):
return obj[field][0]


Note the difference is for single indexing (using [0]) or multiple (using [0][0]). It occurred to me much later that there is an option to loadmat that may get rid of some or all of this doubly-nested array crap, but by then I was too lazy to change everything; it would undoubtedly be the case that some things would break and other wouldn’t (and even if that’s not the case, you can’t blame me for assuming it). I will never understand why the scipy folks chose this representation as the default.

This issue crops up when inspecting a single comm event:

>>> subjects[76]['comm'][0][0]
([[732238.1947106482]], [[0]], [[-1]], ['Voice call'], ['Outgoing'], [[16]], [[4]])


Again this record is a numpy.void with folklore keys, but notice how numeric values are doubly nested in lists while string values are not. Makes me want to puke.

Communication events can be “Incoming,” “Outgoing,” or “Missed,” although the last of these does not exist in the documentation. Not all fields exist in every record. Some are empty lists and some are double arrays containing nan, but you can check by hand that everyone that has a nonempty list of comm events has the needed fields as well:

>>> len([x for x in subjects if len(x['comm']) > 0 and len(x['comm'][0]) > 0])
81
>>> len([x for x in subjects if len(x['comm']) > 0 and len(x['comm'][0]) > 0 and
... all([hasArrayField(e, 'description') for e in x['comm'][0]])])
81


The exception is the duration field. If the duration field does not exist (if the field is an empty list), then it definitely corresponds to the “Missed” direction field.

It could otherwise be that the duration field is zero but the call was not labeled “Missed.” This might happen because they truncate the seconds, and the call lated less than one second (dubious) or because the call was also missed, or rejected without being missed. The documentation doesn’t say. Moreover, “Missed” doesn’t specify if the missed call was incoming or outgoing. So incorporating that data would ruin any chance of having a directed graph representation.

One possibility is that the outgoing calls with a duration of 0 are missed outgoing calls, and incoming calls always have positive duration, but actually there are examples of incoming calls of zero duration. Madness!

After sorting all this out (really, guessing), I was finally able to write a program that extracts the person-to-person call records in a specified date range.

First, to extract all communication events:

def allCommEvents(idDictionary):
events = []
for subjectId, subject in idDictionary.items():
if hasArray(subject, 'comm'):
events.extend([(subjectId, event) for event in getArray(subject, 'comm')])

return events


Then to extract that subset of communication events that were actually phone calls between users within the study:

def callsWithinStudy(commEvents, hashNumDict):
return [(subjectId, e) for (subjectId, e) in commEvents
if (getArray(e, 'description') == "Voice call" and
getNumeric(e, 'hashNum') in hashNumDict)]


Now we can convert the calls into a nicer format (a Python dictionary) and apply some business rules to deal with the ambiguities in the data.

def processCallEvents(callEvents, hashNumDict):
processedCallEvents = []

for subjectId, event in callEvents:
direction = getArray(event, 'direction')
duration = 0 if direction == 'Missed' else getNumeric(event, 'duration')
date = convertDatetime(getNumeric(event, 'date'))
hashNum = getNumeric(event, 'hashNum')
otherPartyId = hashNumDict[hashNum][0]

eventAsDict = {'subjectId': subjectId,
'direction': direction,
'duration': duration,
'otherPartyId': otherPartyId,
'date': date}
processedCallEvents.append(eventAsDict)

print("%d call event dictionaries" % len(processedCallEvents))
return processedCallEvents


All I needed to do was filter them by time now (implement the convertDateTime function used above)…

## Converting datetimes from Matlab to Python

Matlab timestamps are floats like

834758.1231233

The integer part of the float represents the number of days since Jan 1, 0AD. The fractional part represents a fraction of a day, down to seconds I believe. To convert from the matlab Format to the Python format, use

def convertDatetime(dt):
return datetime.fromordinal(int(dt)) + timedelta(days=dt%1) - timedelta(days=366) - timedelta(hours=5)


This is taken without permission from Conrad’s blog (I doubt he’ll care), and he documents his reasoning in a side-post.

This then allows one to filter a list of communication events by time:

def inRange(dateRange, timevalue):
start, end = dateRange
unixTime = int(time.mktime(timevalue.timetuple()))
return start <= unixTime <= end

def filterByDate(dateRange, events):
return [e for e in events if inRange(dateRange, e['date'])]


Phew! Glad I didn’t have to go through that hassle on my own.

## Extracting Survey Data

The dataset includes a survey about friendship and proximity among the participants. That means we have data on

1. Who considers each other in the same “close group of friends.”
2. An estimate on how much time at work each person spends with each other person.
3. An estimate on how much time outside work each person spends with each other person.

This information comes in the form of three lists. These lists sit inside a different numpy.void sitting in the original matlab_object from the beginning of the post.

matlab_obj = scipy.io.loadmat(matlab_filename)
matlab_obj['network'][0][0]


Now the access rules (getNumeric, getArray) we used for the matlab_obj[‘s’] matrix don’t necessarily apply to the values in the ‘network’ matrix. The friendship survey is accessed by

friends = network['friends']


and it’s an array, each element of which is a participant’s response to the survey (claimed to be stored as a 0-1 array). So, for example, to get the number corresponding to whether id1 considers id2 a close friend, you would use

friends = network['friends'][id1][id2]


This is all straightforward, but there are two complications (stupid trivialities, really). The first is that the matrix is not 0-1 valued, but (nan-1.0)-valued. That’s right, if the individual didn’t consider someone a friend, the record for that is numpy’s nan type, and if they did it’s the float 1.0. So you need to either replace all nan’s with zeros ahead of time, or take account for that in any computations you do with the data.

I don’t know whether this is caused by scipy’s bad Matlab data import or the people producing the data, but it’s a clear reason to not use proprietary file formats to distribute data.

The second problem is that the survey is stored as an array, where the index of the array corresponds to the person the respondent is evaluating for friendship. The problem with this is that not all people are in the survey (for whatever reason, people dropping out or just bad study design), so they include a separate array in the network object called “sub_sort” (a totally unhelpful name). Sub_sort contains in its $i$-th position the hashNum (of the user from the ‘s’ matrix) of “survey-user” $i$ (that is the user whose answer is represented in index $i$ of the network array).

This is incredibly confusing, so let’s explain it with an example. Say that we wanted to parse user 4’s friendship considerations. We’d first access his friends list by

friends = network['friends'][4]


Then, to see whether he considers user 7 as a friend, we’d have to look up user 7 in the sub_sort array (which requries a search through said array) and then use the index of user 7 in that array to index the friends list appropriately. So

subsortArray = network['sub_sort'][0]
user7sIndex = subsortArray.indexof(7)
network['friends'][4][user7sIndex]


The reason this is such terrible design is that an array is the wrong data structure to use. They should have used a list of dictionaries or a dictionary with ordered pairs of hashNums for keys. Yet another reason that matlab (a language designed for numerical matrix manipulation) is an abysmal tool for the job.

The other two survey items are surveys on proximity data. Specifically, they ask the respondent to report how much time per day he or she spends within 10 feet of each other individual in the study.

inLabProximity = network['lab']
outLabProximity = network['outlab']


The values in these matrices are accessed in exactly the same way, with the values being in the set {nan, 1.0, 2.0, 3.0, 4.0, 5.0}. These numbers correspond to various time ranges (the documentation is a bit awkward about this too, using “at least” in place of complete time ranges), but fine. It’s no different from the friends matrix, so at least it’s consistent.

## Cell Tower Data

Having learned all of the gotchas I did, I processed the cell tower data with little fuss. The only bothersome bit was that they provide two “parallel” arrays, one called “locs” containing pairs of (date, cell tower id), and a second called “loc_ids” containing numbers corresponding to a “second” set of simplified cell tower ids. These records correspond to times when a user entered into a region covered by a specific tower, and so it gives very (coarse location) data. It’s not the lat/long values promised by the website, and even though the ids for the cell towers are governed by a national standard, the lat/long values for the towers are not publicly or freely available in database form.

Why they need two sets of cell tower ids is beyond me, but the bothersome part is that they didn’t include the dates in the second array. So you must assume the two arrays are in sync (that the events correspond elementwise by date), and if you want to use the simplified ids you still have to access the “locs” array for the dates (because what use would the tower ids be without dates?).

One can access these fields by ignoring our established rules for accessing subject fields (so inconsistent!). And so to access the parallel arrays in a reasonable way, I used the following:

towerEvents = list(zip(subject['locs'], subject['loc_ids']))


Again the datetimes are in matlab format, so the date conversion function above is used to convert them to python datetimes.

Update: As I’ve continued to work with this data set, I’ve found out that, once again contrary to the documentation, the tower id 1 corresponds to being out of signal. The documentation claims that tower id 0 implies no signal, and this is true of the “locs” array, but for whatever reason they decided to change it to a 1 in the “loc_ids” array. What’s confusing is that there’s no mention of the connection. You have to guess it and check that the lengths of the corresponding lists of events are equal (which is by no means proof that the towers ids are the same, but good evidence).

The documentation is also unclear as to whether the (datestamp, tower id) pair represents the time when the user started using that tower, or the time when the user finished using that tower. As such, I must guess that it’s the former at the risk of producing bad data (and hence bad research).

## Conclusions

I love free data sets. I really do, and I appreciate all the work that researchers put into making and providing their data. But all the time I encounter the most horrible, terrible, no good, very bad data sets! RealityMining is only one small example, and it was actually nice because I was able to figure things out eventually. There are other datasets that have frustrated me much more with inconsistencies and complete absence of documentation.

Is it really that hard to make data useable? Here are the reasons why this was such an annoying and frustrating task:

1. The data was stored in a proprietary format. Please please please release your data in plaintext files!
2. The documentation contradicted the data it was documenting, and was altogether unclear. (One contradiction is too often. This was egregious.)
3. There was even inconsistency within the data. Sometimes missing values were empty lists, sometimes numpy.nan, and sometimes simply missing.
4. Scipy has awful defaults for the loadmat function.
5. Numpy data structures are apparently designed for opacity unless you know the folklore and spend hours poring over documentation.

It seems that most of this could be avoided by fixing 1 and 2, since then it’s a problem that an automated tool like Google Refine can fix (though I admit to never having tried Refine, because it crashed on the only dataset I’ve ever given it). I know scientists have more important things to worry about, and I could just blame it on the 2004 origin of the data set, but in the age of reproducibility and big data (and seeing that this data set is widely used in social network research), it’s just as important to release the data as it is to make the data fit for consumption. RealityMining is like a math paper with complicated terms defined only by example, and theorems stated without proof that are actually false. It doesn’t uphold the quality that science must, it makes me question the choices made by other researchers who have had to deal with the dataset, and it wastes everyone else’s time.

In this respect, I’ve decided to post my Python code in a github repository, as is usual on this blog. You’ll have to get the actual RealityMining dataset yourself if you want to use it, but once you have it you can run my program (it depends only on the numpy/scipy stack and the standard library; tested in Python 3.3.3) to get the data I’m interested in or modify it for your own purposes.

Moral of the story: For the sake of science, please don’t distribute a database as a Matlab matrix! Or anything else for that matter!