RealityMining, a Case Study in the Woes of Data Processing

This post is intended to be a tutorial on how to access the RealityMining dataset using Python (because who likes Matlab?), and a rant on how annoying the process was to figure out.

RealityMining is a dataset of smart-phone data logs from a group of about one hundred MIT students over the course of a year. The data includes communication and cell tower data, the latter being recorded every time a signal changes from one tower to the next. They also conducted a survey of perceived friendship, personal attributes, and a few other things. See the website for more information about what data is included. Note that the website lies: there is no lat/long location data. This is but the first sign that this is going to be a rough road.

In order to access the data yourself, you must fill out a form on the RealityMining website and wait for an email, but I’ve found their turnaround time to giving a download link is quite quick. The data comes in the form of a .mat file, “realitymining.mat”, which is a proprietary Matlab data format (warning sign number two, and largely the cause of all my problems).

Before we get started let me state my goals: to extract the person-to-person voice call data, actual proximity data (whether two subjects were connected to the same cell tower at the same time), and perceived friendship/proximity data. The resulting program is free to use and modify, and is posted on this blog’s Github page.

Loading realitymining.mat into Python

Once I found out that the Python scipy library has a function to load the Matlab data into Python, I was hopeful that things would be nice and user friendly. Boy was I wrong.

To load the matlab data into Python,

import scipy.io

matlab_filename = ...
matlab_obj = scipy.io.loadmat(matlab_filename)

Then you get a variable matlab_obj which, if you try to print it stalls the python process and you have to frantically hit the kill command to wrestle back control.

Going back a second time and being more cautious, we might inspect the type of the imported object:

>>> type(matlab_obj)
<class 'dict'>

Okay so it’s a dictionary, probably why it tried to print out its entire 2GB contents when I printed it. What are it’s keys?

>>> matlab_obj.keys()
dict_keys(['network', '__header__', '__version__', '__globals__', 's'])

So far so good (the only ones that matter are ‘network’ and ‘s’). So what is ‘s’?

>>> type(matlab_obj['s'])
<class 'numpy.ndarray'>
>>> len(matlab_obj['s'])
1

Okay it’s an array of length 1… What’s its first element?

>>> type(matlab_obj['s'][0])
<class 'numpy.ndarray'>
>>> len(matlab_obj['s'][0])
106

Okay so ‘s’ is an array of one element containing the array of actual data (I remember the study has 106 participants). Now let’s look at the first participant.

>>> type(matlab_obj['s'][0][0])
<class 'numpy.void'>

Okay… what’s a numpy.void? Searching Google for “What’s a numpy.void” gives little, but combing through Stackoverflow answers gives the following:

From what I understand, void is sort of like a Python list since it can hold objects of different data types, which makes sense since the columns in a structured array can be different data types.

So basically, numpy arrays that aren’t homogeneous are ‘voids’ and you can’t tell any information about the values. Great. Well I suppose we could just try *printing* out the value, at the cost of potentially having to kill the process and restart.

>>> matlab_obj['s'][0][0]
([], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [], [[]], [[0]], [], [[0]], [], [[0]], [], [], [], [], [], [], [], [], [], [], [], [], [[413785662906.0]], [], [], [[]], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [])

Okay… so the good news is we didn’t die, but the bad news is that there is almost no data, and there’s no way to tell what the data represents!

As a last straw let’s try printing out the second user.

>>> matlab_obj['s'][0][1]
([[(array([[ 732338.86915509]]), array([[354]], dtype=uint16), array([[-1]], dtype=int16), array(['Packet Data'],
dtype='<U11'), array(['Outgoing'],
dtype='<U8'), array([[0]], dtype=uint8),
...
* a thousand or so lines later *
...
[2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [2, 0], [0, 0], [2, 0], [3, 0], [3, 0], [3, 0], [3, 0], [3, 0]])

So this one has data, but still we have the problem that we have no idea what the various pieces of data represent! We could guess and check, but come on we’re in the 21st century.

Hopelessness, and then Hope

At this point I gave up on a Python approach. As far as I could tell, the data was in a completely unusable format. My only recourse was to spend two weeks obtaining a copy of Matlab from my institution, spend an hour installing it, and then try to learn enough Matlab to parse out the subset of data I wanted into a universal (plaintext) format, so I could then read it back into Python and analyze it.

Two days after obtaining Matlab, I was referred to a blog post by Conrad Lee who provided the source code he used to parse the data in Python. His code was difficult to read and didn’t work out of the box, so I had to disassemble his solution and craft my own. This and a tutorial I later found, and enough elbow grease, were enough to help me figure things out.

The first big insight was that a numpy.void also acts like a dictionary! There’s no way to tell this programmatically, no keys() attribute on a numpy.void, so apparently it’s just folklore and bad documentation. To get the fields of a subject, we have to know its key (the RealityMining documentation tells you this, with enough typos to keep you guessing).

>>> subjects = matlab_obj['s'][0]
>>> subjects[77]['my_hashedNumber']
array([[78]], dtype=uint8)
>>> subjects[76]['mac']
array([[ 6.19653599e+10]])
>>> subjects[76]['mac'][0][0]
61965359909.0

The ‘mac’ and ‘my_hashedNumber’ are supposed to be identifiers for the subjects. It’s not clear why they have two such identifiers, but what’s worse is that not all participants in the study (not all subjects in the array of subjects, at least) have mac numbers or hashedNumbers. So in order to do anything I had to extract the subset of valid subjects.

def validSubjects(allSubjects):
   return [s for s in allSubjects if hasNumeric(s,'mac') and hasNumeric(s,'my_hashedNumber')]

I define the hasNumeric function in the next section. But before that I wanted to keep track of all the stupid ids and produce my own id (contiguous, starting from zero, for valid subjects only) that will replace me having to deal with the stupid matlab subject array. So I did.

def idDicts(subjects):
   return (dict((i, s) for (i,s) in enumerate(subjects)),
      dict((getNumeric(s,'mac'), (i, s)) for (i,s) in enumerate(subjects)),
      dict((getNumeric(s, 'my_hashedNumber'), (i, s)) for (i,s) in enumerate(subjects)))

Given a list of subjects (the valid ones), we create three dictionaries whose keys are the various identifiers, and whose values are the subject objects (and my own id). So we’ll pass around these dictionaries in place of the subject array.

Extracting Communication Events

The important data here is the set of person-to-person phone calls made over the course of the study. But before we get there we need to inspect how the data is stored more closely. Before I knew about what follows, I was getting a lot of confusing index errors.

Note that the numpy object that gets returned from scipy’s loadmat function stores single-type data differently from compound data, and what’s worse is that it represents a string (which in Python should be a single-type piece of data) as a compound piece of data!

So I wrote helper functions to keep track of checking/accessing these fields. I feel like such a dirty programmer writing code like what follows, but I can still blame everyone else.

Single-type data (usually numeric):

def hasNumeric(obj, field):
   try:
      obj[field][0][0]
      return True
   except:
      return False

def getNumeric(obj, field):
   return obj[field][0][0]

Compound-type data (usually Arrays, but also dicts and strings):

def hasArray(obj, field):
   try:
      obj[field][0]
      return True
   except:
      return False

def getArray(obj, field):
   return obj[field][0]

Note the difference is for single indexing (using [0]) or multiple (using [0][0]). It occurred to me much later that there is an option to loadmat that may get rid of some or all of this doubly-nested array crap, but by then I was too lazy to change everything; it would undoubtedly be the case that some things would break and other wouldn’t (and even if that’s not the case, you can’t blame me for assuming it). I will never understand why the scipy folks chose this representation as the default.

This issue crops up when inspecting a single comm event:

>>> subjects[76]['comm'][0][0]
([[732238.1947106482]], [[0]], [[-1]], ['Voice call'], ['Outgoing'], [[16]], [[4]])

Again this record is a numpy.void with folklore keys, but notice how numeric values are doubly nested in lists while string values are not. Makes me want to puke.

Communication events can be “Incoming,” “Outgoing,” or “Missed,” although the last of these does not exist in the documentation. Not all fields exist in every record. Some are empty lists and some are double arrays containing nan, but you can check by hand that everyone that has a nonempty list of comm events has the needed fields as well:

>>> len([x for x in subjects if len(x['comm']) > 0 and len(x['comm'][0]) > 0])
81
>>> len([x for x in subjects if len(x['comm']) > 0 and len(x['comm'][0]) > 0 and
... all([hasArrayField(e, 'description') for e in x['comm'][0]])])
81

The exception is the duration field. If the duration field does not exist (if the field is an empty list), then it definitely corresponds to the “Missed” direction field.

It could otherwise be that the duration field is zero but the call was not labeled “Missed.” This might happen because they truncate the seconds, and the call lated less than one second (dubious) or because the call was also missed, or rejected without being missed. The documentation doesn’t say. Moreover, “Missed” doesn’t specify if the missed call was incoming or outgoing. So incorporating that data would ruin any chance of having a directed graph representation.

One possibility is that the outgoing calls with a duration of 0 are missed outgoing calls, and incoming calls always have positive duration, but actually there are examples of incoming calls of zero duration. Madness!

After sorting all this out (really, guessing), I was finally able to write a program that extracts the person-to-person call records in a specified date range.

First, to extract all communication events:

def allCommEvents(idDictionary):
   events = []
   for subjectId, subject in idDictionary.items():
      if hasArray(subject, 'comm'):
         events.extend([(subjectId, event) for event in getArray(subject, 'comm')])

   return events

Then to extract that subset of communication events that were actually phone calls between users within the study:

def callsWithinStudy(commEvents, hashNumDict):
   return [(subjectId, e) for (subjectId, e) in commEvents
           if (getArray(e, 'description') == "Voice call" and
              getNumeric(e, 'hashNum') in hashNumDict)]

Now we can convert the calls into a nicer format (a Python dictionary) and apply some business rules to deal with the ambiguities in the data.

def processCallEvents(callEvents, hashNumDict):
   processedCallEvents = []

   for subjectId, event in callEvents:
      direction = getArray(event, 'direction')
      duration = 0 if direction == 'Missed' else getNumeric(event, 'duration')
      date = convertDatetime(getNumeric(event, 'date'))
      hashNum = getNumeric(event, 'hashNum')
      otherPartyId = hashNumDict[hashNum][0]

      eventAsDict = {'subjectId': subjectId,
                      'direction': direction,
                      'duration': duration,
                      'otherPartyId': otherPartyId,
                      'date': date}
      processedCallEvents.append(eventAsDict)

   print("%d call event dictionaries" % len(processedCallEvents))
   return processedCallEvents

All I needed to do was filter them by time now (implement the `convertDateTime` function used above)…

Converting datetimes from Matlab to Python

Matlab timestamps are floats like

834758.1231233

The integer part of the float represents the number of days since Jan 1, 0AD. The fractional part represents a fraction of a day, down to seconds I believe. To convert from the matlab Format to the Python format, use

def convertDatetime(dt):
   return datetime.fromordinal(int(dt)) + timedelta(days=dt%1) - timedelta(days=366) - timedelta(hours=5)

This is taken without permission from Conrad’s blog (I doubt he’ll care), and he documents his reasoning in a side-post.

This then allows one to filter a list of communication events by time:

def inRange(dateRange, timevalue):
   start, end = dateRange
   unixTime = int(time.mktime(timevalue.timetuple()))
   return start <= unixTime <= end

def filterByDate(dateRange, events):
   return [e for e in events if inRange(dateRange, e['date'])]

Phew! Glad I didn’t have to go through that hassle on my own.

Extracting Survey Data

The dataset includes a survey about friendship and proximity among the participants. That means we have data on

  1. Who considers each other in the same “close group of friends.”
  2. An estimate on how much time at work each person spends with each other person.
  3. An estimate on how much time outside work each person spends with each other person.

This information comes in the form of three lists. These lists sit inside a different numpy.void sitting in the original matlab_object from the beginning of the post.

matlab_obj = scipy.io.loadmat(matlab_filename)
matlab_obj['network'][0][0]

Now the access rules (getNumeric, getArray) we used for the matlab_obj[‘s’] matrix don’t necessarily apply to the values in the ‘network’ matrix. The friendship survey is accessed by

friends = network['friends']

and it’s an array, each element of which is a participant’s response to the survey (claimed to be stored as a 0-1 array). So, for example, to get the number corresponding to whether id1 considers id2 a close friend, you would use

friends = network['friends'][id1][id2]

This is all straightforward, but there are two complications (stupid trivialities, really). The first is that the matrix is not 0-1 valued, but (nan-1.0)-valued. That’s right, if the individual didn’t consider someone a friend, the record for that is numpy’s nan type, and if they did it’s the float 1.0. So you need to either replace all nan’s with zeros ahead of time, or take account for that in any computations you do with the data.

I don’t know whether this is caused by scipy’s bad Matlab data import or the people producing the data, but it’s a clear reason to not use proprietary file formats to distribute data.

The second problem is that the survey is stored as an array, where the index of the array corresponds to the person the respondent is evaluating for friendship. The problem with this is that not all people are in the survey (for whatever reason, people dropping out or just bad study design), so they include a separate array in the network object called “sub_sort” (a totally unhelpful name). Sub_sort contains in its $i$-th position the hashNum (of the user from the ‘s’ matrix) of “survey-user” $i$ (that is the user whose answer is represented in index $ i$ of the network array).

This is incredibly confusing, so let’s explain it with an example. Say that we wanted to parse user 4’s friendship considerations. We’d first access his friends list by

friends = network['friends'][4]

Then, to see whether he considers user 7 as a friend, we’d have to look up user 7 in the sub_sort array (which requries a search through said array) and then use the index of user 7 in that array to index the friends list appropriately. So

subsortArray = network['sub_sort'][0]
user7sIndex = subsortArray.indexof(7)
network['friends'][4][user7sIndex]

The reason this is such terrible design is that an array is the wrong data structure to use. They should have used a list of dictionaries or a dictionary with ordered pairs of hashNums for keys. Yet another reason that matlab (a language designed for numerical matrix manipulation) is an abysmal tool for the job.

The other two survey items are surveys on proximity data. Specifically, they ask the respondent to report how much time per day he or she spends within 10 feet of each other individual in the study.

inLabProximity = network['lab']
outLabProximity = network['outlab']

The values in these matrices are accessed in exactly the same way, with the values being in the set {nan, 1.0, 2.0, 3.0, 4.0, 5.0}. These numbers correspond to various time ranges (the documentation is a bit awkward about this too, using “at least” in place of complete time ranges), but fine. It’s no different from the friends matrix, so at least it’s consistent.

Cell Tower Data

Having learned all of the gotchas I did, I processed the cell tower data with little fuss. The only bothersome bit was that they provide two “parallel” arrays, one called “locs” containing pairs of (date, cell tower id), and a second called “loc_ids” containing numbers corresponding to a “second” set of simplified cell tower ids. These records correspond to times when a user entered into a region covered by a specific tower, and so it gives very (coarse location) data. It’s not the lat/long values promised by the website, and even though the ids for the cell towers are governed by a national standard, the lat/long values for the towers are not publicly or freely available in database form.

Why they need two sets of cell tower ids is beyond me, but the bothersome part is that they didn’t include the dates in the second array. So you must assume the two arrays are in sync (that the events correspond elementwise by date), and if you want to use the simplified ids you still have to access the “locs” array for the dates (because what use would the tower ids be without dates?).

One can access these fields by ignoring our established rules for accessing subject fields (so inconsistent!). And so to access the parallel arrays in a reasonable way, I used the following:

towerEvents = list(zip(subject['locs'], subject['loc_ids']))

Again the datetimes are in matlab format, so the date conversion function above is used to convert them to python datetimes.

Update: As I’ve continued to work with this data set, I’ve found out that, once again contrary to the documentation, the tower id 1 corresponds to being out of signal. The documentation claims that tower id 0 implies no signal, and this is true of the “locs” array, but for whatever reason they decided to change it to a 1 in the “loc_ids” array. What’s confusing is that there’s no mention of the connection. You have to guess it and check that the lengths of the corresponding lists of events are equal (which is by no means proof that the towers ids are the same, but good evidence).

The documentation is also unclear as to whether the (datestamp, tower id) pair represents the time when the user started using that tower, or the time when the user finished using that tower. As such, I must guess that it’s the former at the risk of producing bad data (and hence bad research).

Conclusions

I love free data sets. I really do, and I appreciate all the work that researchers put into making and providing their data. But all the time I encounter the most horrible, terrible, no good, very bad data sets! RealityMining is only one small example, and it was actually nice because I was able to figure things out eventually. There are other datasets that have frustrated me much more with inconsistencies and complete absence of documentation.

Is it really that hard to make data useable? Here are the reasons why this was such an annoying and frustrating task:

  1. The data was stored in a proprietary format. Please please please release your data in plaintext files!
  2. The documentation contradicted the data it was documenting, and was altogether unclear. (One contradiction is too often. This was egregious.)
  3. There was even inconsistency within the data. Sometimes missing values were empty lists, sometimes numpy.nan, and sometimes simply missing.
  4. Scipy has awful defaults for the loadmat function.
  5. Numpy data structures are apparently designed for opacity unless you know the folklore and spend hours poring over documentation.

It seems that most of this could be avoided by fixing 1 and 2, since then it’s a problem that an automated tool like Google Refine can fix (though I admit to never having tried Refine, because it crashed on the only dataset I’ve ever given it). I know scientists have more important things to worry about, and I could just blame it on the 2004 origin of the data set, but in the age of reproducibility and big data (and seeing that this data set is widely used in social network research), it’s just as important to release the data as it is to make the data fit for consumption. RealityMining is like a math paper with complicated terms defined only by example, and theorems stated without proof that are actually false. It doesn’t uphold the quality that science must, it makes me question the choices made by other researchers who have had to deal with the dataset, and it wastes everyone else’s time.

In this respect, I’ve decided to post my Python code in a github repository, as is usual on this blog. You’ll have to get the actual RealityMining dataset yourself if you want to use it, but once you have it you can run my program (it depends only on the numpy/scipy stack and the standard library; tested in Python 3.3.3) to get the data I’m interested in or modify it for your own purposes.

Moral of the story: For the sake of science, please don’t distribute a database as a Matlab matrix! Or anything else for that matter!

The Erdős-Rényi Random Graph

During the 1950’s the famous mathematician Paul Erdős and Alfred Rényi put forth the concept of a random graph and in the subsequent years of study transformed the world of combinatorics. The random graph is the perfect example of a good mathematical definition: it’s simple, has surprisingly intricate structure, and yields many applications.

In this post we’ll explore basic facts about random graphs, slowly detail a proof on their applications to graph theory, and explore their more interesting properties computationally (a prelude to proofs about their structure). We assume the reader is familiar with the definition of a graph, which we’ve written about at length for non-mathematical audiences, and has some familiarity with undergraduate-level probability and combinatorics for the more mathematical sections of the post. We’ll do our best to remind the reader of these prerequisites as we go, and we welcome any clarification questions in the comment section.

The Erdős-Rényi Model

The definition of a random graph is simple enough that we need not defer it to the technical section of the article.

Definition: Given a positive integer $ n$ and a probability value $ 0 \leq p \leq 1$, define the graph $ G(n,p)$ to be the undirected graph on $ n$ vertices whose edges are chosen as follows. For all pairs of vertices $ v,w$ there is an edge $ (v,w)$ with probability $ p$.

Of course, there is no single random graph. What we’ve described here is a process for constructing a graph. We create a set of $ n$ vertices, and for each possible pair of vertices we flip a coin (often a biased coin) to determine if we should add an edge connecting them. Indeed, every graph can be made by this process if one is sufficiently lucky (or unlucky), but it’s very unlikely that we will have no edges at all if $ p$ is large enough. So $ G(n,p)$ is really a probability distribution over the set of all possible graphs on $ n$ vertices. We commit a horrendous abuse of notation by saying $ G$ or $ G(n,p)$ is a random graph instead of saying that $ G$ is sampled from the distribution. The reader will get used to it in time.

Why Do We Care?

Random graphs of all sorts (not just Erdős’s model) find applications in two very different worlds. The first is pure combinatorics, and the second is in the analysis of networks of all kinds.

In combinatorics we often wonder if graphs exist with certain properties. For instance, in graph theory we have the notion of graph colorability: can we color the vertices of a graph with $ k$ colors so that none of its edges are monochromatic? (See this blog’s primer on graph coloring for more) Indeed, coloring is known to be a very difficult problem on general graphs. The problem of determining whether a graph can be colored with a fixed number of colors has no known efficient algorithm; it is NP-complete. Even worse, much of our intuition about graphs fails for graph coloring. We would expect that sparse-looking graphs can be colored with fewer colors than dense graphs. One naive way to measure sparsity of a graph is to measure the length of its shortest cycle (recall that a cycle is a path which starts and ends at the same vertex). This measurement is called the girth of a graph. But Paul Erdős proved using random graphs, as we will momentarily, that for any choice of integers $ g,k$ there are graphs of girth $ \geq g$ which cannot be colored with fewer than $ k$ colors. Preventing short cycles in graphs doesn’t make coloring easier.

The role that random graphs play in this picture is to give us ways to ensure the existence of graphs with certain properties, even if we don’t know how to construct an example of such a graph. Indeed, for every theorem proved using random graphs, there is a theorem (or open problem) concerning how to algorithmically construct those graphs which are known to exist.

Pure combinatorics may not seem very useful to the real world, but models of random graphs (even those beyond the Erdős-Rényi model) are quite relevant. Here is a simple example. One can take a Facebook user $ u$ and form a graph of that users network of immediate friends $ N(u)$ (excluding $ u$ itself), where vertices are people and two people are connected by an edge if they are mutual friends; call this the user’s friendship neighborhood. It turns out that the characteristics of the average Facebook user’s friendship neighborhood resembles a random graph. So understanding random graphs helps us understand the structure of small networks of friends. If we’re particularly insightful, we can do quite useful things like identify anomalies, such as duplicitous accounts, which deviate quite far from the expected model. They can also help discover trends or identify characteristics that can allow for more accurate ad targeting. For more details on how such an idea is translated into mathematics and code, see Modularity (we plan to talk about modularity on this blog in the near future; lots of great linear algebra there!).

Random graphs, when they exhibit observed phenomena, have important philosophical consequences. From a bird’s-eye view, there are two camps of scientists. The first are those who care about leveraging empirically observed phenomena to solve problems. Many statisticians fit into this realm: they do wonderful magic with data fit to certain distributions, but they often don’t know and don’t care whether the data they use truly has their assumed properties. The other camp is those who want to discover generative models for the data with theoretical principles. This is more like theoretical physics, where we invent an arguably computational notion of gravity whose consequences explain our observations.

For applied purposes, the Erdős-Rényi random graph model is in the second camp. In particular, if something fits in the Erdős-Rényi model, then it’s both highly structured (as we will make clear in the sequel) and “essentially random.” Bringing back the example of Facebook, this says that most people in the average user’s immediate friendship neighborhood are essentially the same and essentially random in their friendships among the friends of $ u$. This is not quite correct, but it’s close enough to motivate our investigation of random graph models. Indeed, even Paul Erdős in his landmark paper mentioned that equiprobability among all vertices in a graph is unrealistic. See this survey for a more thorough (and advanced!) overview, and we promise to cover models which better represent social phenomena in the future.

So lets go ahead and look at a technical proof using random graphs from combinatorics, and then write some programs to generate random graphs.

Girth and Chromatic Number, and Counting Triangles

As a fair warning, this proof has a lot of moving parts. Skip to the next section if you’re eager to see some programs.

Say we have a $ k$ and a $ g$, and we wonder whether a graph can exist which simultaneously has no cycles of length less than $ g$ (the girth) and needs at least $ k$ colors to color. The following theorem settles this question affirmatively.  A bit of terminology: the chromatic number of a graph $ G$, denoted $ \chi(G)$, is the smallest number of colors needed to properly color $ G$.

Theorem: For any natural numbers $ k,g$, there exist graphs of chromatic number at least $ k$ and girth at least $ g$.

Proof. Taking our cue from random graphs, let’s see what the probability is that a random graph $ G(n,p)$ on $ n$ vertices will have our desired properties. Or easier, what’s the chance that it will not have the right properties? This is essentially a fancy counting argument, but it’s nicer if we phrase it in the language of probability theory.

The proof has a few twists and turns for those uninitiated to the probabilistic method of proof. First, we will look at an arbitrary $ G(n,p)$ (where $ n,p$ are variable) and ask two questions about it: what is the expected number of short cycles, and what is the expected “independence number” (which we will see is related to coloring). We’ll then pick a value of $ p$, depending crucially on $ n$, which makes both of these expectations small. Next, we’ll use the fact that if the probability that $ G(n,p)$ doesn’t have our properties is strictly less than 1, then there has to be some instance in our probability space which has those properties (if no instance had the property, then the probability would be one!). Though we will not know what the graphs look like, their existence is enough to prove the theorem.

So let’s start with cycles. If we’re given a desired girth of $ g$, the expected number of cycles of length $ \leq g$ in $ G(n,p)$ can be bounded by $ (np)^{g+1}/(np-1)$. To see this, the two main points are how to count the number of ways to pick $ k$ vertices in order to form a cycle, and a typical fact about sums of powers. Indeed, we can think of a cycle of length $ k$ as a way to seat a choice of $ k$ people around a circular table. There are $ \binom{n}{k}$ possible groups of people, and $ (k-1)!$ ways to seat one group. If we fix $ k$ and let $ n$ grow, then the product $ \binom{n}{k}(k-1)!$ will eventually be smaller than $ n^k$ (actually, this happens almost immediately in almost all cases). For each such choice of an ordering, the probability of the needed edges occurring to form a cycle is $ p^j$, since all edges must occur independently of each other with probability $ p$.

So the probability that we get a cycle of $ j$ vertices is

$ \displaystyle \binom{n}{j}(j-1)!p^j$

And by the reasoning above we can bound this by $ n^jp^j$. Summing over all numbers $ j = 3, \dots, g$ (we are secretly using the union bound), we bound the expected number of cycles of length $ \leq g$ from above:

$ \displaystyle \sum_{j=3}^g n^j p^j < \sum_{j=0}^g n^j p^j = \frac{(np)^{g+1}}{np – 1}$

Since we want relatively few cycles to occur, we want it to be the case that the last quantity, $ (np)^{j+1}/(np-1)$, goes to zero as $ n$ goes to infinity. One trick is to pick $ p$ depending on $ n$. If $ p = n^l$, our upper bound becomes $ n^{(l+1)(g+1)} / (n^{1+l} – 1)$, and if we want the quantity to tend to zero it must be that $ (l+1)(g+1) < 1$. Solving this we get that $ -1 < l < \frac{1}{g+1} – 1 < 0$. Pick such an $ l$ (it doesn’t matter which), and keep this in mind: for our choice of $ p$, the expected number of cycles goes to zero as $ n$ tends to infinity.

On the other hand, we want to make sure that such a graph has high chromatic number. To do this we’ll look at a related property: the size of the largest independent set. An independent set of a graph $ G = (V,E)$ is a set of vertices $ S \subset V$ so that there are no edges in $ E$ between vertices of $ S$. We call $ \alpha(G)$ the size of the largest independent set. The values $ \alpha(G)$ and $ \chi(G)$ are related, because any time you have an independent set you can color all the vertices with a single color. In particular, this proves the inequality $ \chi(G) \alpha(G) \geq n$, the number of vertices of $ G$, or equivalently $ \chi(G) \geq n / \alpha(G)$. So if we want to ensure $ \chi(G)$ is large, it suffices to show $ \alpha(G)$ is small (rigorously, $ \alpha(G) \leq n / k$ implies $ \chi(G) \geq k$).

The expected number of independent sets (again using the union bound) is at most the product of the number of possible independent sets and the probability of one of these having no edges. We let $ r$ be arbitrary and look at independent sets of size $ r$ Since there are $ \binom{n}{r}$ sets and each has a probability $ (1-p)^{\binom{r}{2}}$ of being independent, we get the probability that there is an independent set of size $ r$ is bounded by

$ \displaystyle \textup{P}(\alpha(G) \geq r) \leq \binom{n}{r}(1-p)^{\binom{r}{2}}$

We use the fact that $ 1-x < e^{-x}$ for all $ x$ to translate the $ (1-p)$ part. Combining this with the usual $ \binom{n}{r} \leq n^r$ we get the probability of having an independent set of size $ r$ at most

$ \displaystyle \textup{P}(\alpha(G) \geq r) \leq \displaystyle n^re^{-pr(r-1)/2}$

Now again we want to pick $ r$ so that this quantity goes to zero asymptotically, and it’s not hard to see that $ r = \frac{3}{p}\log(n)$ is good enough. With a little arithmetic we get the probability is at most $ n^{(1-a)/2}$, where $ a > 1$.

So now we have two statements: the expected number of short cycles goes to zero, and the probability that there is an independent set of size at least $ r$ goes to zero. If we pick a large enough $ n$, then the expected number of short cycles is less than $ n/5$, and using Markov’s inequality we see that the probability that there are more than $ n/2$ cycles of length at most $ g$ is strictly less than 1/2. At the same time, if we pick a large enough $ n$ then $ \alpha(G) \geq r$ with probability strictly less than 1/2. Combining these two (once more with the union bound), we get

$ \textup{P}(\textup{at least } n/2 \textup{ cycles of length } \leq g \textup{ and } \alpha(G) \geq r) < 1$

Now we can conclude that for all sufficiently large $ n$ there has to be a graph on at least $ n$ vertices which has neither of these two properties! Pick one and call it $ G$. Now $ G$ still has cycles of length $ \leq g$, but we can fix that by removing a vertex from each short cycle (it doesn’t matter which). Call this new graph $ G’$. How does this operation affect independent sets, i.e. what is $ \alpha(G’)$? Well removing vertices can only decrease the size of the largest independent set. So by our earlier inequality, and calling $ n’$ the number of vertices of $ G’$, we can make a statement about the chromatic number:

$ \displaystyle \chi(G’) \geq \frac{n’}{\alpha(G’)} \geq \frac{n/2}{\log(n) 3/p} = \frac{n/2}{3n^l \log(n)} = \frac{n^{1-l}}{6 \log(n)}$

Since $ -1 < l < 0$ the numerator grows asymptotically faster than the denominator, and so for sufficiently large $ n$ the chromatic number will be larger than any $ k$ we wish. Hence we have found a graph with girth at least $ g$ and chromatic number at least $ k$.

$ \square$

Connected Components

The statistical properties of a random graph are often quite easy to reason about. For instance, the degree of each vertex in $ G(n,p)$ is $ np$ in expectation. Local properties like this are easy, but global properties are a priori very mysterious. One natural question we can ask in this vein is: when is $ G(n,p)$ connected? We would very much expect the answer to depend on how $ p$ changes in relation to $ n$. For instance, $ p$ might look like $ p(n) = 1/n^2$ or $ \log(n) / n$ or something similar. We could ask the following question:

As $ n$ tends to infinity, what limiting proportion of random graphs $ G(n,p)$ are connected?

Certainly for some $ p(n)$ which are egregiously small (for example, $ p(n) = 0$), the answer will be that no graphs are connected. On the other extreme, if $ p(n) = 1$ then all graphs will be connected (and complete graphs, at that!). So our goal is to study the transition phase between when the graphs are disconnected and when they are connected. A priori this boundary could be a gradual slope, where the proportion grows from zero to one, or it could be a sharp jump. Next time, we’ll formally state and prove the truth, but for now let’s see if we can figure out which answer to expect by writing an exploratory program.

We wrote the code for this post in Python, and as usual it is all available for download on this blog’s Github page.

We start with a very basic Node class to represent each vertex in a graph, and a function to generate random graphs

import random
class Node:
   def __init__(self, index):
      self.index = index
      self.neighbors = []

   def __repr__(self):
      return repr(self.index)

def randomGraph(n,p):
   vertices = [Node(i) for i in range(n)]
   edges = [(i,j) for i in xrange(n) for j in xrange(i) if random.random() &lt; p]

   for (i,j) in edges:
      vertices[i].neighbors.append(vertices[j])
      vertices[j].neighbors.append(vertices[i])

   return vertices

The randomGraph function simply creates a list of edges chosen uniformly at random from all possible edges, and then constructs the corresponding graph. Next we have a familiar sight: the depth-first search. We use it to compute the graph components one by one (until all vertices have been found in some run of a DFS).

def dfsComponent(node, visited):
   for v in node.neighbors:
      if v not in visited:
         visited.add(v)
         dfsComponent(v, visited)

def connectedComponents(vertices):
   components = []
   cumulativeVisited = set()

   for v in vertices:
      if v not in cumulativeVisited:
        componentVisited = set([v])
        dfsComponent(v, componentVisited)

        components.append(componentVisited)
        cumulativeVisited |= componentVisited

   return components

The dfsComponent function simply searches in breadth-first fashion, adding every vertex it finds to the “visited” set.  The connectedComponents function keeps track of the list of components found so far, as well as the cumulative set of all vertices found in any run of bfsComponent. Hence, as we iterate through the vertices we can ignore vertices we’ve found in previous runs of bfsComponent. The “x |= y” notation is python shorthand for updating x via a union with y.

Finally, we can make a graph of the largest component of (independently generated) random graphs as the probability of an edge varies.

def sizeOfLargestComponent(vertices):
   return max(len(c) for c in connectedComponents(vertices))

def graphLargestComponentSize(n, theRange):
   return [(p, sizeOfLargestComponent(randomGraph(n, p))) for p in theRange]

Running this code and plotting it for $ p$ varying from zero to 0.5 gives the following graph.

zoomedout-50-1000

The blue line is the size of the largest component, and the red line gives a moving average estimate of the data.  As we can see, there is a very sharp jump peaking at $ p=0.1$ at which point the whole graph is connected. It would appear there is a relatively quick “phase transition” happening here. Let’s zoom in closer on the interesting part.

zoomedin-50-1000

It looks like the transition begins around 0.02, and finishes at around 0.1. Interesting… Let’s change the parameters a bit, and increase the size of the graph. Here’s the same chart (in the same range of $ p$ values) for a graph with a hundred vertices.

zoomedin-100-1000

Now the phase transition appears to have shifted to about $ (0.01, 0.05)$, which is about multiplying the endpoints of the phase transition interval above by 1/2. The plot thickens… Once more, let’s move up to a graph on 500 vertices.

zoomedin-5000-1000

Again it’s too hard to see, so let’s zoom in.

zoomedin-500-1000

This one looks like the transition starts at 0.002 and ends at 0.01. This is a 5-fold decrease from the previous one, and we increased the number of vertices by 5. Could this be a pattern? Here’s a conjecture to formalize it:

Conjecture: The random graph $ G(n,p)$ enters a phase transition at $ p=1/n$ and becomes connected almost surely at $ p=5/n$.

This is not quite rigorous enough to be a true conjecture, but it sums up our intuition that we’ve learned so far. Just to back this up even further, here’s an animation showing the progression of the phase transition as $ n = 20 \dots 500$ in steps of twenty. Note that the $ p$ range is changing to maintain our conjectured window.

phase-transition-n-grows

Looks pretty good. Next time we’ll see some formal mathematics validating our intuition (albeit reformulated in a nicer way), and we’ll continue to investigate other random graph models.

Until then!