# Principal Component Analysis

Problem: Reduce the dimension of a data set, translating each data point into a representation that captures the “most important” features.

Solution: (in Python)

```python
import numpy

def principalComponents(matrix):
    # Columns of matrix correspond to data points, rows to dimensions.
    deviationMatrix = (matrix.T - numpy.mean(matrix, axis=1)).T
    covarianceMatrix = numpy.cov(deviationMatrix)
    eigenvalues, principalComponents = numpy.linalg.eig(covarianceMatrix)

    # Sort the principal components in decreasing order of corresponding eigenvalue.
    indexList = numpy.argsort(-eigenvalues)
    eigenvalues = eigenvalues[indexList]
    principalComponents = principalComponents[:, indexList]

    return eigenvalues, principalComponents
```
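For instance, one might call the function on a small two-dimensional dataset as follows (the numbers here are made up for illustration, and the function definition is repeated so the snippet runs on its own):

```python
import numpy

def principalComponents(matrix):
    # Columns of matrix correspond to data points, rows to dimensions.
    deviationMatrix = (matrix.T - numpy.mean(matrix, axis=1)).T
    covarianceMatrix = numpy.cov(deviationMatrix)
    eigenvalues, components = numpy.linalg.eig(covarianceMatrix)
    indexList = numpy.argsort(-eigenvalues)
    return eigenvalues[indexList], components[:, indexList]

# Hypothetical dataset: columns are data points, rows are dimensions.
data = numpy.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
])

eigenvalues, components = principalComponents(data)

# Eigenvalues come back sorted in decreasing order, and each column of
# `components` is a unit-length principal direction.
print(eigenvalues)
print(components)
```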


Discussion: The problem of reducing the dimension of a dataset in a meaningful way shows up all over modern data analysis. Sophisticated techniques are used to select the most important dimensions, and even more sophisticated techniques are used to reason about what it means for a dimension to be “important.”

One way to solve this problem is to compute the principal components of a dataset. For the method of principal components, “important” is interpreted as the direction of largest variability. Note that these “directions” are vectors which may incorporate a portion of many or all of the “standard” dimensions in question. For instance, the picture below obviously has two intrinsic dimensions different from the standard axes.

The regular reader of this blog may recognize this idea from our post on eigenfaces. Indeed, eigenfaces are simply the principal components of a dataset of face images. We will briefly discuss how the algorithm works here, but leave the why to the post on eigenfaces. The crucial interpretation to make is that finding principal components amounts to a linear transformation of the data (that is, only such operations as rotation, translation, scaling, shear, etc. are allowed) which overlays the black arrows above on the standard axes. In the parlance of linear algebra, we’re re-plotting the data with respect to a convenient orthonormal basis of eigenvectors.

Here we first represent the dataset as a matrix whose columns are the data points, and whose rows represent the different dimensions. For example, if it were financial data then the columns might be the instances of time at which the data was collected, and the rows might represent the prices of the commodities recorded at those times. From here we compute two things: the average data point, and the deviation of each data point from that average. This is done in the line computing deviationMatrix above, where the arithmetic operations are entrywise (a convenient feature of Python’s numpy library).

Next, we compute the covariance matrix for the data points. That is, interpreting each dimension as a random variable and the data points as observations of that random variable, we want to compute how the different dimensions are correlated. One way to estimate this from a sample is to compute the dot products of the deviation vectors and divide by the number of data points; numpy’s cov function computes the closely related unbiased estimate, dividing by $n - 1$ instead of $n$. For more details, see the Wikipedia entry on sample covariance.
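As a sanity check, this formula can be computed by hand and compared against numpy.cov; the small dataset below is made up for illustration.

```python
import numpy

# Hypothetical data: 3 dimensions (rows), 5 data points (columns).
matrix = numpy.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 6.0],
    [0.5, 0.5, 1.5, 1.0, 2.0],
])

n = matrix.shape[1]
deviations = (matrix.T - numpy.mean(matrix, axis=1)).T

# Entry (i, j) is the dot product of the i-th and j-th deviation rows,
# divided by n - 1 (the unbiased sample covariance).
byHand = deviations.dot(deviations.T) / (n - 1)

assert numpy.allclose(byHand, numpy.cov(matrix))
```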

Now (again, for reasons which we detail in our post on eigenfaces), the eigenvectors of this covariance matrix point in the directions of maximal variance, and the magnitudes of the eigenvalues correspond to the magnitudes of the variance. Even more, regarding the dimensions as random variables, the correlation between the axes of this new representation is zero! This is part of why this method is so powerful; it represents the data in terms of unrelated features. One downside is that the principal component features may have no tractable interpretation in terms of real-life phenomena.

Finally, one common thing to do is only use the first few principal components, where by ‘first’ we mean those whose corresponding eigenvalues are the largest. Then one projects the original data points onto the chosen principal components, thus controlling precisely the dimension the data is reduced to. One important question is: how does one decide how many principal components to use?

Because the principal components with larger eigenvalues correspond to features with more variability, one can compute the total variation accounted for by a given set of principal components. Here, the ‘total variation’ is the sum of the variances of the individual random variables (that is, the trace of the covariance matrix, which is the sum of its eigenvalues). Since each eigenvalue gives the variance along its principal component, we can naturally express the accounted variation as a proportion. Specifically, if $\lambda_1, \dots, \lambda_k$ are the eigenvalues corresponding to the chosen principal components, and $\textup{tr}(A)$ is the trace of the covariance matrix, then the total variation covered by the chosen principal components is simply $(\lambda_1 + \dots + \lambda_k)/\textup{tr}(A)$.
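This criterion is easy to turn into code. The sketch below picks the smallest $k$ covering a target proportion of the variation and then projects the data onto the first $k$ components; the 90% threshold and the random placeholder data are arbitrary choices for illustration.

```python
import numpy

def principalComponents(matrix):
    # Columns of matrix correspond to data points, rows to dimensions.
    deviationMatrix = (matrix.T - numpy.mean(matrix, axis=1)).T
    covarianceMatrix = numpy.cov(deviationMatrix)
    eigenvalues, components = numpy.linalg.eig(covarianceMatrix)
    indexList = numpy.argsort(-eigenvalues)
    return eigenvalues[indexList], components[:, indexList]

numpy.random.seed(0)
data = numpy.random.rand(3, 50)  # placeholder data: 3 dimensions, 50 points

eigenvalues, components = principalComponents(data)

# Proportion of total variation captured by the first k components:
# (lambda_1 + ... + lambda_k) / tr(A), where tr(A) = sum of all eigenvalues.
proportions = numpy.cumsum(eigenvalues) / numpy.sum(eigenvalues)

# Smallest k capturing at least 90% of the total variation.
k = int(numpy.searchsorted(proportions, 0.9)) + 1

# Project the mean-centered data onto the first k principal components.
centered = (data.T - numpy.mean(data, axis=1)).T
reduced = components[:, :k].T.dot(centered)  # shape: (k, number of points)
```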

In many cases of high-dimensional data, one can encapsulate more than 90% of the total variation using a small fraction of the principal components. In our post on eigenfaces we used a relatively homogeneous dataset of images; our recognition algorithm performed quite well using only about 20 out of 36,000 principal components. Note that there were also some linear algebra tricks to compute only those principal components which had nonzero eigenvalues. In any case, it is clear that if the data is nice enough, principal component analysis is a very powerful tool.

# Streaming Median

Problem: Compute a reasonable approximation to a “streaming median” of a potentially infinite sequence of integers.

Solution: (in Python)

```python
def streamingMedian(seq):
    seq = iter(seq)
    m = 0

    for nextElt in seq:
        if m > nextElt:
            m -= 1
        elif m < nextElt:
            m += 1

        yield m
```


Discussion: Before we discuss the details of the Python implementation above, we should note a few things.

First, because the input sequence is potentially infinite, we can’t store any amount of information that grows with the length of the sequence. Even though storing something like $O(\log(n))$ integers would be reasonable for the real world (note that the base-2 log of a petabyte is about 50), we should not let that stop us from shooting for the ideal $O(1)$ space bound, and exploring what sorts of solutions arise under that constraint. For the record, I don’t know of any algorithms to compute the true streaming median which require $O(\log(n))$ space, and I would be very interested to see one.

Second, we should note the motivation for this problem. If the process generating the stream of numbers doesn’t change over time, then one can find a reasonably good approximation to the median of the entire sequence using only a sufficiently large, but finite prefix of the sequence. But if the process does change, then all bets are off. We need an algorithm which can compensate for potentially wild changes in the statistical properties of the sequence. It’s unsurprising that such a naive algorithm would do the trick, because it can’t make any assumptions.

In words, the algorithm works as follows: start with some initial guess for the median $m$. For each element $x$ in the sequence, add one to $m$ if $m$ is less than $x$; subtract one if $m$ is greater than $x$, and do nothing otherwise.

In the Python above, we make use of generators to represent infinite sequences of data. A generator is an object with some iterable state, which yields some value at each step. The simplest possible non-trivial generator generates the natural numbers $\mathbb{N}$:

```python
def naturalNumbers():
    n = 1
    while True:
        yield n
        n += 1
```

What Python does with this function is translate it into a generator object “g” which works as follows: when something calls next(g), the function computes as usual until it reaches a yield statement; then it hands back the yielded value, saves all of its internal state, and returns control to the caller. The next time next(g) is called, this process repeats. A generator can be infinite or finite, and it terminates the iteration process (by raising StopIteration) when the function “falls off the end,” just as a function with no return statement implicitly returns None.
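One can see this process by driving the generator by hand with next:

```python
def naturalNumbers():
    n = 1
    while True:
        yield n
        n += 1

g = naturalNumbers()
print(next(g))  # 1
print(next(g))  # 2
print(next(g))  # 3
```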

We should note as well that Python knows how to handle generators with the usual “for … in …” language form. This makes it extremely handy, because programmers don’t have to care whether they’re using a list or an iterator; the syntax to work with them is identical.

Now the “streamingMedian” function we began with accepts as input a generator (or any iterable object, which it converts to a generator with the “iter” function). It then computes the streaming median of that generator as part of another generator, so that one can call it, e.g., with the code:

```python
for medianSoFar in streamingMedian(naturalNumbers()):
    print(medianSoFar)
```
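Since naturalNumbers is infinite, the loop above never terminates on its own; to inspect only a finite prefix of the output, itertools.islice comes in handy. A self-contained sketch:

```python
from itertools import islice

def naturalNumbers():
    n = 1
    while True:
        yield n
        n += 1

def streamingMedian(seq):
    m = 0
    for nextElt in iter(seq):
        if m > nextElt:
            m -= 1
        elif m < nextElt:
            m += 1
        yield m

# First ten approximate medians of the stream 1, 2, 3, ...
print(list(islice(streamingMedian(naturalNumbers()), 10)))
# prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```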

# Holidays and Homicide

## A Study In Data

Just before midnight on Thanksgiving, there was a murder by gunshot about four blocks from my home. Luckily I was in bed by then, but all of the commotion over the incident got me thinking: is murder disproportionately more common on Thanksgiving? What about Christmas, Valentine’s Day, or Saint Patrick’s Day?

Of course, with the right data set these are the kinds of questions one can answer! After an arduous Google search for agreeable data sets, I came across this project called the History of Violence Database (perhaps the most ominous database name ever!). Unfortunately most of the work in progress there puts the emphasis on history, cataloging homicide incidents only earlier than the 1920s.

But one page I came across contained a complete list of the dates of each known homicide in San Francisco from 1849-2003. What’s more, it is available in a simple, comma-delimited format. (To all data compilers everywhere: this is how data should be transferred! Don’t only provide it in Excel, or SPSS, or whatever proprietary software you might use. Text files are universally accessible.)

With a little sed preprocessing and some Mathematica magic, I whipped together this chart of homicide counts by day of the year (click on the image to get a larger view):
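For readers who prefer to avoid sed and Mathematica, the tallying step is easy to sketch in Python. The record layout assumed below (one MM/DD/YYYY date string per homicide) is a guess for illustration and may not match the actual file:

```python
from collections import Counter

def countsByDayOfYear(dates):
    # dates: iterable of "MM/DD/YYYY" strings, one per homicide record.
    counts = Counter()
    for date in dates:
        month, day, _year = date.split("/")
        counts[month + "/" + day] += 1
    return counts

# Hypothetical records:
sample = ["01/01/1920", "01/01/1935", "12/25/1950", "01/05/1960"]
print(countsByDayOfYear(sample))
```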

Homicides in San Francisco, 1849-2003, organized by day of the year.

Here the red grid lines mark the highest ten homicide counts, the green grid lines mark the lowest ten, and the blue lines mark specific holidays. Some of the blue and red lines cover the same date, and so in that case the red line is the one displayed. Finally, the horizontal yellow line represents the median homicide count of 19, so one can compare individual dates against the median.

Now it would be a terrible thing to “conclude” something about general human behavior from this particular data set. But I’m going to do it anyway because it’s fun, and it lets me make fascinating and controversial observations. Plus, I need interesting tidbits of information to drop at parties with math and statistics students.

Here are my observations:

• There is a correlation between some holidays and homicide.
• New Year’s Day is by far the most violent day of the year, followed by Christmas Day. On the other hand, Christmas Eve is only slightly above average.
• The safest day of the year is January 5th. Having no other special recognition, it should be deemed National Peace Day, or at least National Too Pooped from New Year’s to Commit Murder Day.
• New Year’s Day (likely, Morning) is not dangerous because of excessive alcohol consumption alone. If it were, Saint Patrick’s Day would surely be similar. Although to be fair, one should compare this with the same statistic for Dublin.
• None of the following holidays are significantly more dangerous than the average day: Groundhog Day, Valentine’s Day, St. Patrick’s Day, April Fool’s Day, Cinco de Mayo, my birthday (August 28th), Halloween, Veteran’s Day, and New Year’s Eve (before midnight).
• On the other hand, the days following July 4th are quite vicious, even though July 3rd is very quiet.
• The days near November 24th are particularly violent because Thanksgiving (and Black Friday?) often fall on those days. Family gatherings are clearly high-risk events.
• Last-minute Christmas shopping (Dec. 13th, 15th) obviously brings out the rage in everyone.
• March and October are the most docile months, while January, June, July, and December are the worst. Murder happens more often in the usual vacation months than at other times of the year.

Of course, some of the above points are meant to be facetious, but they bring up some interesting questions. Why does homicide show up more often during certain months of the year? It’s well known that most murder victims are personal acquaintances of the assailant (specifically, friends and family members); does this mean that family time rejuvenates or fuels strife? Are people more stressed during these times of year? Are June and July more violent simply because heat aggravates people, or at least gets them to interact with others more?

Unfortunately these aren’t the kinds of questions that lend themselves to mathematical proof, and as such I can’t give a definitive answer. But feel free to voice your own opinion in the comments!

For your viewing pleasure, here are the homicide counts on each day for the top and bottom 67 days, respectively, in ascending order:

highest:
{{"04/05", 24}, {"04/19", 24}, {"05/15", 24}, {"05/24", 24},
{"05/25", 24}, {"06/28", 24}, {"08/05", 24}, {"08/17", 24},
{"09/01", 24}, {"09/03", 24}, {"09/19", 24}, {"09/20", 24},
{"10/26", 24}, {"11/02", 24}, {"11/23", 24}, {"02/01", 25},
{"02/08", 25}, {"02/11", 25}, {"02/15", 25}, {"04/21", 25},
{"05/07", 25}, {"06/04", 25}, {"06/07", 25}, {"07/24", 25},
{"08/01", 25}, {"08/25", 25}, {"09/07", 25}, {"10/21", 25},
{"10/23", 25}, {"11/03", 25}, {"11/10", 25}, {"11/27", 25},
{"01/06", 26}, {"01/18", 26}, {"03/01", 26}, {"03/24", 26},
{"05/30", 26}, {"07/04", 26}, {"07/29", 26}, {"08/09", 26},
{"09/21", 26}, {"11/09", 26}, {"11/25", 26}, {"12/06", 26},
{"05/02", 27}, {"09/06", 27}, {"09/24", 27}, {"10/11", 27},
{"11/08", 27}, {"12/12", 27}, {"12/20", 27}, {"07/18", 28},
{"09/04", 28}, {"12/02", 28}, {"05/18", 29}, {"07/01", 29},
{"07/07", 29}, {"10/27", 29}, {"11/24", 29}, {"01/28", 30},
{"01/24", 31}, {"07/05", 31}, {"08/02", 31}, {"12/15", 31},
{"12/13", 32}, {"12/25", 36}, {"01/01", 43}}
lowest:
{{"01/05", 6}, {"02/29", 6}, {"06/20", 7}, {"07/03", 7},
{"04/15", 8}, {"02/03", 9}, {"02/10", 9}, {"09/16", 9},
{"03/04", 10}, {"05/16", 10}, {"09/29", 10}, {"10/12", 10},
{"12/16", 10}, {"02/04", 11}, {"05/17", 11}, {"05/23", 11},
{"06/06", 11}, {"06/12", 11}, {"07/11", 11}, {"09/22", 11},
{"11/12", 11}, {"11/16", 11}, {"12/19", 11}, {"03/13", 12},
{"03/20", 12}, {"05/21", 12}, {"05/29", 12}, {"07/14", 12},
{"08/13", 12}, {"09/05", 12}, {"10/28", 12}, {"11/22", 12},
{"01/19", 13}, {"02/06", 13}, {"02/09", 13}, {"04/22", 13},
{"04/28", 13}, {"05/04", 13}, {"05/11", 13}, {"08/27", 13},
{"09/15", 13}, {"10/03", 13}, {"10/13", 13}, {"12/03", 13},
{"01/10", 14}, {"01/26", 14}, {"01/31", 14}, {"02/13", 14},
{"02/18", 14}, {"02/23", 14}, {"03/14", 14}, {"03/15", 14},
{"03/26", 14}, {"03/31", 14}, {"04/11", 14}, {"05/12", 14},
{"05/28", 14}, {"06/19", 14}, {"06/24", 14}, {"07/20", 14},
{"07/31", 14}, {"08/10", 14}, {"10/22", 14}, {"12/22", 14},
{"01/07", 15}, {"01/14", 15}, {"01/16", 15}}

Unfortunately many holidays are defined year by year. Thanksgiving, Easter, Mother’s Day, Father’s Day, MLK Jr. Day, Memorial Day, and the Chinese New Year all can’t be analyzed with this chart because they fall on different days each year. It would not be very difficult to organize this data set according to the rules of those special holidays. We leave it as an exercise to the reader to do so. Remember that the data and Mathematica notebook used in this post are available from this blog’s Google Code page.

And of course, it would be quite amazing to find a data set which provides the dates of (even the last fifty years of) homicide history in the entire US, and not just one city. If any readers know of such a data set, please let me know.

Until next time!