Big Dimensions, and What You Can Do About It

Data is abundant, data is big, and big is a problem. Let me start with an example. Let’s say you have a list of movie titles and you want to learn their genre: romance, action, drama, etc. And maybe in this scenario IMDB doesn’t exist so you can’t scrape the answer. Well, the title alone is almost never enough information. One nice way to get more data is to do the following:

  1. Pick a large dictionary of words, say the most common 100,000 non stop-words in the English language.
  2. Crawl the web looking for documents that include the title of a film.
  3. For each film, record the counts of all other words appearing in those documents.
  4. Maybe remove instances of “movie” or “film,” etc.

After this process you have a length-100,000 vector of integers associated with each movie title. IMDB’s database has around 1.5 million listed movies, and if we have a 32-bit integer per vector entry, that’s 600 GB of data to get every movie.

One way to try to find genres is to cluster this (unlabeled) dataset of vectors, and then manually inspect the clusters and assign genres. With a really fast computer we could simply run an existing clustering algorithm on this dataset and be done. Of course, clustering 600 GB of data takes a long time, but there’s another problem. The geometric intuition that we use to design clustering algorithms degrades as the length of the vectors in the dataset grows. As a result, our algorithms perform poorly. This phenomenon is called the “curse of dimensionality” (“curse” isn’t a technical term), and we’ll return to the mathematical curiosities shortly.

A possible workaround is to try to come up with faster algorithms or be more patient. But a more interesting mathematical question is the following:

Is it possible to condense high-dimensional data into smaller dimensions and retain the important geometric properties of the data?

This goal is called dimension reduction. Indeed, all of the chatter on the internet is bound to encode redundant information, so for our movie title vectors it seems the answer should be “yes.” But the questions remain, how does one find a low-dimensional condensification? (Condensification isn’t a word, the right word is embedding, but embedding is overloaded so we’ll wait until we define it) And what mathematical guarantees can you prove about the resulting condensed data? After all, it stands to reason that different techniques preserve different aspects of the data. Only math will tell.

In this post we’ll explore this so-called “curse” of dimensionality, explain the formality of why it’s seen as a curse, and implement a wonderfully simple technique called “the random projection method” which preserves pairwise distances between points after the reduction. As usual, and all the code, data, and tests used in the making of this post are on Github.

Some curious issues, and the “curse”

We start by exploring the curse of dimensionality with experiments on synthetic data.

In two dimensions, take a circle centered at the origin with radius 1 and its bounding square.

circle.png

The circle fills up most of the area in the square, in fact it takes up exactly \pi out of 4 which is about 78%. In three dimensions we have a sphere and a cube, and the ratio of sphere volume to cube volume is a bit smaller, 4 \pi /3 out of a total of 8, which is just over 52%. What about in a thousand dimensions? Let’s try by simulation.

import random

def randUnitCube(n):
   return [(random.random() - 0.5)*2 for _ in range(n)]

def sphereCubeRatio(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   return sum(1 for x in randomSample if sum(a**2 for a in x) <= 1) / numSamples 

The result is as we computed for small dimension,

 >>> sphereCubeRatio(2,10000)
0.7857
>>> sphereCubeRatio(3,10000)
0.5196

And much smaller for larger dimension

>>> sphereCubeRatio(20,100000) # 100k samples
0.0
>>> sphereCubeRatio(20,1000000) # 1M samples
0.0
>>> sphereCubeRatio(20,2000000)
5e-07

Forget a thousand dimensions, for even twenty dimensions, a million samples wasn’t enough to register a single random point inside the unit sphere. This illustrates one concern, that when we’re sampling random points in the d-dimensional unit cube, we need at least 2^d samples to ensure we’re getting a even distribution from the whole space. In high dimensions, this face basically rules out a naive Monte Carlo approximation, where you sample random points to estimate the probability of an event too complicated to sample from directly. A machine learning viewpoint of the same problem is that in dimension d you will usually require 2^d samples in order to infer anything useful.

Luckily, we can answer our original question because there is a known formula for the volume of a sphere in any dimension. Rather than give the closed form formula, which involves the gamma function and is incredibly hard to parse, we’ll state the recursive form. Call V_i the volume of the unit sphere in dimension i. Then V_0 = 1 by convention, V_1 = 2 (it’s an interval), and V_n = \frac{2 \pi V_{n-2}}{n}. If you unpack this recursion you can see that the numerator looks like (2\pi)^{n/2} and the denominator looks like a factorial, except it skips every other number. So an even dimension would look like 2 \cdot 4 \cdot \dots \cdot n, and this grows larger than a fixed exponential. So in fact the total volume of the sphere vanishes as the dimension grows! (In addition to the ratio vanishing!)

def sphereVolume(n):
   values = [0] * (n+1)
   for i in range(n+1):
      if i == 0:
         values[i] = 1
      elif i == 1:
         values[i] = 2
      else:
         values[i] = 2*math.pi / i * values[i-2]

   return values[-1]

This should be counterintuitive. I think most people would guess, when asked about how the volume of the unit sphere changes as the dimension grows, that it stays the same or gets bigger.  But at a hundred dimensions, the volume is already getting too small to fit in a float.

>>> sphereVolume(20)
0.025806891390014047
>>> sphereVolume(100)
2.3682021018828297e-40
>>> sphereVolume(1000)
0.0

The scary thing is not just that this value drops, but that it drops exponentially quickly. A consequence is that, if you’re trying to cluster data points by looking at points within a fixed distance r of one point, you’ll have to make r exponentially large in the dimension.

Here’s a related issue. Say I take a bunch of points generated uniformly at random in the unit cube.

from itertools import combinations

def distancesRandomPoints(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   pairwiseDistances = [dist(x,y) for (x,y) in combinations(randomSample, 2)]
   return pairwiseDistances

In two dimensions, the histogram of distances between points looks like this

2d-distances.png

However, as the dimension grows the distribution of distances changes. It evolves like the following animation, in which each frame is an increase in dimension from 2 to 100.

distances-animation.gif

The shape of the distribution doesn’t appear to be changing all that much after the first few frames, but the center of the distribution tends to infinity (in fact, it grows like \sqrt{n}). The variance also appears to stay constant. This chart also becomes more variable as the dimension grows, again because we should be sampling exponentially many more points as the dimension grows (but we don’t). In other words, as the dimension grows the average distance grows and the tightness of the distribution stays the same. So at a thousand dimensions the average distance is about 26, tightly concentrated between 24 and 28. When the average is a thousand, the distribution is tight between 998 and 1002. If one were to normalize this data, it would appear that random points are all becoming equidistant from each other.

So in addition to the issues of runtime and sampling, the geometry of high-dimensional space looks different from what we expect. To get a better understanding of “big data,” we have to update our intuition from low-dimensional geometry with analysis and mathematical theorems that are much harder to visualize.

The Johnson-Lindenstrauss Lemma

Now we turn to proving dimension reduction is possible. There are a few methods one might first think of, such as look for suitable subsets of coordinates, or sums of subsets, but these would all appear to take a long time or they simply don’t work.

Instead, the key technique is to take a random linear subspace of a certain dimension, and project every data point onto that subspace. No searching required. The fact that this works is called the Johnson-Lindenstrauss Lemma. To set up some notation, we’ll call d(v,w) the usual distance between two points.

Lemma [Johnson-Lindenstrauss (1984)]: Given a set X of n points in \mathbb{R}^d, project the points in X to a randomly chosen subspace of dimension c. Call the projection \rho. For any \varepsilon > 0, if c is at least \Omega(\log(n) / \varepsilon^2), then with probability at least 1/2 the distances between points in X are preserved up to a factor of (1+\varepsilon). That is, with good probability every pair v,w \in X will satisfy

\displaystyle \| v-w \|^2 (1-\varepsilon) \leq \| \rho(v) - \rho(w) \|^2 \leq \| v-w \|^2 (1+\varepsilon)

Before we do the proof, which is quite short, it’s important to point out that the target dimension c does not depend on the original dimension! It only depends on the number of points in the dataset, and logarithmically so. That makes this lemma seem like pure magic, that you can take data in an arbitrarily high dimension and put it in a much smaller dimension.

On the other hand, if you include all of the hidden constants in the bound on the dimension, it’s not that impressive. If your data have a million dimensions and you want to preserve the distances up to 1% (\varepsilon = 0.01), the bound is bigger than a million! If you decrease the preservation \varepsilon to 10% (0.1), then you get down to about 12,000 dimensions, which is more reasonable. At 45% the bound drops to around 1,000 dimensions. Here’s a plot showing the theoretical bound on c in terms of \varepsilon for n fixed to a million.

boundplot

 

But keep in mind, this is just a theoretical bound for potentially misbehaving data. Later in this post we’ll see if the practical dimension can be reduced more than the theory allows. As we’ll see, an algorithm run on the projected data is still effective even if the projection goes well beyond the theoretical bound. Because the theorem is known to be tight in the worst case (see the notes at the end) this speaks more to the robustness of the typical algorithm than to the robustness of the projection method.

A second important note is that this technique does not necessarily avoid all the problems with the curse of dimensionality. We mentioned above that one potential problem is that “random points” are roughly equidistant in high dimensions. Johnson-Lindenstrauss actually preserves this problem because it preserves distances! As a consequence, you won’t see strictly better algorithm performance if you project (which we suggested is possible in the beginning of this post). But you will alleviate slow runtimes if the runtime depends exponentially on the dimension. Indeed, if you replace the dimension d with the logarithm of the number of points \log n, then 2^d becomes linear in n, and 2^{O(d)} becomes polynomial.

Proof of the J-L lemma

Let’s prove the lemma.

Proof. To start we make note that one can sample from the uniform distribution on dimension-c linear subspaces of \mathbb{R}^d by choosing the entries of a c \times d matrix A independently from a normal distribution with mean 0 and variance 1. Then, to project a vector x by this matrix (call the projection \rho), we can compute

\displaystyle \rho(x) = \frac{1}{\sqrt{c}}A x

Now fix \varepsilon > 0 and fix two points in the dataset x,y. We want an upper bound on the probability that the following is false

\displaystyle \| x-y \|^2 (1-\varepsilon) \leq \| \rho(x) - \rho(y) \|^2 \leq \| x-y \|^2 (1+\varepsilon)

Since that expression is a pain to work with, let’s rearrange it by calling u = x-y, and rearranging (using the linearity of the projection) to get the equivalent statement.

\left | \| \rho(u) \|^2 - \|u \|^2 \right | \leq \varepsilon \| u \|^2

And so we want a bound on the probability that this event does not occur, meaning the inequality switches directions.

Once we get such a bound (it will depend on c and \varepsilon) we need to ensure that this bound is true for every pair of points. The union bound allows us to do this, but it also requires that the probability of the bad thing happening tends to zero faster than 1/\binom{n}{2}. That’s where the \log(n) will come into the bound as stated in the theorem.

Continuing with our use of u for notation, define X to be the random variable \frac{c}{\| u \|^2} \| \rho(u) \|^2. By expanding the notation and using the linearity of expectation, you can show that the expected value of X is c, meaning that in expectation, distances are preserved. We are on the right track, and just need to show that the distribution of X, and thus the possible deviations in distances, is tightly concentrated around c. In full rigor, we will show

\displaystyle \Pr [X \geq (1+\varepsilon) c] < e^{-(\varepsilon^2 - \varepsilon^3) \frac{c}{4}}

Let A_i denote the i-th column of A. Define by X_i the quantity \langle A_i, u \rangle / \| u \|. This is a weighted average of the entries of A_i by the entries of u. But since we chose the entries of A from the normal distribution, and since a weighted average of normally distributed random variables is also normally distributed (has the same distribution), X_i is a N(0,1) random variable. Moreover, each column is independent. This allows us to decompose X as

X = \frac{k}{\| u \|^2} \| \rho(u) \|^2 = \frac{\| Au \|^2}{\| u \|^2}

Expanding further,

X = \sum_{i=1}^c \frac{\| A_i u \|^2}{\|u\|^2} = \sum_{i=1}^c X_i^2

Now the event X \leq (1+\varepsilon) c can be expressed in terms of the nonegative variable e^{\lambda X}, where 0 < \lambda < 1/2 is parameter, to get

\displaystyle \Pr[X \geq (1+\varepsilon) c] = \Pr[e^{\lambda X} \geq e^{(1+\varepsilon)c \lambda}]

This will become useful because the sum X = \sum_i X_i^2 will split into a product momentarily. First we apply Markov’s inequality, which says that for any nonnegative random variable Y, \Pr[Y \geq t] \leq \mathbb{E}[Y] / t. This lets us write

\displaystyle \Pr[e^{\lambda X} \geq e^{(1+\varepsilon) c \lambda}] \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{(1+\varepsilon) c \lambda}}

Now we can split up the exponent \lambda X into \sum_{i=1}^c \lambda X_i^2, and using the i.i.d.-ness of the X_i^2 we can rewrite the RHS of the inequality as

\left ( \frac{\mathbb{E}[e^{\lambda X_1^2}]}{e^{(1+\varepsilon)\lambda}} \right )^c

A similar statement using -\lambda is true for the (1-\varepsilon) part, namely that

\displaystyle \Pr[X \leq (1-\varepsilon)c] \leq \left ( \frac{\mathbb{E}[e^{-\lambda X_1^2}]}{e^{-(1-\varepsilon)\lambda}} \right )^c

The last thing that’s needed is to bound \mathbb{E}[e^{\lambda X_i^2}], but since X_i^2 \sim N(0,1), we can use the known density function for a normal distribution, and integrate to get the exact value \mathbb{E}[e^{\lambda X_1^2}] = \frac{1}{\sqrt{1-2\lambda}}. Including this in the bound gives us a closed-form bound in terms of \lambda, c, \varepsilon. Using standard calculus the optimal \lambda \in (0,1/2) is \lambda = \varepsilon / 2(1+\varepsilon). This gives

\displaystyle \Pr[X \geq (1+\varepsilon) c] \leq ((1+\varepsilon)e^{-\varepsilon})^{c/2}

Using the Taylor series expansion for e^x, one can show the bound 1+\varepsilon < e^{\varepsilon - (\varepsilon^2 - \varepsilon^3)/2}, which simplifies the final upper bound to e^{-(\varepsilon^2 - \varepsilon^3) c/4}.

Doing the same thing for the (1-\varepsilon) version gives an equivalent bound, and so the total bound is doubled, i.e. 2e^{-(\varepsilon^2 - \varepsilon^3) c/4}.

As we said at the beginning, applying the union bound means we need

\displaystyle 2e^{-(\varepsilon^2 - \varepsilon^3) c/4} < \frac{1}{\binom{n}{2}}

Solving this for c gives c \geq \frac{8 \log m}{\varepsilon^2 - \varepsilon^3}, as desired.

\square

Projecting in Practice

Let’s write a python program to actually perform the Johnson-Lindenstrauss dimension reduction scheme. This is sometimes called the Johnson-Lindenstrauss transform, or JLT.

First we define a random subspace by sampling an appropriately-sized matrix with normally distributed entries, and a function that performs the projection onto a given subspace (for testing).

import random                                                                                       
import math
import numpy

def randomSubspace(subspaceDimension, ambientDimension):
   return numpy.random.normal(0, 1, size=(subspaceDimension, ambientDimension))

def project(v, subspace):
   subspaceDimension = len(subspace)
   return (1 / math.sqrt(subspaceDimension)) * subspace.dot(v)

We have a function that computes the theoretical bound on the optimal dimension to reduce to.

def theoreticalBound(n, epsilon):
   return math.ceil(8*math.log(n) / (epsilon**2 - epsilon**3))

And then performing the JLT is simply matrix multiplication

def jlt(data, subspaceDimension):
   ambientDimension = len(data[0])
   A = randomSubspace(subspaceDimension, ambientDimension)
   return (1 / math.sqrt(subspaceDimension)) * A.dot(data.T).T

The high-dimensional dataset we’ll use comes from a data mining competition called KDD Cup 2001. The dataset we used deals with drug design, and the goal is to determine whether an organic compound binds to something called thrombin. Thrombin has something to do with blood clotting, and I won’t pretend I’m an expert. The dataset, however, has over a hundred thousand features for about 2,000 compounds. Here are a few approximate target dimensions we can hope for as epsilon varies.

>>> [((1/x),theoreticalBound(n=2000, epsilon=1/x)) 
       for x in [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]]
[('0.50', 487), ('0.33', 821), ('0.25', 1298), ('0.20', 1901), 
 ('0.17', 2627), ('0.14', 3477), ('0.12', 4448), ('0.11', 5542), 
 ('0.10', 6757), ('0.07', 14659), ('0.05', 25604)]

Going down from a hundred thousand dimensions to a few thousand is by any measure decreases the size of the dataset by about 95%. We can also observe how the distribution of overall distances varies as the size of the subspace we project to varies.

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The last three frames are for 10, 5, and 2 dimensions respectively. As you can see the histogram starts to beef up around zero. To be honest I was expecting something a bit more dramatic like a uniform-ish distribution. Of course, the distribution of distances is not all that matters. Another concern is the worst case change in distances between any two points before and after the projection. We can see that indeed when we project to the dimension specified in the theorem, that the distances are within the prescribed bounds.

def checkTheorem(oldData, newData, epsilon):
   numBadPoints = 0

   for (x,y), (x2,y2) in zip(combinations(oldData, 2), combinations(newData, 2)):
      oldNorm = numpy.linalg.norm(x2-y2)**2
      newNorm = numpy.linalg.norm(x-y)**2

      if newNorm == 0 or oldNorm == 0:
         continue

      if abs(oldNorm / newNorm - 1) > epsilon:
         numBadPoints += 1

   return numBadPoints

if __name__ == "__main__"
   from data import thrombin 
   train, labels = thrombin.load() 
   
   numPoints = len(train)
   epsilon = 0.2
   subspaceDim = theoreticalBound(numPoints, epsilon)
   ambientDim = len(train[0])
   newData = jlt(train, subspaceDim)
                                                                                                           
   print(checkTheorem(train, newData, epsilon))

This program prints zero every time I try running it, which is the poor man’s way of saying it works “with high probability.” We can also plot statistics about the number of pairs of data points that are distorted by more than \varepsilon as the subspace dimension shrinks. We ran this on the following set of subspace dimensions with \varepsilon = 0.1 and took average/standard deviation over twenty trials:

   dims = [1000, 750, 500, 250, 100, 75, 50, 25, 10, 5, 2]

The result is the following chart, whose x-axis is the dimension projected to (so the left hand is the most extreme projection to 2, 5, 10 dimensions), the y-axis is the number of distorted pairs, and the error bars represent a single standard deviation away from the mean.

thrombin-worst-case

This chart provides good news about this dataset because the standard deviations are low. It tells us something that mathematicians often ignore: the predictability of the tradeoff that occurs once you go past the theoretically perfect bound. In this case, the standard deviations tell us that it’s highly predictable. Moreover, since this tradeoff curve measures pairs of points, we might conjecture that the distortion is localized around a single set of points that got significantly “rattled” by the projection. This would be an interesting exercise to explore.

Now all of these charts are really playing with the JLT and confirming the correctness of our code (and hopefully our intuition). The real question is: how well does a machine learning algorithm perform on the original data when compared to the projected data? If the algorithm only “depends” on the pairwise distances between the points, then we should expect nearly identical accuracy in the unprojected and projected versions of the data. To show this we’ll use an easy learning algorithm, the k-nearest-neighbors clustering method. The problem, however, is that there are very few positive examples in this particular dataset. So looking for the majority label of the nearest k neighbors for any k > 2 unilaterally results in the “all negative” classifier, which has 97% accuracy. This happens before and after projecting.

To compensate for this, we modify k-nearest-neighbors slightly by having the label of a predicted point be 1 if any label among its nearest neighbors is 1. So it’s not a majority vote, but rather a logical OR of the labels of nearby neighbors. Our point in this post is not to solve the problem well, but rather to show how an algorithm (even a not-so-good one) can degrade as one projects the data into smaller and smaller dimensions. Here is the code.

def nearestNeighborsAccuracy(data, labels, k=10):
   from sklearn.neighbors import NearestNeighbors
   trainData, trainLabels, testData, testLabels = randomSplit(data, labels) # cross validation
   model = NearestNeighbors(n_neighbors=k).fit(trainData)
   distances, indices = model.kneighbors(testData)
   predictedLabels = []

   for x in indices:
      xLabels = [trainLabels[i] for i in x[1:]]
      predictedLabel = max(xLabels)
      predictedLabels.append(predictedLabel)

   totalAccuracy = sum(x == y for (x,y) in zip(testLabels, predictedLabels)) / len(testLabels)
   falsePositive = (sum(x == 0 and y == 1 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 0 for x in testLabels))
   falseNegative = (sum(x == 1 and y == 0 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 1 for x in testLabels))

   return totalAccuracy, falsePositive, falseNegative

And here is the accuracy of this modified k-nearest-neighbors algorithm run on the thrombin dataset. The horizontal line represents the accuracy of the produced classifier on the unmodified data set. The x-axis represents the dimension projected to (left-hand side is the lowest), and the y-axis represents the accuracy. The mean accuracy over fifty trials was plotted, with error bars representing one standard deviation. The complete code to reproduce the plot is in the Github repository [link link link].

thrombin-knn-accuracy

Likewise, we plot the proportion of false positive and false negatives for the output classifier. Note that a “positive” label made up only about 2% of the total data set. First the false positives

thrombin-knn-fp

Then the false negatives

thrombin-knn-fn

As we can see from these three charts, things don’t really change that much (for this dataset) even when we project down to around 200-300 dimensions. Note that for these parameters the “correct” theoretical choice for dimension was on the order of 5,000 dimensions, so this is a 95% savings from the naive approach, and 99.75% space savings from the original data. Not too shabby.

Notes

The \Omega(\log(n)) worst-case dimension bound is asymptotically tight, though there is some small gap in the literature that depends on \varepsilon. This result is due to Noga Alon, the very last result (Section 9) of this paper.

We did dimension reduction with respect to preserving the Euclidean distance between points. One might naturally wonder if you can achieve the same dimension reduction with a different metric, say the taxicab metric or a p-norm. In fact, you cannot achieve anything close to logarithmic dimension reduction for the taxicab (l_1) metric. This result is due to Brinkman-Charikar in 2004.

The code we used to compute the JLT is not particularly efficient. There are much more efficient methods. One of them, borrowing its namesake from the Fast Fourier Transform, is called the Fast Johnson-Lindenstrauss Transform. The technique is due to Ailon-Chazelle from 2009, and it involves something called “preconditioning a sparse projection matrix with a randomized Fourier transform.” I don’t know precisely what that means, but it would be neat to dive into that in a future post.

The central focus in this post was whether the JLT preserves distances between points, but one might be curious as to whether the points themselves are well approximated. The answer is an enthusiastic no. If the data were images, the projected points would look nothing like the original images. However, it appears the degradation tradeoff is measurable (by some accounts perhaps linear), and there appears to be some work (also this by the same author) when restricting to sparse vectors (like word-association vectors).

Note that the JLT is not the only method for dimensionality reduction. We previously saw principal component analysis (applied to face recognition), and in the future we will cover a related technique called the Singular Value Decomposition. It is worth noting that another common technique specific to nearest-neighbor is called “locality-sensitive hashing.” Here the goal is to project the points in such a way that “similar” points land very close to each other. Say, if you were to discretize the plane into bins, these bins would form the hash values and you’d want to maximize the probability that two points with the same label land in the same bin. Then you can do things like nearest-neighbors by comparing bins.

Another interesting note, if your data is linearly separable (like the examples we saw in our age-old post on Perceptrons), then you can use the JLT to make finding a linear separator easier. First project the data onto the dimension given in the theorem. With high probability the points will still be linearly separable. And then you can use a perceptron-type algorithm in the smaller dimension. If you want to find out which side a new point is on, you project and compare with the separator in the smaller dimension.

Beyond its interest for practical dimensionality reduction, the JLT has had many other interesting theoretical consequences. More generally, the idea of “randomly projecting” your data onto some small dimensional space has allowed mathematicians to get some of the best-known results on many optimization and learning problems, perhaps the most famous of which is called MAX-CUT; the result is by Goemans-Williamson and it led to a mathematical constant being named after them, \alpha_{GW} =.878567 \dots. If you’re interested in more about the theory, Santosh Vempala wrote a wonderful (and short!) treatise dedicated to this topic.

randomprojectionbook

Concrete Examples of Quantum Gates

So far in this series we’ve seen a lot of motivation and defined basic ideas of what a quantum circuit is. But on rereading my posts, I think we would all benefit from some concreteness.

“Local” operations

So by now we’ve understood that quantum circuits consist of a sequence of gates A_1, \dots, A_k, where each A_i is an 8-by-8 matrix that operates “locally” on some choice of three (or fewer) qubits. And in your head you imagine starting with some state vector v and applying each A_i locally to its three qubits until the end when you measure the state and get some classical output.

But the point I want to make is that A_i actually changes the whole state vector v, because the three qubits it acts “locally” on are part of the entire basis. Here’s an example. Suppose we have three qubits and they’re in the state

\displaystyle v = \frac{1}{\sqrt{14}} (e_{001} + 2e_{011} - 3e_{101})

Recall we abbreviate basis states by subscripting them by binary strings, so e_{011} = e_0 \otimes e_1 \otimes e_1, and a valid state is any unit vector over the 2^3 = 8 possible basis elements. As a vector, this state is \frac{1}{\sqrt{14}} (0,1,0,2,0,-3,0,0)

Say we apply the gate A that swaps the first and third qubits. “Locally” this gate has the following matrix:

\displaystyle \begin{pmatrix} 1&0&0&0 \\ 0&0&1&0 \\ 0&1&0&0 \\ 0&0&0&1 \end{pmatrix}

where we index the rows and columns by the relevant strings in lexicographic order: 00, 01, 10, 11. So this operation leaves e_{00} and e_{11} the same while swapping the other two. However, as an operation on three qubits the operation looks quite different. And it’s sort of hard to describe a general way to write it down as a matrix because of the choice of indices. There are three different perspectives.

Perspective 1: if the qubits being operated on are sequential (like, the third, fourth, and fifth qubits), then we can write the matrix as I_{2^{a}} \otimes A \otimes I_{2^{b}} where a tensor product of matrices is the Kronecker product and a + b + \log \textup{dim}(A) = n (the number of qubits adds up). Then the final operation looks like a “tiled product” of identity matrices by A, but it’s a pain to write out. Let me hurt my self for your sake, dear reader.

tensormatrix1

And each copy of A \otimes I_{2^b} looks like

tiled-piece

 

That’s a mess, but if you write it out for our example of swapping the first and third qubits of a three-qubit register you get the following:

example-3swap2

And this makes sense: the gate changes any entry of the state vector that has values for the first and third qubit that are different. This is what happens to our state:

\displaystyle v = \frac{1}{\sqrt{14}} (0,1,0,2,0,-3,0,0) \mapsto \frac{1}{\sqrt{14}} (0,0,0,0,1,-3,2,0)

Perspective 2: just assume every operation works on the first three qubits, and wrap each operation A in between an operation that swaps the first three qubits with the desired three. So like BAB for B a swap operation. Then the matrix form looks a bit simpler, and it just means we permute the columns of the matrix form we gave above so that it just has the form A \otimes I_a. This allows one to retain a shred of sanity when trying to envision the matrix for an operation that acts on three qubits that are not sequential. The downside is that to actually use this perspective in an analysis you have to carry around the extra baggage of these permutation matrices. So one might use this as a simplifying assumption (a “without loss of generality” statement).

Perspective 3: ignore matrices and write things down in a summation form. So if \sigma is the permutation that swaps 1 and 3 and leaves the other indices unchanged, we can write the general operation on a state v = \sum_{x \in \{ 0,1 \}^n } a_x e_{x} as Av = \sum_{x \in \{ 0,1 \}^n} a_x e_{\sigma(x)}.

The third option is probably the nicest way to do things, but it’s important to keep the matrix view in mind for many reasons. Just one quick reason: “errors” in quantum gates (that are meant to approximately compute something) compound linearly in the number of gates because the operations are linear. This is a key reason that allows one to design quantum analogues of error correcting codes.

So we’ve established that the basic (atomic) quantum gates are “local” in the sense that they operate on a fixed number of qubits, but they are not local in the sense that they can screw up the entire state vector.

A side note on the meaning of “local”

When I was chugging through learning this stuff (and I still have far to go), I wanted to come up with an alternate characterization of the word “local” so that I would feel better about using the word “local.” Mathematicians are as passionate about word choice as programmers are about text editors. In particular, for a long time I was ignorantly convinced that quantum gates that act on a small number of qubits don’t affect the marginal distribution of measurement outcomes for other qubits. That is, I thought that if A acts on qubits 1,2,3, then Av and v have the same probability of a measurement producing a 1 in index 4, 5, etc, conditioned on fixing a measurement outcome for qubits 1,2,3. In notation, if x is a random variable whose values are binary strings and v is a state vector, I’ll call x \sim v the random process of measuring a state vector v and getting a string x, then my claim was that the following was true for every b_1, b_2, b_3 \in \{0,1\} and every 1 \leq i \leq n:

\displaystyle \begin{aligned}\Pr_{x \sim v}&[x_i = 1 \mid x_1 = b_1, x_2 = b_2, x_3 = b_3] = \\ \Pr_{x \sim Av}&[x_i = 1 \mid x_1 = b_1, x_2 = b_2, x_3 = b_3] \end{aligned}

You could try to prove this, and you would fail because it’s false. In fact, it’s even false if A acts on only a single qubit! Because it’s so tedious to write out all of the notation, I decided to write a program to illustrate the counterexample. (The most brazenly dedicated readers will try to prove this false fact and identify where the proof fails.)

import numpy

H = (1/(2**0.5)) * numpy.array([[1,1], [1,-1]])
I = numpy.identity(4)
A = numpy.kron(H,I)

Here H is the 2 by 2 Hadamard matrix, which operates on a single qubit and maps e_0 \mapsto \frac{e_0 + e_1}{\sqrt{2}}, and e_1 \mapsto \frac{e_0 - e_1}{\sqrt{2}}. This matrix is famous for many reasons, but one simple use as a quantum gate is to generate uniform random coin flips. In particular, measuring He_0 outputs 1 and 0 with equal probability.

So in the code sample above, A is the mapping which applies the Hadamard operation to the first qubit and leaves the other qubits alone.

Then we compute some arbitrary input state vector w

def normalize(z):
   return (1.0 / (sum(abs(z)**2) ** 0.5)) * z

v = numpy.arange(1,9)
w = normalize(v)

And now we write a function to compute the probability of some query conditioned on some fixed bits. We simply sum up the square norms of all of the relevant indices in the state vector.

def condProb(state, query={}, fixed={}):
   num = 0
   denom = 0
   dim = int(math.log2(len(state)))

   for x in itertools.product([0,1], repeat=dim):
      if any(x[index] != b for (index,b) in fixed.items()):
         continue

      i = sum(d << i for (i,d) in enumerate(reversed(x)))
      denom += abs(state[i])**2
      if all(x[index] == b for (index, b) in query.items()):
         num += abs(state[i]) ** 2

   if num == 0:
      return 0 

   return num / denom

So if the query is query = {1:0} and the fixed thing is fixed = {0:0}, then this will compute the probability that the measurement results in the second qubit being zero conditioned on the first qubit also being zero.

And the result:

Aw = A.dot(w)
query = {1:0}
fixed = {0:0}
print((condProb(w, query, fixed), condProb(Aw, query, fixed)))
# (0.16666666666666666, 0.29069767441860467)

So they are not equal in general.

Also, in general we won’t work explicitly with full quantum gate matrices, since for n qubits the have size 2^{2n} which is big. But for finding counterexamples to guesses and false intuition, it’s a great tool.

Some important gates on 1-3 qubits

Let’s close this post with concrete examples of quantum gates. Based on the above discussion, we can write out the 2 x 2 or 4 x 4 matrix form of the operation and understand that it can apply to any two qubits in the state of a quantum program. Gates are most interesting when they’re operating on entangled qubits, and that will come out when we visit our first quantum algorithm next time, but for now we will just discuss at a naive level how they operate on the basis vectors.

Hadamard gate:

We introduced the Hadamard gate already, but I’ll reiterate it here.

Let H be the following 2 by 2 matrix, which operates on a single qubit and maps e_0 \mapsto \frac{e_0 + e_1}{\sqrt{2}}, and e_1 \mapsto \frac{e_0 - e_1}{\sqrt{2}}.

\displaystyle H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}

One can use H to generate uniform random coin flips. In particular, measuring He_0 outputs 1 and 0 with equal probability.

Quantum NOT gate:

Let X be the 2 x 2 matrix formed by swapping the columns of the identity matrix.

\displaystyle X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}

This gate is often called the “Pauli-X” gate by physicists. This matrix is far too simple to be named after a person, and I can only imagine it is still named after a person for the layer of obfuscation that so often makes people feel smarter (same goes for the Pauli-Y and Pauli-Z gates, but we’ll get to those when we need them).

If we’re thinking of e_0 as the boolean value “false” and e_1 as the boolean value “true”, then the quantum NOT gate simply swaps those two states. In particular, note that composing a Hadamard and a quantum NOT gate can have interesting effects: XH(e_0) = H(e_0), but XH(e_1) \neq H(e_1). In the second case, the minus sign is the culprit. Which brings us to…

Phase shift gate:

Given an angle \theta, we can “shift the phase” of one qubit by an angle of \theta using the 2 x 2 matrix R_{\theta}.

\displaystyle R_{\theta} = \begin{pmatrix} 1 & 0 \\ 0 & e^{i \theta} \end{pmatrix}

“Phase” is a term physicists like to use for angles. Since the coefficients of a quantum state vector are complex numbers, and since complex numbers can be thought of geometrically as vectors with direction and magnitude, it makes sense to “rotate” the coefficient of a single qubit. So R_{\theta} does nothing to e_0 and it rotates the coefficient of e_1 by an angle of \theta.

Continuing in our theme of concreteness, if I have the state vector v = \frac{1}{\sqrt{204}} (1,2,3,4,5,6,7,8) and I apply a rotation of pi to the second qubit, then my operation is the matrix I_2 \otimes R_{\pi} \otimes I_2 which maps e_{i0k} \mapsto e_{i0k} and e_{i1k} \mapsto -e_{i1k}. That would map the state v to (1,2,-3,-4,5,6,-7,-8).

If we instead used the rotation by \pi/2 we would get the output state (1,2,3i, 4i, 5, 6, 7i, 8i).

Quantum AND/OR gate:

In the last post in this series we gave the quantum AND gate and left the quantum OR gate as an exercise. Rather than write out the matrix again, let me remind you of this gate using a description of the effect on the basis e_{ijk} where i,j,k \in \{ 0,1 \}. Recall that we need three qubits in order to make the operation reversible (which is a consequence of all unitary gates being unitary matrices). Some notation: \oplus is the XOR of two bits, and \wedge is AND, \vee is OR. The quantum AND gate maps

\displaystyle e_{ijk} \mapsto e_{ij(k \oplus (i \wedge j))}

In words, the third coordinate is XORed with the AND of the first two coordinates. We think of the third coordinate as a “scratchwork” qubit which is maybe prepared ahead of time to be in state zero.

Simiarly, the quantum OR gate maps e_{ijk} \mapsto e_{ij(k \oplus (i \vee j))}. As we saw last time these combined with the quantum NOT gate (and some modest number of scratchwork qubits) allows quantum circuits to simulate any classical circuit.

Controlled-* gate:

The last example in this post is a meta-gate that represents a conditional branching. If we’re given a gate A acting on k qubits, then we define the controlled-A to be an operation which acts on k+1 qubits. Let’s call the added qubit “qubit zero.” Then controlled-A does nothing if qubit zero is in state 0, and applies A if qubit zero is in state 1. Qubit zero is generally called the “control qubit.”

The matrix representing this operation decomposes into blocks if the control qubit is actually the first qubit (or you rearrange).

\displaystyle \begin{pmatrix} I_{2^{k-1}} & 0 \\ 0 & A \end{pmatrix}

A common example of this is the controlled-NOT gate, often abbreviated CNOT, and it has the matrix

\displaystyle \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}

Looking forward

Okay let’s take a step back and evaluate our life choices. So far we’ve spent a few hours of our time motivating quantum computing, explaining the details of qubits and quantum circuits, and seeing examples of concrete quantum gates and studying measurement. I’ve hopefully hammered into your head the notion that quantum states which aren’t pure tensors (i.e. entangled) are where the “weirdness” of quantum computing comes from. But we haven’t seen any examples of quantum algorithms yet!

Next time we’ll see our first example of an algorithm that is genuinely quantum. We won’t tackle factoring yet, but we will see quantum “weirdness” in action.

Until then!

The Welch-Berlekamp Algorithm for Correcting Errors in Data

In this post we’ll implement Reed-Solomon error-correcting codes and use them to play with codes. In our last post we defined Reed-Solomon codes rigorously, but in this post we’ll focus on intuition and code. As usual the code and data used in this post is available on this blog’s Github page.

The main intuition behind Reed-Solomon codes (and basically all the historically major codes) is

Error correction is about adding redundancy, and polynomials are a really efficient way to do that.

Here’s an example of what we’ll do in the post. Say you have a space probe flying past Mars taking photographs like this one

mars

Courtesy of NASA’s Viking Orbiter.

Unfortunately you know that if you send the images back to Earth via radio waves, the signal will get corrupted by cosmic something-or-other and you’ll end up with an image like this.

corrupted-mars

How can you recover from errors like this? You could do something like repeat each pixel twice in the message so that if one is corrupted the other will get through. But still, every now and then both pixels in a row will be corrupted and it’s twice as inefficient.

The idea of error-correcting codes is to find a way to encode a message so that it adds a lot of redundancy without adding too much extra information to the message. The name of the game is to optimize the tradeoff between how much redundancy you get and how much longer the message needs to be, while still being able to efficiently decode the encoded message.

A solid technique turns out to be: use polynomials. Even though you’d think polynomials are too simple (we teach them starting in the 7th grade these days!) they turn out to have remarkable properties. The most important of which is:

if you give me a bunch of points in the plane with different x coordinates, they uniquely define a polynomial of a certain degree.

This fact is called polynomial interpolation. We used it in a previous post to share secrets, if you’re interested.

What makes polynomials great for error correction is that you can take a fixed polynomial (think, the message) and “encode” it as a list of points on that polynomial. If you include enough, then you can get back the original polynomial from the points alone. And the best part, for each two additional points you include above the minimum, you get resilience to one additional error no matter where it happens in the message. Another way to say this is, even if some of the points in your encoded message are wrong (the numbers are modified by an adversary or random noise), as long as there aren’t too many errors there is an algorithm that can recover the errors.

That’s what makes polynomials so much better than the naive idea of repeating every pixel twice: once you allow for three errors you run the risk of losing a pixel, but you had to double your communication costs. With a polynomial-based approach you’d only need to store around six extra pixels worth of data to get resilience to three errors that can happen anywhere. What a bargain!

Here’s the official theorem about Reed-Solomon codes:

Theorem: There is an efficient algorithm which, when given points (a_1, b_1), \dots, (a_n, b_n) with distinct a_i has the following property. If there is a polynomial of degree d that passes through at least n/2 + d/2 of the given points, then the algorithm will output the polynomial.

So let’s implement the encoder, decoder, and turn the theorem into code!

Implementing the encoder

The way you write a message of length k as a polynomial is easy. Pick a large prime integer p and from now on we’ll do all our arithmetic modulo p. Then encode each character c_0, c_1, \dots, c_{k-1} in the message as an integer between 0 and p-1 (this is why p needs to be large enough), and the polynomial representing the message is

m(x) = c_0 + c_1x + c_2x^2 + \dots + c_{k-1}x^{k-1}

If the message has length k then the polynomial will have degree k-1.

Now to encode the message we just pick a bunch of x values and plug them into the polynomial, and record the (input, output) pairs as the encoded message. If we want to make things simple we can just require that you always pick the x values 0, 1, \dots, n for some choice of n \leq p.

A quick skippable side-note: we need p to be prime so that our arithmetic happens in a field. Otherwise, we won’t necessarily get unique decoded messages.

Back when we discussed elliptic curve cryptography (ironically sharing an acronym with error correcting codes), we actually wrote a little library that lets us seamlessly represent polynomials with “modular arithmetic coefficients” in Python, which in math jargon is a “finite field.” Rather than reinvent the wheel we’ll just use that code as a black box (full source in the Github repo). Here are some examples of using it.

>>> from finitefield.finitefield import FiniteField
>>> F13 = FiniteField(p=13)
>>> a = F13(7)
>>> a+9
3 (mod 13)
>>> a*a
10 (mod 13)
>>> 1/a
2 (mod 13)

A programming aside: once you construct an instance of your finite field, all arithmetic operations involving instances of that type will automatically lift integers to the appropriate type. Now to make some polynomials:

>>> from finitefield.polynomial import polynomialsOver
>>> F = FiniteField(p=13)
>>> P = polynomialsOver(F)
>>> g = P([1,3,5])
>>> g
1 + 3 t^1 + 5 t^2
>>> g*g
1 + 6 t^1 + 6 t^2 + 4 t^3 + 12 t^4
>>> g(100)
4 (mod 13)

Now to fix an encoding/decoding scheme we’ll call k the size of the unencoded message, n the size of the encoded message, and p the modulus, and we’ll fix these programmatically when the encoder and decoder are defined so we don’t have to keep carrying these data around.

def makeEncoderDecoder(n, k, p):
   Fp = FiniteField(p)
   Poly = polynomialsOver(Fp)

   def encode(message):
      ...
      
   def decode(encodedMessage):
      ...

   return encode, decode

Encode is the easier of the two.

def encode(message):
   thePoly = Poly(message)
   return [(Fp(i), thePoly(Fp(i))) for i in range(n)]

Technically we could remove the leading Fp(i) from each tuple, since the decoder algorithm can assume we’re using the first n integers in order. But we’ll leave it in and define the decode function more generically.

After we define how the decoder should work in theory we’ll run through a simple example step by step. Now on to the decoder.

The decoding algorithm, Berlekamp-Welch

There are a lot of different decoding algorithms for various error correcting codes. The one we’ll implement is called the Berlekamp-Welch algorithm, but before we get to it we should mention a much simpler algorithm that will work when there are only a few errors.

To remind us of notation, call k the length of the message, so that k-1 is the degree of the polynomial we used to encode it. And n is the number of points we used in the encoding. Call the encoded message M as it’s received (as a list of points, possibly with errors).

In the simple method what you do is just randomly pick k points from M, do polynomial interpolation on the chosen points to get some polynomial g, and see if g agrees with most of the points in M. If there really are few errors, then there’s a good chance the randomly chosen points won’t have any errors in them and you’ll win. If you get unlucky and pick some points with errors, then the g you get won’t agree with most of M and you can throw it out and try again. If you get really unlucky and a bad g does agree with most of M, then you just run this procedure a few hundred times and take the g you get most often. But again, this only works with a small number of errors and while it could be good enough for many applications, don’t bet your first-born child’s life on it working. Or even your favorite pencil, for that matter. We’re going to implement Berlekamp-Welch so you can win someone else’s favorite pencil. You’re welcome.

Exercise: Implement the simple decoding algorithm and test it on some data.

Suppose we are guaranteed that there are exactly e < \frac{n-k+1}{2} errors in our received message M = (a_1, b_1, \dots, a_n, b_n). Call the polynomial that represents the original message P. In other words, we have that P(a_i) = b_i for all but e of the points in M.

There are two key ingredients in the algorithm. The first is called the error locator polynomial. We’ll call this polynomial E(x), and it’s just defined by being zero wherever the errors occurred. In symbols, E(a_i) = 0 whenever P(a_i) \neq b_i. If we knew where the errors occurred, we could write out E(x) explicitly as a product of terms like (x-a_i). And if we knew E we’d also be done, because it would tell us where the errors were and we could do interpolation on all the non-error points in M.

So we’re going to have to study E indirectly and use it to get P. One nice property of E(x) is the following

\displaystyle b_i E(a_i) = P(a_i)E(a_i),

which is true for every pair (a_i, b_i) \in M. Indeed, by definition when P(a_i) \neq b_i then E(a_i) = 0 so both sides are zero. Now we can use a technique called linearization. It goes like this. The product P(x) E(x), i.e. the right-hand-side of the above equation, is a polynomial, say Q(x), of larger degree (e + k - 1). We get the equation for all i:

\displaystyle b_i E(a_i) = Q(a_i)

Now E, Q, and P are all unknown, but it turns out that we can actually find E and Q efficiently. Or rather, we can’t guarantee we’ll find E and Q exactly, instead we’ll find two polynomials that have the same quotient as Q(x)/E(x) = P(x). Here’s how that works.

Say we wrote out E(x) as a generic polynomial of degree e and Q(x) as a generic polynomial of degree e+k-1. So their coefficients are unspecified variables. Now we can plug in all the points a_i, b_i to the equations b_i E(a_i) = Q(a_i), and this will form a linear system of 2e + k-1 unknowns (e unknowns come from E(x) and e+k-1 come from Q(x)).

Now we know that this system has good solution, because if we take the true error locator polynomial and Q = E(x)P(x) with the true P(x) we win. The worry is that we’ll solve this system and get two different polynomials Q', E' whose quotient will be something crazy and unrelated to P. But as it turns out this will never happen, and any solution will give the quotient P. Here’s a proof you can skip if you hate proofs.

Proof. Say you have two pairs of solutions to the system, (Q_1, E_1) and (Q_2, E_2), and you want to show that Q_1/E_1 = Q_2/E_2. Well, they might not be divisible, but we can multiply the previous equation through to get Q_1E_2 = Q_2E_1. Now we show two polynomials are equal in the same way as always: subtract and show there are too many roots. Define R(x) = Q_1E_2 - Q_2E_1. The claim is that R(x) has n roots, one for every point (a_i, b_i). Indeed,

\displaystyle R(a_i) = (b_i E_1(a_i))E_2(a_i) - (b_iE_2(a_i)) E_1(a_i) = 0

But the degree of R(x) is 2e + k - 1 which is less than n by the assumption that e < \frac{n-k+1}{2}. So R(x) has too many roots and must be the zero polynomial, and the two quotients are equal.

\square

So the core python routine is just two steps: solve the linear equation, and then divide two polynomials. However, it turns out that no python module has any decent support for solving linear systems of equations over finite fields.  Luckily, I wrote a linear solver way back when and so we’ll adapt it to our purposes. I’ll leave out the gory details of the solver itself, but you can see them in the source for this post. Here is the code that sets up the system

   def solveSystem(encodedMessage):
      for e in range(maxE, 0, -1):
         ENumVars = e+1
         QNumVars = e+k
         def row(i, a, b):
            return ([b * a**j for j in range(ENumVars)] +
                    [-1 * a**j for j in range(QNumVars)] +
                    [0]) # the "extended" part of the linear system

         system = ([row(i, a, b) for (i, (a,b)) in enumerate(encodedMessage)] +
                   [[0] * (ENumVars-1) + [1] + [0] * (QNumVars) + [1]])
                     # ensure coefficient of x^e in E(x) is 1

         solution = someSolution(system, freeVariableValue=1)
         E = Poly([solution[j] for j in range(e + 1)])
         Q = Poly([solution[j] for j in range(e + 1, len(solution))])

         P, remainder = Q.__divmod__(E)
         if remainder == 0:
            return Q, E

      raise Exception("found no divisors!")

   def decode(encodedMessage):
      Q,E = solveSystem(encodedMessage)

      P, remainder = Q.__divmod__(E)
      if remainder != 0:
         raise Exception("Q is not divisibly by E!")

      return P.coefficients

A simple example

Now let’s go through an extended example with small numbers. Let’s work modulo 7 and say that our message is

2, 3, 2 (mod 7)

In particular, k=3 is the length of the message. We’ll encode it as a polynomial in the way we described:

\displaystyle m(x) = 2 + 3x + 2x^2 (\mod 7)

If we pick n = 5, then we will encode the message as a sequence of five points on m(x), namely m(0) through m(4).

[[0, 2], [1, 0], [2, 2], [3, 1], [4, 4]] (mod 7)

Now let’s add a single error. First remember that our theoretical guarantee says that we can correct any number of errors up to \frac{n+k-1}{2} - 1, which in this case is [(5+3-1) / 2] - 1 = 2, so we can definitely correct one error. We’ll add 1 to the third point, giving the received corrupted message as

[[0, 2], [1, 0], [2, 3], [3, 1], [4, 4]] (mod 7)

Now we set up the system of equations b_i E(a_i) = Q(a_i) for all (a_i, b_i) above. Rewriting the equations as b_iE(a_i) - Q(a_i) = 0, and adding as the last equation the constraint that x^e = 1. The columns represent the variables, with the last column being the right-hand-side of the equality as is the standard for Gaussian elimination.

# e0 e1 q0 q1 q2 q3  
[
  [2, 0, 6, 0, 0, 0, 0],
  [0, 0, 6, 6, 6, 6, 0],
  [3, 6, 6, 5, 3, 6, 0],
  [1, 3, 6, 4, 5, 1, 0],
  [4, 2, 6, 3, 5, 6, 0],
  [0, 1, 0, 0, 0, 0, 1],
]

Then we do row-reduction to get

[
  [1, 0, 0, 0, 0, 0, 5],
  [0, 1, 0, 0, 0, 0, 1],
  [0, 0, 1, 0, 0, 0, 3],
  [0, 0, 0, 1, 0, 0, 3],
  [0, 0, 0, 0, 1, 0, 6],
  [0, 0, 0, 0, 0, 1, 2]
]

And reading off the solution gives E(x) = 5 + x and Q(x) = 2 + 2x + 6x^2 + 2x^3. Note in particular that the E(x) given in this solution is not the error locator polynomial! Nevertheless, the quotient of the two polynomials is exactly m(x) = 2 + 3x + 2x^2 which gives back the original message.

There is one catch here: how does one determine the value of e to use in setting up the system of linear equations? It turns out that an upper bound on e will work just fine, so long as the upper bound you use agrees with the theoretical maximum number of errors allowed (see the Singleton bound from last time). The effect of doing this is that the linear system ends up with some number of free variables that you can set to arbitrary values, and these will correspond to additional shared roots of E(x) and Q(x) that cancel out upon dividing.

A larger example

Now it’s time for a sad fact. I tried running Welch-Berlekamp on an encoded version of the following tiny image:

tiny-mars

And it didn’t finish after running all night.

Berlekamp-Welch is a slow algorithm for decoding Reed-Solomon codes because it requires one to solve a large system of equations. There’s at least one equation for each pixel in a black and white image! To get around this one typically encodes blocks of pixels together into one message character (since p is larger than n > k there is lots of space), and apparently one can balance it to minimize the number of equations. And finally, a nontrivial inefficiency comes from our implementation of everything in Python without optimizations. If we rewrote everything in C++ or Go and fixed the prime modulus, we would likely see reasonable running times. There are also asymptotically much faster methods based on the fast Fourier transform, and in the future we’ll try implementing some of these. For the dedicated reader, these are all good follow-up projects.

For now we’ll just demonstrate that it works by running it on a larger sample of text, the introductory paragraphs of To Kill a Mockingbird:

def tkamTest():
   message = '''When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.  When it healed, and Jem's fears of never being able to play football were assuaged, he was seldom   self-conscious about his injury. His left arm was somewhat shorter than his right; when he stood or walked, the back of his hand was at right angles to his body, his thumb parallel to his thigh. He   couldn't have cared less, so long as he could pass and punt.'''

   k = len(message)
   n = len(message) * 2
   p = 2087
   integerMessage = [ord(x) for x in message]

   enc, dec, solveSystem = makeEncoderDecoder(n, k, p)
   print("encoding...")
   encoded = enc(integerMessage)

   e = int(k/2)
   print("corrupting...")
   corrupted = corrupt(encoded[:], e, 0, p)

   print("decoding...")
   Q,E = solveSystem(corrupted)
   P, remainder = (Q.__divmod__(E))

   recovered = ''.join([chr(x) for x in P.coefficients])
   print(recovered)

Running this with unix time produces the following:

encoding...
corrupting...
decoding...
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow. When it healed, and Jem's fears of never being able to play football were assuaged, he was seldom self-conscious about his injury. His left arm was somewhat shorter than his right; when he stood or walked, the back of his hand was at right angles to his body, his thumb parallel to his thigh. He couldn't have cared less, so long as he could pass and punt.

real	82m9.813s
user	81m18.891s
sys	0m27.404s

So it finishes in “only” an hour or so.

In any case, the decoding algorithm is an interesting one. In future posts we’ll explore more efficient algorithms and faster implementations.

Until then!

Markov Chain Monte Carlo Without all the Bullshit

I have a little secret: I don’t like the terminology, notation, and style of writing in statistics. I find it unnecessarily complicated. This shows up when trying to read about Markov Chain Monte Carlo methods. Take, for example, the abstract to the Markov Chain Monte Carlo article in the Encyclopedia of Biostatistics.

Markov chain Monte Carlo (MCMC) is a technique for estimating by simulation the expectation of a statistic in a complex model. Successive random selections form a Markov chain, the stationary distribution of which is the target distribution. It is particularly useful for the evaluation of posterior distributions in complex Bayesian models. In the Metropolis–Hastings algorithm, items are selected from an arbitrary “proposal” distribution and are retained or not according to an acceptance rule. The Gibbs sampler is a special case in which the proposal distributions are conditional distributions of single components of a vector parameter. Various special cases and applications are considered.

I can only vaguely understand what the author is saying here (and really only because I know ahead of time what MCMC is). There are certainly references to more advanced things than what I’m going to cover in this post. But it seems very difficult to find an explanation of Markov Chain Monte Carlo without all any superfluous jargon. The “bullshit” here is the implicit claim of an author that such jargon is needed. Maybe it is to explain advanced applications (like attempts to do “inference in Bayesian networks”), but it is certainly not needed to define or analyze the basic ideas.

So to counter, here’s my own explanation of Markov Chain Monte Carlo, inspired by the treatment of John Hopcroft and Ravi Kannan.

The Problem is Drawing from a Distribution

Markov Chain Monte Carlo is a technique to solve the problem of sampling from a complicated distribution. Let me explain by the following imaginary scenario. Say I have a magic box which can estimate probabilities of baby names very well. I can give it a string like “Malcolm” and it will tell me the exact probability p_{\textup{Malcolm}} that you will choose this name for your next child. So there’s a distribution D over all names, it’s very specific to your preferences, and for the sake of argument say this distribution is fixed and you don’t get to tamper with it.

Now comes the problem: I want to efficiently draw a name from this distribution D. This is the problem that Markov Chain Monte Carlo aims to solve. Why is it a problem? Because I have no idea what process you use to pick a name, so I can’t simulate that process myself. Here’s another method you could try: generate a name x uniformly at random, ask the machine for p_x, and then flip a biased coin with probability p_x and use x if the coin lands heads. The problem with this is that there are exponentially many names! The variable here is the number of bits needed to write down a name n = |x|. So either the probabilities p_x will be exponentially small and I’ll be flipping for a very long time to get a single name, or else there will only be a few names with nonzero probability and it will take me exponentially many draws to find them. Inefficiency is the death of me.

So this is a serious problem! Let’s restate it formally just to be clear.

Definition (The sampling problem):  Let D be a distribution over a finite set X. You are given black-box access to the probability distribution function p(x) which outputs the probability of drawing x \in X according to D. Design an efficient randomized algorithm A which outputs an element of X so that the probability of outputting x is approximately p(x). More generally, output a sample of elements from X drawn according to p(x).

Assume that A has access to only fair random coins, though this allows one to efficiently simulate flipping a biased coin of any desired probability.

Notice that with such an algorithm we’d be able to do things like estimate the expected value of some random variable f : X \to \mathbb{R}. We could take a large sample S \subset X via the solution to the sampling problem, and then compute the average value of f on that sample. This is what a Monte Carlo method does when sampling is easy. In fact, the Markov Chain solution to the sampling problem will allow us to do the sampling and the estimation of \mathbb{E}(f) in one fell swoop if you want.

But the core problem is really a sampling problem, and “Markov Chain Monte Carlo” would be more accurately called the “Markov Chain Sampling Method.” So let’s see why a Markov Chain could possibly help us.

Random Walks, the “Markov Chain” part of MCMC

Markov Chain is essentially a fancy term for a random walk on a graph.

You give me a directed graph G = (V,E), and for each edge e = (u,v) \in E you give me a number p_{u,v} \in [0,1]. In order to make a random walk make sense, the p_{u,v} need to satisfy the following constraint:

For any vertex x \in V, the set all values p_{x,y} on outgoing edges (x,y) must sum to 1, i.e. form a probability distribution.

If this is satisfied then we can take a random walk on G according to the probabilities as follows: start at some vertex x_0. Then pick an outgoing edge at random according to the probabilities on the outgoing edges, and follow it to x_1. Repeat if possible.

I say “if possible” because an arbitrary graph will not necessarily have any outgoing edges from a given vertex. We’ll need to impose some additional conditions on the graph in order to apply random walks to Markov Chain Monte Carlo, but in any case the idea of randomly walking is well-defined, and we call the whole object (V,E, \{ p_e \}_{e \in E})Markov chain.

Here is an example where the vertices in the graph correspond to emotional states.

An example Markov chain [image source http://www.mathcs.emory.edu/~cheung/]

An example Markov chain; image source http://www.mathcs.emory.edu/~cheung/

In statistics land, they take the “state” interpretation of a random walk very seriously. They call the edge probabilities “state-to-state transitions.”

The main theorem we need to do anything useful with Markov chains is the stationary distribution theorem (sometimes called the “Fundamental Theorem of Markov Chains,” and for good reason). What it says intuitively is that for a very long random walk, the probability that you end at some vertex v is independent of where you started! All of these probabilities taken together is called the stationary distribution of the random walk, and it is uniquely determined by the Markov chain.

However, for the reasons we stated above (“if possible”), the stationary distribution theorem is not true of every Markov chain. The main property we need is that the graph G is strongly connected. Recall that a directed graph is called connected if, when you ignore direction, there is a path from every vertex to every other vertex. It is called strongly connected if you still get paths everywhere when considering direction. If we additionally require the stupid edge-case-catcher that no edge can have zero probability, then strong connectivity (of one component of a graph) is equivalent to the following property:

For every vertex v \in V(G), an infinite random walk started at v will return to v with probability 1.

In fact it will return infinitely often. This property is called the persistence of the state v by statisticians. I dislike this term because it appears to describe a property of a vertex, when to me it describes a property of the connected component containing that vertex. In any case, since in Markov Chain Monte Carlo we’ll be picking the graph to walk on (spoiler!) we will ensure the graph is strongly connected by design.

Finally, in order to describe the stationary distribution in a more familiar manner (using linear algebra), we will write the transition probabilities as a matrix A where entry a_{j,i} = p_{(i,j)} if there is an edge (i,j) \in E and zero otherwise. Here the rows and columns correspond to vertices of G, and each column i forms the probability distribution of going from state i to some other state in one step of the random walk. Note A is the transpose of the weighted adjacency matrix of the directed weighted graph G where the weights are the transition probabilities (the reason I do it this way is because matrix-vector multiplication will have the matrix on the left instead of the right; see below).

This matrix allows me to describe things nicely using the language of linear algebra. In particular if you give me a basis vector e_i interpreted as “the random walk currently at vertex i,” then Ae_i gives a vector whose j-th coordinate is the probability that the random walk would be at vertex j after one more step in the random walk. Likewise, if you give me a probability distribution q over the vertices, then Aq gives a probability vector interpreted as follows:

If a random walk is in state i with probability q_i, then the j-th entry of Aq is the probability that after one more step in the random walk you get to vertex j.

Interpreted this way, the stationary distribution is a probability distribution \pi such that A \pi = \pi, in other words \pi is an eigenvector of A with eigenvalue 1.

A quick side note for avid readers of this blog: this analysis of a random walk is exactly what we did back in the early days of this blog when we studied the PageRank algorithm for ranking webpages. There we called the matrix A “a web matrix,” noted it was column stochastic (as it is here), and appealed to a special case of the Perron-Frobenius theorem to show that there is a unique maximal eigenvalue equal to one (with a dimension one eigenspace) whose eigenvector we used as a sort of “stationary distribution” and the final ranking of web pages. There we described an algorithm to actually find that eigenvector by iterated multiplication by A. The following theorem is essentially a variant of this algorithm but works under weaker conditions; for the web matrix we added additional “fake” edges that give the needed stronger conditions.

Theorem: Let G be a strongly connected graph with associated edge probabilities \{ p_e \}_e \in E forming a Markov chain. For a probability vector x_0, define x_{t+1} = Ax_t for all t \geq 1, and let v_t be the long-term average v_t = \frac1t \sum_{s=1}^t x_s. Then:

  1. There is a unique probability vector \pi with A \pi = \pi.
  2. For all x_0, the limit \lim_{t \to \infty} v_t = \pi.

Proof. Since v_t is a probability vector we just want to show that |Av_t - v_t| \to 0 as t \to \infty. Indeed, we can expand this quantity as

\displaystyle \begin{aligned} Av_t - v_t &=\frac1t (Ax_0 + Ax_1 + \dots + Ax_{t-1}) - \frac1t (x_0 + \dots + x_{t-1}) \\ &= \frac1t (x_t - x_0) \end{aligned}

But x_t, x_0 are unit vectors, so their difference is at most 2, meaning |Av_t - v_t| \leq \frac2t \to 0. Now it’s clear that this does not depend on v_0. For uniqueness we will cop out and appeal to the Perron-Frobenius theorem that says any matrix of this form has a unique such (normalized) eigenvector.

\square

One additional remark is that, in addition to computing the stationary distribution by actually computing this average or using an eigensolver, one can analytically solve for it as the inverse of a particular matrix. Define B = A-I_n, where I_n is the n \times n identity matrix. Let C be B with a row of ones appended to the bottom and the topmost row removed. Then one can show (quite opaquely) that the last column of C^{-1} is \pi. We leave this as an exercise to the reader, because I’m pretty sure nobody uses this method in practice.

One final remark is about why we need to take an average over all our x_t in the theorem above. There is an extra technical condition one can add to strong connectivity, called aperiodicity, which allows one to beef up the theorem so that x_t itself converges to the stationary distribution. Rigorously, aperiodicity is the property that, regardless of where you start your random walk, after some sufficiently large number of steps n the random walk has a positive probability of being at every vertex at every subsequent step. As an example of a graph where aperiodicity fails: an undirected cycle on an even number of vertices. In that case there will only be a positive probability of being at certain vertices every other step, and averaging those two long term sequences gives the actual stationary distribution.

Screen Shot 2015-04-07 at 6.55.39 PM

Image source: Wikipedia

One way to guarantee that your Markov chain is aperiodic is to ensure there is a positive probability of staying at any vertex. I.e., that your graph has a self-loop. This is what we’ll do in the next section.

Constructing a graph to walk on

Recall that the problem we’re trying to solve is to draw from a distribution over a finite set X with probability function p(x). The MCMC method is to construct a Markov chain whose stationary distribution is exactly p, even when you just have black-box access to evaluating p. That is, you (implicitly) pick a graph G and (implicitly) choose transition probabilities for the edges to make the stationary distribution p. Then you take a long enough random walk on G and output the x corresponding to whatever state you land on.

The easy part is coming up with a graph that has the right stationary distribution (in fact, “most” graphs will work). The hard part is to come up with a graph where you can prove that the convergence of a random walk to the stationary distribution is fast in comparison to the size of X. Such a proof is beyond the scope of this post, but the “right” choice of a graph is not hard to understand.

The one we’ll pick for this post is called the Metropolis-Hastings algorithm. The input is your black-box access to p(x), and the output is a set of rules that implicitly define a random walk on a graph whose vertex set is X.

It works as follows: you pick some way to put X on a lattice, so that each state corresponds to some vector in \{ 0,1, \dots, n\}^d. Then you add (two-way directed) edges to all neighboring lattice points. For n=5, d=2 it would look like this:

And for d=3, n \in \{2,3\} it would look like this:

You have to be careful here to ensure the vertices you choose for X are not disconnected, but in many applications X is naturally already a lattice.

Now we have to describe the transition probabilities. Let r be the maximum degree of a vertex in this lattice (r=2d). Suppose we’re at vertex i and we want to know where to go next. We do the following:

  1. Pick neighbor j with probability 1/r (there is some chance to stay at i).
  2. If you picked neighbor j and p(j) \geq p(i) then deterministically go to j.
  3. Otherwise, p(j) < p(i), and you go to j with probability p(j) / p(i).

We can state the probability weight p_{i,j} on edge (i,j) more compactly as

\displaystyle p_{i,j} = \frac1r \min(1, p(j) / p(i)) \\ p_{i,i} = 1 - \sum_{(i,j) \in E(G); j \neq i} p_{i,j}

It is easy to check that this is indeed a probability distribution for each vertex i. So we just have to show that p(x) is the stationary distribution for this random walk.

Here’s a fact to do that: if a probability distribution v with entries v(x) for each x \in X has the property that v(x)p_{x,y} = v(y)p_{y,x} for all x,y \in X, the v is the stationary distribution. To prove it, fix x and take the sum of both sides of that equation over all y. The result is exactly the equation v(x) = \sum_{y} v(y)p_{y,x}, which is the same as v = Av. Since the stationary distribution is the unique vector satisfying this equation, v has to be it.

Doing this with out chosen p(i) is easy, since p(i)p_{i,j} and p(i)p_{j,i} are both equal to \frac1r \min(p(i), p(j)) by applying a tiny bit of algebra to the definition. So we’re done! One can just randomly walk according to these probabilities and get a sample.

Last words

The last thing I want to say about MCMC is to show that you can estimate the expected value of a function \mathbb{E}(f) simultaneously while random-walking through your Metropolis-Hastings graph (or any graph whose stationary distribution is p(x)). By definition the expected value of f is \sum_x f(x) p(x).

Now what we can do is compute the average value of f(x) just among those states we’ve visited during our random walk. With a little bit of extra work you can show that this quantity will converge to the true expected value of f at about the same time that the random walk converges to the stationary distribution. (Here the “about” means we’re off by a constant factor depending on f). In order to prove this you need some extra tools I’m too lazy to write about in this post, but the point is that it works.

The reason I did not start by describing MCMC in terms of estimating the expected value of a function is because the core problem is a sampling problem. Moreover, there are many applications of MCMC that need nothing more than a sample. For example, MCMC can be used to estimate the volume of an arbitrary (maybe high dimensional) convex set. See these lecture notes of Alistair Sinclair for more.

If demand is popular enough, I could implement the Metropolis-Hastings algorithm in code (it wouldn’t be industry-strength, but perhaps illuminating? I’m not so sure…).

Until next time!