A Spectral Analysis of Moore Graphs

For a fixed integer $r > 0$ and a fixed odd integer $g$, a Moore graph is an $r$-regular graph of girth $g$ which has the minimum number of vertices $n$ among all graphs with the same regularity and girth.

(Recall that the girth of a graph is the length of its shortest cycle, and a graph is regular if all of its vertices have the same degree.)

Problem (Hoffman-Singleton): Find a useful constraint on the relationship between $n$ and $r$ for Moore graphs of girth $5$ and degree $r$.

Note: Excluding the trivial Moore graphs (complete graphs, with girth $g=3$, and odd cycles, with degree $r=2$), there are only two known Moore graphs: (a) the Petersen graph and (b) the Hoffman-Singleton graph:

The solution to the problem shows that there are only a few cases left to check.

Solution: It is easy to show that the minimum number of vertices of a Moore graph of girth $5$ and degree $r$ is $1 + r + r(r-1) = r^2 + 1$. Just consider the tree:

This is the tree for $r = 3$, but the argument is clear for any $r$ from the branching pattern: a root, its $r$ neighbors, and $r(r-1)$ vertices at distance two, giving $1 + r + r(r-1)$ vertices in total, all distinct because the girth is $5$.

Provided $n = r^2 + 1$, we will prove that $r$ must be either $3, 7,$ or $57$. The technique will be to analyze the eigenvalues of a special matrix derived from the Moore graph.

Let $A$ be the adjacency matrix of the supposed Moore graph with these properties. Let $B = A^2 = (b_{i,j})$. Using the girth and regularity we know:

• $b_{i,i} = r$ since each vertex has degree $r$.
• $b_{i,j} = 0$ if $(i,j)$ is an edge of $G$, since any walk of length 2 from $i$ to $j$ together with the edge $(i,j)$ would form a cycle of length 3, which is less than the girth.
• $b_{i,j} = 1$ if $(i,j)$ is not an edge, because (using the tree idea above) every two non-adjacent vertices have exactly one neighbor in common.

Let $J_n$ be the $n \times n$ matrix of all 1’s and $I_n$ the identity matrix. Then

$\displaystyle B = rI_n + J_n - I_n - A.$

We use this matrix equation to generate two equations whose solutions will restrict $r$. Since $A$ is a real symmetric matrix it has an orthonormal basis of eigenvectors $v_1, \dots, v_n$ with eigenvalues $\lambda_1 , \dots, \lambda_n$. Moreover, by regularity we know one of these vectors is the all 1’s vector, with eigenvalue $r$. Call this $v_1 = (1, \dots, 1), \lambda_1 = r$. By orthogonality of $v_1$ with the other $v_i$, we know that $J_nv_i = 0$. We also know that, since $A$ is an adjacency matrix with zeros on the diagonal, the trace of $A$ is $\sum_i \lambda_i = 0$.

Multiply the matrices in the equation above by any $v_i$, $i > 1$ to get

$\displaystyle \begin{aligned}A^2v_i &= rv_i - v_i - Av_i \\ \lambda_i^2v_i &= rv_i - v_i - \lambda_i v_i \end{aligned}$

Rearranging and factoring out $v_i$ gives $\lambda_i^2 + \lambda_i - (r-1) = 0$. Let $z = 4r - 3$; then the non-$r$ eigenvalues must be one of the two roots: $\mu_1 = (-1 + \sqrt{z}) / 2$ or $\mu_2 = (-1 - \sqrt{z})/2$.
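As a sanity check, we can verify both the matrix identity and the non-principal eigenvalues ($1$ and $-2$, with multiplicities $5$ and $4$) numerically on the Petersen graph. This is a sketch using numpy; the edge list below is the standard pentagon-plus-pentagram construction of the Petersen graph.

```python
import numpy as np

# The Petersen graph (r = 3, n = 10): outer 5-cycle, inner pentagram, five spokes.
edges = [(i, (i + 1) % 5) for i in range(5)]            # outer cycle
edges += [(i, i + 5) for i in range(5)]                 # spokes
edges += [(5 + i, 5 + (i + 2) % 5) for i in range(5)]   # inner pentagram

A = np.zeros((10, 10))
for (u, v) in edges:
    A[u, v] = A[v, u] = 1

r, n = 3, 10
I, J = np.eye(n), np.ones((n, n))

# Check the matrix identity B = A^2 = rI + J - I - A from the proof.
assert np.allclose(A @ A, r * I + J - I - A)

# Count eigenvalue multiplicities: r = 3 once, then mu_1 = 1 and mu_2 = -2.
spectrum = list(np.linalg.eigvalsh(A).round(6))
print(spectrum.count(3.0), spectrum.count(1.0), spectrum.count(-2.0))  # 1 5 4
```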

Say that $\mu_1$ occurs $a$ times and $\mu_2$ occurs $b$ times, then $n = a + b + 1$. So we have the following equations.

$\displaystyle \begin{aligned} a + b + 1 &= n \\ r + a \mu_1 + b\mu_2 &= 0 \end{aligned}$

From these equations you can derive that $\sqrt{z}$ must be an integer: substituting $\mu_1, \mu_2$ and $a + b = r^2$ into the second equation gives $(a-b)\sqrt{z} = r^2 - 2r$, and if $\sqrt{z}$ were irrational this would force $a = b$ and $r^2 - 2r = 0$, a degenerate case. So write $\sqrt{z} = m$ for an integer $m$, and as a consequence $r = (m^2 + 3) / 4$. With a tiny bit of extra algebra, this gives

$\displaystyle m(m^3 - 2m - 16(a-b)) = 15$

Implying that $m$ divides $15$, meaning $m \in \{ 1, 3, 5, 15\}$, and as a consequence $r \in \{ 1, 3, 7, 57\}$. Discarding the degenerate case $r = 1$ leaves $r \in \{3, 7, 57\}$, as claimed.
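To see how the divisibility plays out, we can enumerate the divisors of 15 and recover $r = (m^2+3)/4$ together with the multiplicity gap $a - b = (m^4 - 2m^2 - 15)/(16m)$, which is just the displayed equation solved for $a - b$:

```python
# Enumerate divisors m of 15; recover r = (m^2 + 3)/4 and the multiplicity
# gap a - b = (m^4 - 2m^2 - 15)/(16m).
for m in [1, 3, 5, 15]:
    r = (m * m + 3) // 4
    gap = (m**4 - 2 * m**2 - 15) // (16 * m)
    print(m, r, gap)
```

For $m = 3$ (the Petersen graph) this gives $a - b = 1$, matching the multiplicities $a = 5$ and $b = 4$ of the eigenvalues $1$ and $-2$.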

$\square$

Discussion: This is a strikingly clever use of spectral graph theory to answer a combinatorial question. Spectral graph theory is precisely that: the study of what linear algebra can tell us about graphs. For a deeper dive into spectral graph theory, see the guest post I wrote on With High Probability.

If you allow for even girth, there are a few extra (infinite families of) Moore graphs, see Wikipedia for a list.

With additional techniques, one can also disprove the existence of any Moore graphs that are not among the known ones, with the exception of a possible Moore graph of girth $5$ and degree $57$ on $n = 3250$ vertices. It is unknown whether such a graph exists, but if it does, it is known that it cannot be vertex-transitive.

You should go out and find it or prove it doesn’t exist.

Hungry for more applications of linear algebra to combinatorics and computer science? The book Thirty-Three Miniatures is a fantastically entertaining book of linear algebra gems (it’s where I found the proof in this post). The exposition is lucid, and the chapters are short enough to read on my daily train commute.

Zero Knowledge Proofs — A Primer

In this post we’ll get a strong taste for zero knowledge proofs by exploring the graph isomorphism problem in detail. In the next post, we’ll see how this relates to cryptography and the bigger picture. The goal of this post is to get a strong understanding of the terms “prover,” “verifier,” “simulator,” and “zero knowledge” in the context of a specific zero-knowledge proof. Then next time we’ll see how the same concepts (though not the same proof) generalize to a cryptographically interesting setting.

Graph isomorphism

Let’s start with an extended example. We are given two graphs $G_1, G_2$, and we’d like to know whether they’re isomorphic, meaning they’re the same graph, but “drawn” different ways.

The problem of telling if two graphs are isomorphic seems hard. The pictures above, which are all different drawings of the same graph (or are they?), should give you pause if you thought it was easy.

To add a tiny bit of formalism, a graph $G$ is a list of edges, and each edge $(u,v)$ is a pair of integers between 1 and the total number of vertices of the graph, say $n$. Using this representation, an isomorphism between $G_1$ and $G_2$ is a permutation $\pi$ of the numbers $\{1, 2, \dots, n \}$ with the property that $(i,j)$ is an edge in $G_1$ if and only if $(\pi(i), \pi(j))$ is an edge of $G_2$. You swap around the labels on the vertices, and that’s how you get from one graph to another isomorphic one.
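As a quick illustration of this definition, here is a sketch (with hypothetical 4-vertex graphs, not part of the protocol below) of checking whether a given permutation is an isomorphism:

```python
# G1 is a 4-cycle 1-2-3-4; G2 is the same cycle with the labels shuffled.
G1 = [(1, 2), (2, 3), (3, 4), (4, 1)]
G2 = [(2, 4), (4, 3), (3, 1), (1, 2)]

# The candidate permutation: 1 -> 2, 2 -> 4, 3 -> 3, 4 -> 1.
pi = {1: 2, 2: 4, 3: 3, 4: 1}

def isIsomorphism(pi, G1, G2):
    # (i, j) is an edge of G1 iff (pi(i), pi(j)) is an edge of G2,
    # treating edges as unordered pairs.
    edges2 = set(frozenset(e) for e in G2)
    return set(frozenset((pi[i], pi[j])) for (i, j) in G1) == edges2

print(isIsomorphism(pi, G1, G2))  # True
```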

Given two arbitrary graphs as input on a large number of vertices $n$, nobody knows of an efficient—i.e., polynomial time in $n$—algorithm that can always decide whether the input graphs are isomorphic. Even if you promise me that the inputs are isomorphic, nobody knows of an algorithm that could construct an isomorphism. (If you think about it, such an algorithm could be used to solve the decision problem!)

A game

Now let’s play a game. In this game, we’re given two enormous graphs on a billion nodes. I claim they’re isomorphic, and I want to prove it to you. However, my life’s fortune is locked behind these particular graphs (somehow), and if you actually had an isomorphism between these two graphs you could use it to steal all my money. But I still want to convince you that I do, in fact, own all of this money, because we’re about to start a business and you need to know I’m not broke.

Is there a way for me to convince you beyond a reasonable doubt that these two graphs are indeed isomorphic? And moreover, could I do so without you gaining access to my secret isomorphism? It would be even better if I could guarantee you learn nothing about my isomorphism or any isomorphism, because even the slightest chance that you can steal my money is out of the question.

Zero knowledge proofs have exactly those properties, and here’s a zero knowledge proof for graph isomorphism. For the record, $G_1$ and $G_2$ are public knowledge, (common inputs to our protocol for the sake of tracking runtime), and the protocol itself is common knowledge. However, I have an isomorphism $f: G_1 \to G_2$ that you don’t know.

Step 1: I will start by picking one of my two graphs, say $G_1$, mixing up the vertices, and sending you the resulting graph. In other words, I send you a graph $H$ which is chosen uniformly at random from all isomorphic copies of $G_1$. I will save the permutation $\pi$ that I used to generate $H$ for later use.

Step 2: You receive a graph $H$ which you save for later, and then you randomly pick an integer $t$ which is either 1 or 2, with equal probability on each. The number $t$ corresponds to your challenge for me to prove $H$ is isomorphic to $G_1$ or $G_2$. You send me back $t$, with the expectation that I will provide you with an isomorphism between $H$ and $G_t$.

Step 3: Indeed, I faithfully provide you such an isomorphism. If you send me $t=1$, I’ll give you back $\pi^{-1} : H \to G_1$, and otherwise I’ll give you back $f \circ \pi^{-1}: H \to G_2$. Because composing a fixed permutation with a uniformly random permutation is again a uniformly random permutation, in either case I’m sending you a uniformly random permutation.

Step 4: You receive a permutation $g$, and you can use it to verify that $H$ is isomorphic to $G_t$. If the permutation I sent you doesn’t work, you’ll reject my claim, and if it does, you’ll accept my claim.

Before we analyze, here’s some Python code that implements the above scheme. You can find the full, working example in a repository on this blog’s Github page.

First, a few helper functions for generating random permutations (and turning their list-of-zero-based-indices form into a function-of-positive-integers form)

import random

def randomPermutation(n):
    L = list(range(n))
    random.shuffle(L)
    return L

def makePermutationFunction(L):
    return lambda i: L[i - 1] + 1

def makeInversePermutationFunction(L):
    return lambda i: 1 + L.index(i - 1)

def applyIsomorphism(G, f):
    return [(f(i), f(j)) for (i, j) in G]

def numVertices(G):
    # used by the Prover and Verifier below; vertices are labeled 1..n
    return max(max(u, v) for (u, v) in G)


Here’s a class for the Prover, the one who knows the isomorphism and wants to prove it while keeping the isomorphism secret:

class Prover(object):
    def __init__(self, G1, G2, isomorphism):
        '''
        isomorphism is a list of integers representing
        an isomorphism from G1 to G2.
        '''
        self.G1 = G1
        self.G2 = G2
        self.n = numVertices(G1)
        assert self.n == numVertices(G2)

        self.isomorphism = isomorphism
        self.state = None

    def sendIsomorphicCopy(self):
        isomorphism = randomPermutation(self.n)
        pi = makePermutationFunction(isomorphism)

        H = applyIsomorphism(self.G1, pi)

        self.state = isomorphism
        return H

    def proveIsomorphicTo(self, graphChoice):
        randomIsomorphism = self.state
        piInverse = makeInversePermutationFunction(randomIsomorphism)

        if graphChoice == 1:
            return piInverse
        else:
            f = makePermutationFunction(self.isomorphism)
            return lambda i: f(piInverse(i))


The prover has two methods, one for each round of the protocol. The first creates an isomorphic copy of $G_1$, and the second receives the challenge and produces the requested isomorphism.

And here’s the corresponding class for the verifier

class Verifier(object):
    def __init__(self, G1, G2):
        self.G1 = G1
        self.G2 = G2
        self.n = numVertices(G1)
        assert self.n == numVertices(G2)

    def chooseGraph(self, H):
        choice = random.choice([1, 2])
        self.state = H, choice
        return choice

    def accepts(self, isomorphism):
        '''
        Return True if and only if the given isomorphism
        is a valid isomorphism between the randomly
        chosen graph in the first step, and the H presented
        by the Prover.
        '''
        H, choice = self.state
        graphToCheck = [self.G1, self.G2][choice - 1]
        f = isomorphism

        isValidIsomorphism = (graphToCheck == applyIsomorphism(H, f))
        return isValidIsomorphism


Then the protocol is as follows:

def runProtocol(G1, G2, isomorphism):
    p = Prover(G1, G2, isomorphism)
    v = Verifier(G1, G2)

    H = p.sendIsomorphicCopy()
    choice = v.chooseGraph(H)
    witnessIsomorphism = p.proveIsomorphicTo(choice)

    return v.accepts(witnessIsomorphism)


Analysis: Let’s suppose for a moment that everyone is honestly following the rules, and that $G_1, G_2$ are truly isomorphic. Then you’ll always accept my claim, because I can always provide you with an isomorphism. Now let’s suppose I’m actually lying: the two graphs aren’t isomorphic, and I’m trying to fool you into thinking they are. What’s the probability that you’ll rightfully reject my claim?

Well, regardless of what I do, I’m sending you a graph $H$ and you get to make a random choice of $t = 1, 2$ that I can’t control. Since $G_1$ and $G_2$ are not isomorphic, $H$ can be isomorphic to at most one of them, so as long as you make your choice uniformly at random, half of the time I won’t be able to produce a valid isomorphism and you’ll reject. And unless you can actually tell which graph $H$ is isomorphic to (an open problem, but let’s say you can’t), catching me with probability 1/2 is the best you can do.

Maybe the probability 1/2 is a bit unsatisfying, but remember that we can amplify this probability by repeating the protocol over and over again. So if you want to be sure I didn’t cheat and get lucky to within a probability of one-in-one-trillion, you only need to repeat the protocol about 40 times. To be surer than the chance of picking a specific atom at random from all atoms in the universe, only about 270 times.
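The amplification arithmetic can be sketched in a few lines (the function name here is just for illustration):

```python
import math

# Each round, a cheating prover survives with probability at most 1/2,
# so k rounds drive the probability of being fooled below 2^(-k).
def roundsNeeded(errorTolerance):
    return math.ceil(-math.log2(errorTolerance))

print(roundsNeeded(1e-12))  # one-in-a-trillion: 40 rounds
```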

If you want to feel small, think of the number of atoms in the universe. If you want to feel big, think of its logarithm.

Here’s the code that repeats the protocol for assurance.

def convinceBeyondDoubt(G1, G2, isomorphism, errorTolerance=1e-20):
    probabilityFooled = 1

    while probabilityFooled > errorTolerance:
        result = runProtocol(G1, G2, isomorphism)
        assert result
        probabilityFooled *= 0.5
        print(probabilityFooled)


Running it, we see it succeeds

$ python graph-isomorphism.py
0.5
0.25
0.125
0.0625
0.03125
...
<SNIP>
...
1.3552527156068805e-20
6.776263578034403e-21


So it’s clear that this protocol is convincing.

But how can we be sure that there’s no leakage of knowledge in the protocol? What does “leakage” even mean? That’s where this topic is the most difficult to nail down rigorously, in part because there are at least three a priori different definitions! The idea we want to capture is that anything that you can efficiently compute after the protocol finishes (i.e., you have the content of the messages sent to you by the prover) you could have computed efficiently given only the two graphs $G_1, G_2$, and the claim that they are isomorphic.

Another way to say it is that you may go through the verification process and feel happy and confident that the two graphs are isomorphic. But because it’s a zero-knowledge proof, you can’t do anything with that information more than you could have done if you just took the assertion on blind faith. I’m confident there’s a joke about religion lurking here somewhere, but I’ll just trust it’s funny and move on.

In the next post we’ll expand on this “leakage” notion, but before we get there it should be clear that the graph isomorphism protocol will have the strongest possible “no-leakage” property we can come up with. Indeed, in the first round the prover sends a uniform random isomorphic copy of $G_1$ to the verifier, but the verifier can compute such an isomorphism already without the help of the prover. The verifier can’t necessarily find the isomorphism that the prover used in retrospect, because the verifier can’t solve graph isomorphism. Instead, the point is that the probability space of “$G_1$ paired with an $H$ made by the prover” and the probability space of “$G_1$ paired with $H$ as made by the verifier” are equal. No information was leaked by the prover.

For the second round, again the permutation $\pi$ used by the prover to generate $H$ is uniformly random. Since composing a fixed permutation with a uniform random permutation also results in a uniform random permutation, the second message sent by the prover is uniformly random, and so again the verifier could have constructed a similarly random permutation alone.

Let’s make this explicit with a small program. We have the honest protocol from before, but now I’m returning the set of messages sent by the prover, which the verifier can use for additional computation.

def messagesFromProtocol(G1, G2, isomorphism):
    p = Prover(G1, G2, isomorphism)
    v = Verifier(G1, G2)

    H = p.sendIsomorphicCopy()
    choice = v.chooseGraph(H)
    witnessIsomorphism = p.proveIsomorphicTo(choice)

    return [H, choice, witnessIsomorphism]


To say that the protocol is zero-knowledge (again, this is still colloquial) is to say that anything that the verifier could compute, given as input the return value of this function along with $G_1, G_2$ and the claim that they’re isomorphic, the verifier could also compute given only $G_1, G_2$ and the claim that $G_1, G_2$ are isomorphic.

It’s easy to prove this, and we’ll do so with a python function called simulateProtocol.

def simulateProtocol(G1, G2):
    # Construct data drawn from the same distribution as what is
    # returned by messagesFromProtocol
    choice = random.choice([1, 2])
    G = [G1, G2][choice - 1]
    n = numVertices(G)

    isomorphism = randomPermutation(n)
    pi = makePermutationFunction(isomorphism)
    H = applyIsomorphism(G, pi)

    return H, choice, pi


The claim is that the distributions of the outputs of messagesFromProtocol and simulateProtocol are equal. But simulateProtocol will work regardless of whether $G_1, G_2$ are isomorphic. Of course, it’s not convincing to the verifier because the simulating function made the choices in the wrong order, choosing the graph index before making $H$. But the distribution that results is the same either way.

So if you were to use the actual Prover/Verifier protocol outputs as input to another algorithm (say, one which tries to compute an isomorphism of $G_1 \to G_2$), you might as well use the output of your simulator instead. You’d have no information beyond hard-coding the assumption that $G_1, G_2$ are isomorphic into your program. Which, as I mentioned earlier, is no help at all.

In this post we covered one detailed example of a zero-knowledge proof. Next time we’ll broaden our view and see the more general power of zero-knowledge (that it captures all of NP), and see some specific cryptographic applications. Keep in mind the preceding discussion, because we’re going to re-use the terms “prover,” “verifier,” and “simulator” to mean roughly the same things as the classes Prover, Verifier and the function simulateProtocol.

Until then!

Hashing to Estimate the Size of a Stream

Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory.

Solution: (in python)

import random

def randomHash(modulus):
    a, b = random.randint(0, modulus-1), random.randint(0, modulus-1)
    def f(x):
        return (a*x + b) % modulus
    return f

def average(L):
    return sum(L) / len(L)

def numDistinctElements(stream, numParallelHashes=10):
    modulus = 2**20
    hashes = [randomHash(modulus) for _ in range(numParallelHashes)]
    minima = [modulus] * numParallelHashes
    currentEstimate = 0

    for x in stream:
        hashValues = [h(x) for h in hashes]
        for i, newValue in enumerate(hashValues):
            if newValue < minima[i]:
                minima[i] = newValue

        currentEstimate = modulus / average(minima)

        yield currentEstimate


Discussion: The technique used here is random hashing. The central idea is the same as the general principle presented in our recent post on hashing for load balancing: if you have an algorithm that works under the assumption that the data is uniformly random, then the same algorithm will work (up to a good approximation) if you process the data through a randomly chosen hash function.

So if we assume the data in the stream consists of $N$ uniformly random real numbers between zero and one, what we would do is the following. Maintain a single number $x_{\textup{min}}$ representing the minimum element in the list, and update it every time we encounter a smaller number in the stream. A simple probability calculation, or an argument by symmetry, shows that the expected value of the minimum is $1/(N+1)$. So your estimate of $N$ would be $1/x_{\textup{min}} - 1$. (The extra $-1$ does not change much, as we’ll see.) One can spend some time thinking about the variance of this estimate (indeed, our earlier post is great guidance for how such a calculation would work), but since real data is not random we need to do more work. If the elements are actually integers between zero and $k$, then this estimate can be scaled by $k$ and everything basically works out the same.
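We can check the $1/(N+1)$ claim empirically with a quick simulation (a sketch; the constants are arbitrary choices for the demonstration):

```python
import random

# Estimate E[min of N uniform samples] by simulation and compare to 1/(N+1).
N, trials = 100, 20000
empiricalMin = sum(min(random.random() for _ in range(N))
                   for _ in range(trials)) / trials

print(empiricalMin, 1 / (N + 1))  # both close to 0.0099
```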

Processing the data through a hash function $h$ chosen randomly from a 2-universal family (and we proved in the aforementioned post that this modulus thing is 2-universal) makes the outputs “essentially random” enough to have the above technique work with some small loss in accuracy. And to reduce variance, you can process the stream in parallel with many random hash functions. This rough sketch results in the code above. Indeed, before I state a formal theorem, let’s see the above code in action. First on truly random data:

S = [random.randint(1,2**20) for _ in range(10000)]

for k in range(10,301,10):
    for est in numDistinctElements(S, k):
        pass
    print(abs(est))

# output
18299.75567190227
7940.7497160166595
12034.154552410098
12387.19432959244
15205.56844547564
8409.913113220158
8057.99978043693
9987.627098464103
10313.862295081966
9084.872639057356
10952.745228373375
10360.569781803211
11022.469475216301
9741.250165892501
11474.896038520465
10538.452261306533
10068.793492995934
10100.266495424627
9780.532155130093
8806.382800033594
10354.11482578643
10001.59202254498
10623.87031408308
9400.404915767062
10710.246772348424
10210.087633885101
9943.64709187974
10459.610972568578
10159.60175069326
9213.120899718839


As you can see the output is never off by more than a factor of 2. Now with “adversarial data.”

S = range(10000) #[random.randint(1,2**20) for _ in range(10000)]

for k in range(10,301,10):
    for est in numDistinctElements(S, k):
        pass
    print(abs(est))

# output

12192.744186046511
15935.80547112462
10167.188106011634
12977.425742574258
6454.364151175674
7405.197740112994
11247.367453263867
4261.854392115023
8453.228233608026
7706.717624577393
7582.891328643745
5152.918628936483
1996.9365093316926
8319.20208545846
3259.0787592465967
6812.252720480753
4975.796789951151
8456.258064516129
8851.10133724288
7317.348220516398
10527.871485943775
3999.76974425661
3696.2999065091117
8308.843106180666
6740.999794281012
8468.603733730935
5728.532232608959
5822.072220349402
6382.349459544548
8734.008940222673


The estimates here are off by a factor of up to 5, and this estimate seems to get better as the number of hash functions used increases. The formal theorem is this:

Theorem: If $S$ is the set of distinct items in the stream and $n = |S|$ and $m > 100 n$, then with probability at least 2/3 the estimate $m / x_{\textup{min}}$ is between $n/6$ and $6n$.

We omit the proof (see below for references and better methods). As a quick analysis: since we’re only storing a constant number of integers at any given step, the algorithm has space requirement $O(\log m) = O(\log n)$, and each stream element takes time polynomial in $\log(m)$ to process (since we have to compute a multiplication and a modulus with $m$).

This method is just the first ripple in a lake of research on this topic. The general area is called “streaming algorithms,” or “sublinear algorithms.” This particular problem, called cardinality estimation, is related to a family of problems called estimating frequency moments. The literature gets pretty involved in the various tradeoffs between space requirements and processing time per stream element.

As far as estimating cardinality goes, the first major results were due to Flajolet and Martin in 1983, where they provided a slightly more involved version of the above algorithm, which uses logarithmic space.

Later revisions to the algorithm (2003) got the space requirement down to $O(\log \log n)$, which is exponentially better than our solution. And further tweaks and analysis improved the variance bounds to something like a multiplicative factor of $\sqrt{m}$. This is called the HyperLogLog algorithm, and it has been tested in practice at Google.

Finally, a theoretically optimal algorithm (achieving an arbitrarily good estimate with logarithmic space) was presented and analyzed by Kane et al in 2010.

The Čech Complex and the Vietoris-Rips Complex

It’s about time we got back to computational topology. Previously in this series we endured a lightning tour of the fundamental group and homology, then we saw how to compute the homology of a simplicial complex using linear algebra.

What we really want to do is talk about the inherent shape of data. Homology allows us to compute some qualitative features of a given shape, i.e., find and count the number of connected components of a given shape, or the number of “2-dimensional holes” it has. This is great, but data doesn’t come in a form suitable for computing homology. Though they may have originated from some underlying process that follows nice rules, data points are just floating around in space with no obvious connection between them.

Here is a cool example of Thom Yorke, the lead singer of the band Radiohead, whose face was scanned with a laser scanner for their music video “House of Cards.”

Radiohead’s Thom Yorke in the music video for House of Cards (click the image to watch the video).

Given a point cloud such as the one above, our long term goal (we’re just getting started in this post) is to algorithmically discover what the characteristic topological features are in the data. Since homology is pretty coarse, we might detect the fact that the point cloud above looks like a hollow sphere with some holes in it corresponding to nostrils, ears, and the like. The hope is that if the data set isn’t too corrupted by noise, then it’s a good approximation to the underlying space it is sampled from. By computing the topological features of a point cloud we can understand the process that generated it, and Science can proceed.

But it’s not always as simple as Thom Yorke’s face. It turns out the producers of the music video had to actually degrade the data to get what you see above, because their lasers were too precise and didn’t look artistic enough! But you can imagine that if your laser is mounted on a car on a bumpy road, or tracking some object in the sky, or your data comes from acoustic waves traveling through earth, you’re bound to get noise. Or more realistically, if your data comes from thousands of stock market prices then the process generating the data is super mysterious. It changes over time, it may not follow any discernible pattern (though speculators may hope it does), and you can’t hope to visualize the entire dataset in any useful way.

But with persistent homology, so the claim goes, you’d get a good qualitative understanding of the dataset. Your results would be resistant to noise inherent in the data. It also wouldn’t be sensitive to the details of your data cleaning process. And with a dash of ingenuity, you can come up with a reasonable mathematical model of the underlying generative process. You could use that model to design algorithms, make big bucks, discover new drugs, recognize pictures of cats, or whatever tickles your fancy.

But our first problem is to resolve the input data type error. We want to use homology to describe data, but our data is a point cloud and homology operates on simplicial complexes. In this post we’ll see two ways one can do this, and see how they’re related.

The Čech complex

Let’s start with the Čech complex. Given a point set $X$ in some metric space and a number $\varepsilon > 0$, the Čech complex $C_\varepsilon$ is the simplicial complex whose simplices are formed as follows. For each subset $S \subset X$ of points, form an $(\varepsilon/2)$-ball around each point in $S$, and include $S$ as a simplex (of dimension $|S| - 1$) if there is a common point contained in all of the balls. This structure obviously satisfies the definition of a simplicial complex: any subset $S' \subset S$ of a simplex $S$ will also be a simplex. Here is an example of the epsilon balls.

An example of a point cloud (left) and a corresponding choice of (epsilon/2)-balls. To get the Cech complex, we add a k-simplex any time we see a subset of k points with common intersection.  [Image credit: Robert Ghrist]

Let me superscript the Čech complex to illustrate the pieces. Specifically, we’ll let $C_\varepsilon^{j}$ denote all the simplices of dimension up to $j$. In particular, $C_\varepsilon^1$ is a graph where an edge is placed between $x,y$ if $d(x,y) < \varepsilon$, and $C_{\varepsilon}^2$ places triangles (2-simplices) on triples of points whose balls have a three-way intersection.

A topologist will have a minor protest here: the simplicial complex is supposed to resemble the structure inherent in the underlying points, but how do we know that this abstract simplicial complex (which is really hard to visualize!) resembles the topological space we used to make it? That is, $X$ was sitting in some metric space, and the union of these epsilon-balls forms some topological space $X(\varepsilon)$ that is close in structure to $X$. But is the Čech complex $C_\varepsilon$ close to $X(\varepsilon)$? Do they have the same topological structure? It’s not a trivial theorem to prove, but it turns out to be true.

The Nerve Theorem: The homotopy types of $X(\varepsilon)$ and $C_\varepsilon$ are the same.

We won’t remind the readers about homotopy theory, but suffice it to say that when two topological spaces have the same homotopy type, homology can’t distinguish between them. In other words, if homotopy type were too coarse a discriminator for our dataset, then persistent homology would fail us for sure.

So this theorem is a good sanity check: if we want to learn about our point cloud, we can pick an $\varepsilon$ and study the topology of the corresponding Čech complex $C_\varepsilon$. The reason this is called the “Nerve Theorem” is that one can generalize it to an arbitrary family of convex sets. Given some family $F$ of convex sets, the nerve is the complex obtained by adding simplices for mutually overlapping subfamilies in the same way. The more general nerve theorem says that, under suitable conditions on the family $F$ being “nice,” the nerve has the same topological structure as the union of the sets in $F$.

The problem is that Čech complexes are tough to compute. To tell whether there are any 10-simplices (without additional knowledge) you have to inspect all subsets of size 10. In general computing the entire complex requires exponential time in the size of $X$, which is extremely inefficient. So we need a different kind of complex, or at least a different representation to compensate.

The Vietoris-Rips complex

The Vietoris-Rips complex is essentially the same as the Čech complex, except instead of adding a $d$-simplex when there is a common point of intersection of all the $(\varepsilon/2)$-balls, we just do so when all the balls have pairwise intersections. We’ll denote the Vietoris-Rips complex with parameter $\varepsilon$ as $VR_{\varepsilon}$.

Here is an example to illustrate: if you give me three points that are the vertices of an equilateral triangle of side length 1, and I draw $(1/2)$-balls around each point, then they will have all three pairwise intersections but no common point of intersection.

Three balls which intersect pairwise, but have no point of triple intersection. With appropriate parameters, the Cech and V-R complexes are different.

So in this example the Vietoris-Rips complex contains a 2-simplex (a filled-in triangle), while the Čech complex is just the hollow triangle graph.
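To make the picture concrete, here’s a quick numerical check of the claim, a sketch using numpy:

```python
import numpy as np

# The equilateral-triangle example: side length 1, balls of radius 1/2 around each vertex.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
radius = 0.5

# Every pair of balls meets: the vertices are at distance 1 = 2 * radius,
# so each pair of balls touches at the midpoint of the corresponding edge.
for i in range(3):
    for j in range(i + 1, 3):
        assert np.linalg.norm(pts[i] - pts[j]) <= 2 * radius + 1e-9

# But there is no common point: by symmetry the best candidate is the centroid,
# which sits at circumradius 1/sqrt(3) ~ 0.577 from each vertex, more than 1/2 away.
centroid = pts.mean(axis=0)
print(np.linalg.norm(centroid - pts[0]) > radius)  # True
```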

One obvious question is: do we still get the benefits of the nerve theorem with Vietoris-Rips complexes? The answer is no, obviously, because the Vietoris-Rips complex and Čech complex in this triangle example have totally different topology! But everything’s not lost. What we can do instead is compare Vietoris-Rips and Čech complexes with related parameters.

Theorem: For all $\varepsilon > 0$, the following inclusions hold

$\displaystyle C_{\varepsilon} \subset VR_{\varepsilon} \subset C_{2\varepsilon}$

So if the Čech complexes for both $\varepsilon$ and $2\varepsilon$ are good approximations of the underlying data, then so is the Vietoris-Rips complex. In fact, you can make this chain of inclusions slightly tighter, and if you’re interested you can see Theorem 2.5 in this recent paper of de Silva and Ghrist.

Now your first objection should be that computing a Vietoris-Rips complex still requires exponential time, because you have to scan all subsets for the possibility that they form a simplex. It’s true, but one nice thing about the Vietoris-Rips complex is that it can be represented implicitly as a graph. You just include an edge between two points if their corresponding balls overlap. Once we want to compute the actual simplices in the complex we have to scan for cliques in the graph, so that sucks. But it turns out that computing the graph is the first step in other more efficient methods for computing (or approximating) the VR complex.
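For instance, recovering the 2-simplices from the graph representation amounts to finding triangles, which a naive sketch can do by scanning all triples (a hypothetical helper, not from the post’s repository):

```python
from itertools import combinations

# Recover the 2-simplices of a Vietoris-Rips complex from its 1-skeleton
# (an edge list of unordered pairs) by scanning triples for 3-cliques.
def triangles(vertices, edges):
    e = set(edges)
    return [t for t in combinations(vertices, 3)
            if all(frozenset(p) in e for p in combinations(t, 2))]

V = [0, 1, 2, 3]
E = [frozenset(p) for p in [(0, 1), (1, 2), (0, 2), (2, 3)]]
print(triangles(V, E))  # [(0, 1, 2)]
```

The same scan over $k$-element subsets finds the higher simplices, which is exactly where the exponential blowup comes from.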

Let’s go ahead and write a (trivial) program that computes the graph representation of the Vietoris-Rips complex of a given data set.

import numpy
from numpy.linalg import norm
from itertools import combinations

def naiveVR(points, epsilon):
    points = [numpy.array(x) for x in points]
    vrComplex = [(x, y) for (x, y) in combinations(points, 2) if norm(x - y) < 2 * epsilon]
    return numpy.array(vrComplex)


Let’s try running it on a modestly large example: the first frame of the Radiohead music video. It’s got about 12,000 points in $\mathbb{R}^4$ (x,y,z,intensity), and sadly it takes about twenty minutes. There are a couple of ways to make it more efficient. One is to use specially-crafted data structures for computing threshold queries (i.e., find all points within $\varepsilon$ of this point). But those are only useful for small thresholds, and we’re interested in sweeping over a range of thresholds. Another is to invoke approximations of the data structure which give rise to “approximate” Vietoris-Rips complexes.

Other stuff

In a future post we’ll implement a method for speeding up the computation of the Vietoris-Rips complex, since this is the primary bottleneck for topological data analysis. But for now, we’ll settle for the conceptual idea of how Čech complexes and Vietoris-Rips complexes can be used to turn point clouds into simplicial complexes in reasonable ways.

Before we close we should mention that there are other ways to do this. I’ve chosen the algebraic flavor of topological data analysis due to my familiarity with algebra and the work based on this approach. The other approaches have a more geometric flavor, and are based on the Delaunay triangulation, a hallmark of computational geometry algorithms. The two approaches I’ve heard of are called the alpha complex and the flow complex. The downside of these approaches is that, because they are based on the Delaunay triangulation, they have poor scaling in the dimension of the data. Because high dimensional data is crucial, many researchers have been spending their time figuring out how to speed up approximations of the V-R complex. See these slides of Afra Zomorodian for an example.

Until next time!