# Two’s Complement and Group Theory

Before I discovered math, I was a first year undergrad computer science student taking Electrical Engineering 101. The first topic I learned was what bits and boolean gates are, and the second was the two’s complement representation of a negative n-bit integer.

At the time two’s complement seemed to me like a bizarre quirk of computer programming, with minutiae you just had to memorize. If the leading bit is 1, it’s negative, and otherwise it’s positive. To negate a number, flip the bits and add one. Except of course the number 1000 0000, which in 8-bit two’s complement is -128, because its “negative” according to this operation is also 1000 0000 = -128.

The quirky negation operation is usually explained by appealing to the “borrowing” operation from elementary school subtraction. Or an appeal to a sense of shared helplessness. That’s just the way things are in two’s complement. But it doesn’t quite explain why one can use the same boolean circuit to compute addition and multiplication of unsigned and two’s complement signed integers.

Math can provide a different, clearer explanation for the quirkiness. It’s not arbitrary, but forced. The explanation uses group theory, which I wrote a primer about a long time ago. The main ideas from group theory used here are cyclic groups, quotient groups, group isomorphisms, and group actions.

The first insight is that the signed integers are the exact same as the unsigned integers as a quotient group, just with a different equivalence class representative chosen for the arithmetic.

The set of $n$-bit unsigned integers $\mathbb{Z}/2^n\mathbb{Z} = \{ 0, 1, \dots, 2^n-1 \}$ forms a group under addition modulo $2^n$. The set of $n$-bit signed integers $\{ -2^{n-1}, \dots, -1, 0, 1, \dots, 2^{n-1} – 1 \}$ also forms a group, but it’s harder to see why. The operation is still “addition modulo $2^n$”, but you have to recall the definition of a quotient group to see why.

The elements of the quotient group $\mathbb{Z}/2^n \mathbb{Z}$ aren’t integers, but rather equivalence classes of integers. We denote $[x] \in \mathbb{Z}/2^n \mathbb{Z}$ as the equivalence class of $x \in \mathbb{Z}$. For example, the “element” $1$ is the equivalence class $[1] = \{ \dots, -2^n+1, 1, 2^n + 1, 2\cdot 2^n + 1, \dots \}$. And the whole point of defining equivalence classes is that you can choose any representative element from an equivalence class, do the arithmetic using that element, and you get the same (equivalence class) output as you would if you had used the “usual” representative. In other words, it’s guaranteed that $[x+y] = [x] + [y] = [x’] + [y’] = [x’ + y’]$, so long as $[x] = [x’]$ and $[y] = [y’]$.

Signed integers are exactly this: choosing a different representative for some of the equivalence classes. The element $[2^{n-1}] = \{ \dots, -2\cdot 2^{n-1}, -2^{n-1}, 2^{n-1}, 2\cdot 2^{n-1}, \dots \}$ is represented by $-2^{n-1}$, the element $[2^{n-1} + 1]$ is represented by $-2^{n-1} + 1$, all the way up to $2^n – 1$ being represented by $-1$. This is a formal way to say two’s complement is a “reinterpretation” of the bits of an unsigned integer. It’s not an arbitrary reinterpretation, but one that satisfies the exact same arithmetic structure as the unsigned interpretation.

So any circuit representation of an addition operation that works for unsigned integers can also be used for addition for all possible alternative choices of equivalence class representatives. Note this has nothing to do with the way these numbers are represented as bits. It’s a guarantee of the mathematics underlying modular arithmetic. It would work if the numbers were represented in ternary or base 10 (excluding the fact that there would be some extra inputs and outputs not in the set). Also note it applies equally well to multiplication, since everything said here applies equally well to the quotient ring of integers modulo $2^n$.

It also makes it clear that the choice of $-2^{n-1}$ as the smallest negative number is arbitrary, and one could equally well include $2^n-1$ as the largest positive, leaving $-2^{n-1} + 1$ as the smallest negative. I think the current standard choice ($-2^{n-1}$ as the smallest negative number) is objectively better than this alternative, because it allows one to use the leading sign bit as negativeness/overflow/underflow detector.

Now we can study the negation operation in this group. For starters, for each element $x$ except $-2^{n-1}$, the number $-x$ is already a signed $n$-bit integer. So negation is just negation (we’ll get back to how this is interpreted as bits in a second). And if we think of $-2^{n-1}$ as an equivalence class, it’s the same as $2^{n-1}$, which is its own inverse as an unsigned integer. Add it to itself and you get $2^n \equiv 0 \mod 2^n$. Since equivalence classes behave identically no matter the representative, $-2^{n-1}$ must be its own inverse, and hence $-(-2^{n-1}) = -2^{n-1}$ as signed integers.

To interpret it as bits, again lift back to unsigned integers, and notice that “negation” for an unsigned integer $x$—that is, finding the value $y$ such that $x+y \equiv 0 \mod 2^n$—is computed via $-[x] = [2^n – x]$ (subtraction inside the equivalence class is defined even though it’s being used to define $-[x]$, because inside an equivalence class the arithmetic is just arithmetic on integers). But because $2^n$ isn’t representable in concrete $n$-bit numbers (it’s just zero), to compute it entirely within $n$-bit integers, we need to break the arithmetic down further:

$$[2^n – x ] = [ 2^n – 1 – x + 1 ] = [ 2^n – 1 – x ] + [ 1 ]$$

$2^n-1$ is the string of all 1’s, hence $2^n – 1 – x$ is equivalent to flipping the bits of $x$, and adding 1 finishes the calculation.

Finally, this $-2^{n-1}$-is-its-own-inverse business. If you view the group of unsigned integers as a circle—a natural choice because addition “wraps around” like clock math—the negation operation can be seen as a reflection across the line passing through $0$ and the circle’s center. Naturally this makes the negation of $0$ equal to $0$. The other point that line of symmetry passes through is $[2^{n-1}] = [-2^{n-1}]$, and so it must also fix $-2^{n-1}$.

This is explanation enough, but the pesky question is whether it’s forced. Is there some other representation better than two’s complement for which this behavior doesn’t occur? So long as our number system has an even size and 0 is its own inverse, then simple counting shows it’s impossible. We’d have an odd number $2^n – 1$ of nonzero elements, and the inverse operation would match them up in pairs, producing an even total unless one of the “pairs” is a singleton.

Aside: The Wikipedia article on this topic had a somewhat sloppy explanation using group actions (really, Burnside’s lemma), which, while correct, is like using the Millennium Falcon to shoot a sparrow. I simplified it to the above counting argument.

# Modulus Switching in LWE

The Learning With Errors problem is the basis of a few cryptosystems, and a foundation for many fully homomorphic encryption (FHE) schemes. In this article I’ll describe a technique used in some of these schemes called modulus switching.

In brief, an LWE sample is a vector of values in $\mathbb{Z}/q\mathbb{Z}$ for some $q$, and in LWE cryptosystems an LWE sample can be modified so that it hides a secret message $m$. Modulus switching allows one to convert an LWE encryption from having entries in $\mathbb{Z}/q{Z}$ to entries in some other $\mathbb{Z}/q'{Z}$, i.e., change the modulus from $q$ to $q’ < q$.

The reason you’d want to do this are a bit involved, so I won’t get into them here and instead back-reference this article in the future.

## LWE encryption

Briefly, the LWE encryption scheme I’ll use has the following parameters:

• A plaintext space $\mathbb{Z}/q\mathbb{Z}$, where $q \geq 2$ is a positive integer. This is the space that the underlying message comes from.
• An LWE dimension $n \in \mathbb{N}$.
• A discrete Gaussian error distribution $D$ with a mean of zero and a fixed standard deviation.

An LWE secret key is defined as a vector in $\{0, 1\}^n$ (uniformly sampled). An LWE ciphertext is defined as a vector $a = (a_1, \dots, a_n)$, sampled uniformly over $(\mathbb{Z} / q\mathbb{Z})^n$, and a scalar $b = \langle a, s \rangle + m + e$, where $e$ is drawn from $D$ and all arithmetic is done modulo $q$.

Without the error term, an attacker could determine the secret key from a polynomial-sized collection of LWE ciphertexts with something like Gaussian elimination. The set of samples looks like a linear (or affine) system, where the secret key entries are the unknown variables. With an error term, the problem of solving the system is believed to be hard, and only exponential time/space algorithms are known.

However, the error term in an LWE encryption encompasses all of the obstacles to FHE. For starters, if your message is $m=1$ and the error distribution is wide (say, a standard deviation of 10), then the error will completely obscure the message from the start. You can’t decrypt the LWE ciphertext because you can’t tell if the error generated in a particular instance was 9 or 10. So one thing people do is have a much smaller cleartext space (actual messages) and encode cleartexts as plaintexts by putting the messages in the higher-order bits of the plaintext space. E.g., you can encode 10-bit messages in the top 10 bits of a 32-bit integer, and leave the remaining 22 bits of the plaintext for the error distribution.

Moreover, for FHE you need to be able to add and multiply ciphertexts to get the corresponding sum/product of the underlying plaintexts. One can easily see that adding two LWE ciphertexts produces an LWE ciphertext of the sum of the plaintexts (multiplication is harder and beyond the scope of this article). Summing ciphertexts also sums the error terms together. So the error grows with each homomorphic operation, and eventually the error may overtake the message, at which point decryption fails. How to deal with this error accumulation is 99% of the difficulty of FHE.

Finally, because the error can be negative, even if you store a message in the high-order bits of the plaintext, you can’t decrypt by simply clearing the low order error bits. In that case an error of -1 would result in a corrupted message. Instead, to decrypt, we round the value $b – \langle a, s \rangle = m + e$ to the nearest multiple of $2^k$, where $k$ is the number of bits “reserved” for error, as described above. In particular, decryption will only succeed if the error is small enough in absolute value. So to make this work in practice, one must coordinate the encoding scheme (how many bits to reserve for error), the dimension of the vector $a$, and the standard deviation of the error distribution.

## Modulus switching

With a basic outline of an LWE ciphertext, we can talk about modulus switching.

Start with an LWE ciphertext for the plaintext $m$. Call it $(a_1, \dots, a_n, b) \in (\mathbb{Z}/q\mathbb{Z})^{n+1}$, where

$\displaystyle b = \left ( \sum_{i=1}^n a_i s_i \right ) + m + e_{\textup{original}}$

Given $q’ < q$, we would like to produce a vector $(a’_1, \dots, a’_n, b’) \in (\mathbb{Z}/q’\mathbb{Z})^{n+1}$ (all that has changed is I’ve put a prime on all the terms to indicate which are changing, most notably the new modulus $q’$) that also encrypts $m$, without knowing $m$ or $e_{\textup{original}}$, i.e., without access to the secret key.

Failed attempt: why not simply reduce each entry in the ciphertext vector modulo $q’$? That would set $a’_i = a_i \mod q’$ and $b’ = b \mod q’$. Despite the fact that this operation produces a perfectly valid equation, it won’t work. The problem is that taking $m \mod q’$ destroys part or all of the underlying message. For example, say $x$ is a 12-bit number stored in the top 12 bits of the plaintext, i.e., $m = x \cdot 2^{20}$. If $q’ = 2^{15}$, then the message is a multiple of $q’$ already, so the proposed modulus produces zero.

For this reason, we can’t hope to perfectly encrypt $m$, as the output ciphertext entries may not have a modulus large enough to represent $m$ at all. Rather, we can only hope to encrypt something like “the message $x$ that’s encoded in $m$, but instead with $x$ stored in lower order bits than $m$ originally used.” In more succinct terms, we can hope to encrypt $m’ = m q’ / q$. Indeed, the operation of $m \mapsto m q’ / q$ shifts up by $\log_2(q’)$ many bits (temporarily exceeding the maximum allowable bit length), and then shifting down by $\log_2(q)$ many bits.

For example, say the number $x=7$ is stored in the top 3 bits of a 32-bit unsigned integer ($q = 2^{32}$), i.e., $m = 7 \cdot 2^{29}$ and $q’ = 2^{10}$. Then $m q’ / q = 7 \cdot 2^{29} \cdot 2^{10} / 2^{32} = 7 \cdot 2^{29+10 – 32} = 7 \cdot 2^7$, which stores the same underlying number $x=7$, but in the top three bits of a 10-bit message. In particular, $x$ is in the same “position” in the plaintext space, while the plaintext space has shrunk around it.

Side note: because of this change to the cleartext-to-plaintext encoding, the decryption/decoding steps before and after a modulus switch are slightly different. In decryption you use different moduli, and in decoding you round to different powers of 2.

So the trick is instead to apply $z \mapsto z q’ / q$ to all the entries of the LWE ciphertext vector. However, because the entries like $a_i$ use the entire space of bits in the plaintext, this transformation will not necessarily result in an integer. So we can round the result to an integer and analyze that. The final proposal for a modulus switch is

$\displaystyle a’_i = \textup{round}(a_i q’ / q)$

$\displaystyle b’ = \textup{round}(b q’ / q)$

Because the error growth of LWE ciphertexts permeates everything, in addition to proving this transformation produces a valid ciphertext, we also have to understand how it impacts the error term.

## Analyzing the modulus switch

The statement summarizing the last section:

Theorem: Let $\mathbf{c} = (a_1, \dots, a_n, b) \in (\mathbb{Z}/q\mathbb{Z})^{n+1}$ be an LWE ciphertext encrypting plaintext $m$ with error term $e_\textup{original}$. Let $q’ < q$. Then $c’ = \textup{round}(\mathbf{c} q’ / q)$ (where rounding is performed entrywise) is an LWE encryption of $m’ = m q’ / q$, provided $m’$ is an integer.

Proof. The only substantial idea is that $\textup{round}(x) = x + \varepsilon$, where $|\varepsilon| \leq 0.5$. This is true by the definition of rounding, but that particular way to express it allows us to group the error terms across a sum-of-rounded-things in isolation, and then everything else has a factor of $q’/q$ that can be factored out. Let’s proceed.

Let $c’ = (a’_1, \dots, a’_n, b’)$, where $a’_i = \textup{round}(a_i q’ / q)$ and likewise for $b’$. need to show that $b’ = \left ( \sum_{i=1}^n a’_i s_i \right ) + m q’ / q + e_{\textup{new}}$, where $e_{\textup{new}}$ is a soon-to-be-derived error term.

Expanding $b’$ and using the “only substantial idea” above, we get

$\displaystyle b’ = \textup{round}(b q’ / q) = bq’/q + \varepsilon_b$

For some $\varepsilon_b$ with magnitude at most $1/2$. Continuing to expand, and noting that $b$ is related to the $a_i$ only modulo $q$, we have

\displaystyle \begin{aligned} b’ &= bq’/q + \varepsilon_b \\ b’ &= \left ( \left ( \sum_{i=1}^n a_i s_i \right ) + m + e_{\textup{original}} \right ) \frac{q’}{q} + \varepsilon_b \mod q \end{aligned}

Because we’re switching moduli, it makes sense to rewrite this over the integers, which means we add a term $Mq$ for some integer $M$ and continue to expand

\displaystyle \begin{aligned} b’ &= \left ( \left ( \sum_{i=1}^n a_i s_i \right ) + m + e_{\textup{original}} + Mq \right ) \frac{q’}{q} + \varepsilon_b \\ &= \left ( \sum_{i=1}^n \left ( a_i \frac{q’}{q} \right) s_i \right ) + m \frac{q’}{q} + e_{\textup{original}}\frac{q’}{q} + Mq \frac{q’}{q} + \varepsilon_b \\ &= \left ( \sum_{i=1}^n \left ( a_i \frac{q’}{q} \right) s_i \right ) + m’ + e_{\textup{original}}\frac{q’}{q} + Mq’ + \varepsilon_b \end{aligned}

The terms with $a_i$ are still missing their rounding, so, just like $b’$, rewrite $a’_i = a_i q’/q + \varepsilon_i$ as $a_i q’/q = a’_i – \varepsilon_i$, expanding, simplifying, and finally reducing modulo $q’$ to get

\displaystyle \begin{aligned} b’ &= \left ( \sum_{i=1}^n \left ( a’_i – \varepsilon_i \right) s_i \right ) + m’ + e_{\textup{original}}\frac{q’}{q} + Mq’ + \varepsilon_b \\ &= \left ( \sum_{i=1}^n a’_i s_i \right ) – \left ( \sum_{i=1}^n \varepsilon_i s_i \right) + m’ + e_{\textup{original}}\frac{q’}{q} + Mq’ + \varepsilon_b \\ &= \left ( \sum_{i=1}^n a’_i s_i \right ) + m’ + Mq’ + \left [ e_{\textup{original}}\frac{q’}{q} – \left ( \sum_{i=1}^n \varepsilon_i s_i \right) + \varepsilon_b \right ] \\ &= \left ( \sum_{i=1}^n a’_i s_i \right ) + m’ + \left [ e_{\textup{original}}\frac{q’}{q} – \left ( \sum_{i=1}^n \varepsilon_i s_i \right) + \varepsilon_b \right ] \mod q’ \end{aligned}

Define the square bracketed term as $e_{\textup{new}}$, and we have proved the theorem.

$\square$

The error after modulus switching is laid out. It’s the original error scaled, plus at most $n+1$ terms, each of which is at most $1/2$. However, note that this is larger than it appears. If the new modulus is, say, $q’=1024$, and the dimension is $n = 512$, then in the worst case the error right after modulus switching will leave us only $1$ bit left for the message. This is not altogether unrealistic, as production (128-bit) security parameters for LWE put $n$ around 600. But it is compensated for by the fact that the secret $s$ is chosen uniformly at random, and the errors are symmetric around zero. So in expectation only half the bits will be set, and half of the set bits will have a positive error, and half a negative error. Using these facts, you can bound the probability that the error exceeds, say, $\sqrt{n \log n}$ using a standard Hoeffding bound argument. I further believe that the error is bounded by $\sqrt{n}$. I have verified it empirically, but haven’t been able to quite nail down a proof.

Until next time!

# Hashing to Estimate the Size of a Stream

Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory.

Solution: (in python)

import random

def randomHash(modulus):
a, b = random.randint(0,modulus-1), random.randint(0,modulus-1)
def f(x):
return (a*x + b) % modulus
return f

def average(L):
return sum(L) / len(L)

def numDistinctElements(stream, numParallelHashes=10):
modulus = 2**20
hashes = [randomHash(modulus) for _ in range(numParallelHashes)]
minima = [modulus] * numParallelHashes
currentEstimate = 0

for i in stream:
hashValues = [h(i) for h in hashes]
for i, newValue in enumerate(hashValues):
if newValue &lt; minima[i]:
minima[i] = newValue

currentEstimate = modulus / average(minima)

yield currentEstimate


Discussion: The technique used here is to use random hash functions. The central idea is the same as the general principle presented in our recent post on hashing for load balancing. In particular, if you have an algorithm that works under the assumption that the data is uniformly random, then the same algorithm will work (up to a good approximation) if you process the data through a randomly chosen hash function.

So if we assume the data in the stream consists of $N$ uniformly random real numbers between zero and one, what we would do is the following. Maintain a single number $x_{\textup{min}}$ representing the minimum element in the list, and update it every time we encounter a smaller number in the stream. A simple probability calculation or an argument by symmetry shows that the expected value of the minimum is $1/(N+1)$. So your estimate would be $1/(x_{\textup{min}}+1)$. (The extra +1 does not change much as we’ll see.) One can spend some time thinking about the variance of this estimate (indeed, our earlier post is great guidance for how such a calculation would work), but since the data is not random we need to do more work. If the elements are actually integers between zero and $k$, then this estimate can be scaled by $k$ and everything basically works out the same.

Processing the data through a hash function $h$ chosen randomly from a 2-universal family (and we proved in the aforementioned post that this modulus thing is 2-universal) makes the outputs “essentially random” enough to have the above technique work with some small loss in accuracy. And to reduce variance, you can process the stream in parallel with many random hash functions. This rough sketch results in the code above. Indeed, before I state a formal theorem, let’s see the above code in action. First on truly random data:

S = [random.randint(1,2**20) for _ in range(10000)]

for k in range(10,301,10):
for est in numDistinctElements(S, k):
pass
print(abs(est))

# output
18299.75567190227
7940.7497160166595
12034.154552410098
12387.19432959244
15205.56844547564
8409.913113220158
8057.99978043693
9987.627098464103
10313.862295081966
9084.872639057356
10952.745228373375
10360.569781803211
11022.469475216301
9741.250165892501
11474.896038520465
10538.452261306533
10068.793492995934
10100.266495424627
9780.532155130093
8806.382800033594
10354.11482578643
10001.59202254498
10623.87031408308
9400.404915767062
10710.246772348424
10210.087633885101
9943.64709187974
10459.610972568578
10159.60175069326
9213.120899718839


As you can see the output is never off by more than a factor of 2. Now with “adversarial data.”

S = range(10000) #[random.randint(1,2**20) for _ in range(10000)]

for k in range(10,301,10):
for est in numDistinctElements(S, k):
pass
print(abs(est))

# output

12192.744186046511
15935.80547112462
10167.188106011634
12977.425742574258
6454.364151175674
7405.197740112994
11247.367453263867
4261.854392115023
8453.228233608026
7706.717624577393
7582.891328643745
5152.918628936483
1996.9365093316926
8319.20208545846
3259.0787592465967
6812.252720480753
4975.796789951151
8456.258064516129
8851.10133724288
7317.348220516398
10527.871485943775
3999.76974425661
3696.2999065091117
8308.843106180666
6740.999794281012
8468.603733730935
5728.532232608959
5822.072220349402
6382.349459544548
8734.008940222673


The estimates here are off by a factor of up to 5, and this estimate seems to get better as the number of hash functions used increases. The formal theorem is this:

Theorem: If $S$ is the set of distinct items in the stream and $n = |S|$ and $m > 100 n$, then with probability at least 2/3 the estimate $m / x_{\textup{min}}$ is between $n/6$ and $6n$.

We omit the proof (see below for references and better methods). As a quick analysis, since we’re only storing a constant number of integers at any given step, the algorithm has space requirement $O(\log m) = O(\log n)$, and each step takes time polynomial in $\log(m)$ to update in each step (since we have to compute multiplication and modulus of $m$).

This method is just the first ripple in a lake of research on this topic. The general area is called “streaming algorithms,” or “sublinear algorithms.” This particular problem, called cardinality estimation, is related to a family of problems called estimating frequency moments. The literature gets pretty involved in the various tradeoffs between space requirements and processing time per stream element.

As far as estimating cardinality goes, the first major results were due to Flajolet and Martin in 1983, where they provided a slightly more involved version of the above algorithm, which uses logarithmic space.

Later revisions to the algorithm (2003) got the space requirement down to $O(\log \log n)$, which is exponentially better than our solution. And further tweaks and analysis improved the variance bounds to something like a multiplicative factor of $\sqrt{m}$. This is called the HyperLogLog algorithm, and it has been tested in practice at Google.

Finally, a theoretically optimal algorithm (achieving an arbitrarily good estimate with logarithmic space) was presented and analyzed by Kane et al in 2010.

# A Quasipolynomial Time Algorithm for Graph Isomorphism: The Details

Update 2017-01-09: Laci claims to have found a workaround to the previously posted error, and the claim is again quasipolynoimal time! Updated arXiv paper to follow.

Update 2017-01-04: Laci has posted an update on his paper. The short version is that one small step of his analysis was not quite correct, and the result is that his algorithm is sub-exponential, but not quasipolynomial time. The fact that this took over a year to sort out is a testament to the difficulty of the mathematics and the skill of the mathematicians involved. Even the revised result is still a huge step forward in the analysis of graph isomorphism. Finally, this should reinforce how we should think about progress in mathematics: it comes in fits and starts, with occasional steps backward.

Update 2015-12-13: Laci has posted a preprint on arXiv. It’s quite terse, but anyone who is comfortable with the details sketched in this article should have a fine time (albeit a long time)  reading it.

Update 2015-11-21: Ken Regan and Dick Lipton posted an article with some more details, and a high level overview of how the techniques fit into the larger picture of CS theory.

Update 2015-11-16: Laci has posted the talk on his website. It’s an hour and a half long, and I encourage you to watch it if you have the time 🙂

Laszlo Babai has claimed an astounding theorem, that the Graph Isomorphism problem can be solved in quasipolynomial time (now outdated; see Update 2017-01-04 above). On Tuesday I was at Babai’s talk on this topic (he has yet to release a preprint), and I’ve compiled my notes here. As in Babai’s talk, familiarity with basic group theory and graph theory is assumed, and if you’re a casual (i.e., math-phobic) reader looking to understand what the fuss is all about, this is probably not the right post for you. This post is research level theoretical computer science. We’re here for the juicy, glorious details.

Note: this blog post will receive periodic updates as my understanding of the details improve.

Laci during his lecture. Photo taken by me.

Standing room only at Laci’s talk. My advisor in the bottom right, my coauthor mid-left with the thumbs. Various famous researchers spottable elsewhere.

## Background on Graph Isomorphism

I’ll start by giving a bit of background into why Graph Isomorphism (hereafter, GI) is such a famous problem, and why this result is important. If you’re familiar with graph isomorphism and the basics of complexity theory, skip to the next section where I get into the details.

GI is the following problem: given two graphs $G = (V_G, E_G), H = (V_H, E_H)$, determine whether the graphs are isomorphic, that is, whether there is a bijection $f : V_G \to V_H$ such that $u,v$ are connected in $G$ if and only if $f(u), f(v)$ are connected in $H$. Informally, GI asks whether it’s easy to tell from two drawings of a graph whether the drawings actually represent the same graph. If you’re wondering why this problem might be hard, check out the following picture of the same graph drawn in three different ways.

Indeed, a priori the worst-case scenario is that one would have to try all $n!$ ways to rearrange the nodes of the first graph and see if one rearrangement achieves the second graph. The best case scenario is that one can solve this problem efficiently, that is, with an algorithm whose worst-case runtime on graphs with $n$ nodes and $m$ edges is polynomial in $n$ and $m$ (this would show that GI is in the class P). However, nobody knows whether there is a polynomial time algorithm for GI, and it’s been a big open question in CS theory for over forty years. This is the direction that Babai is making progress toward, showing that there are efficient algorithms. He didn’t get a polynomial time algorithm, but he got something quite close, which is why everyone is atwitter.

It turns out that telling whether two graphs are isomorphic has practical value in some applications. I hear rumor that chemists use it to search through databases of chemicals for one with certain properties (one way to think of a chemical compound is as a graph). I also hear that some people use graph isomorphism to compare files, do optical character recognition, and analyze social networks, but it seems highly probable to me that GI is not the central workhorse (or even a main workhorse) in these fields. Nevertheless, the common understanding is that pretty much anybody who needs to solve GI on a practical level can do so efficiently. The heuristics work well. Even in Babai’s own words, getting better worst-case algorithms for GI is purely a theoretical enterprise.

So if GI isn’t vastly important for real life problems, why are TCS researchers so excited about it?

Well it’s known that GI is in the class NP, meaning if two graphs are isomorphic you can give me a short proof that I can verify in polynomial time (the proof is just a description of the function $f : V_G \to V_H$). And if you’ll recall that inside NP there is this class called NP-complete, which are the “hardest” problems in NP. Now most problems in NP that we care about are also NP-complete, but it turns out GI is not known to be NP-complete either. Now, for all we know P = NP and then the question about GI is moot, but in the scenario that most people believe P and NP are different, so it leaves open the question of where GI lies: does it have efficient algorithms or not?

So we have a problem which is in NP, it’s not known to be in P, and it’s not known to be NP-complete. One obvious objection is that it might be neither. In fact, there’s a famous theorem of Ladner that says if P is different from NP, then there must be problems in NP, not in P, and not NP-complete. Such problems are called “NP-intermediate.” It’s perfectly reasonable that GI is one of these problems. But there’s a bit of a catch.

See, Ladner’s theorem doesn’t provide a natural problem which is NP intermediate; what Ladner did in his theorem was assume P is not NP, and then use that assumption to invent a new problem that he could prove is NP intermediate. If you come up with a problem whose only purpose is to prove a theorem, then the problem is deemed unnatural. In fact, there is no known “natural” NP-intermediate problem (assuming P is not NP). The pattern in CS theory is actually that if we find a problem that might be NP-intermediate, someone later finds an efficient algorithm for it or proves it’s NP-complete. There is a small and dwindling list of such problems. I say dwindling because not so long ago the problem of telling whether an integer is prime was in this list. The symptoms are that one first solves the problem on many large classes of special cases (this is true of GI) or one gets a nice quasipolynomial-time algorithm (Babai’s claimed new result), and then finally it falls into P. In fact, there is even stronger evidence against it being NP-complete: if GI were NP-complete, the polynomial hierarchy would collapse. To the layperson, the polynomial hierarchy is abstruse complexity theoretic technical hoo-hah, but suffice it to say that most experts believe the hierarchy does not collapse, so this counts as evidence.

So indeed, it could be that GI will become the first ever problem which is NP-intermediate (assuming P is not NP), but from historical patterns it seems more likely that it will fall into P. So people are excited because it’s tantalizing: everyone believes it should be in P, but nobody can prove it. It’s right at the edge of the current state of knowledge about the theoretical capabilities and limits of computation.

This is the point at which I will start assuming some level of mathematical maturity.

## The Main Result

Theorem: There is a deterministic algorithm for GI which runs in time $2^{O(\log^c(n))}$ for some constant $c$.

This is an improvement over the best previously known algorithm which had runtime $2^{\sqrt{n \log n}}$. Note the $\sqrt{n}$ in the exponent has been eliminated, which is a huge difference. Quantities which are exponential in some power of a logarithm are called “quasipolynomial.”

But the main result is actually a quasipolynomial time algorithm for a different, more general problem called string automorphism. In this context, given a set $X$ a string is a function from $X$ to some finite alphabet (really it is a coloring of $X$, but we are going to use colorings in the usual sense later so we have to use a new name here). If the set $X$ is given a linear ordering then strings on $X$ really correspond to strings of length $|X|$ over the alphabet. We will call strings $x,y \in X$.

Now given a set $X$ and a group $G$ acting on $X$, there is a natural action of $G$ on strings over $X$, denoted $x^\sigma$, by permuting the indices $x^{\sigma}(i) = x(\sigma(i))$. So you can ask the natural question: given two strings $x,y$ and a representation of a group $G$ acting on $X$ by a set of generating permutations of $G$, is there a $\sigma \in G$ with $x^\sigma = y$? This problem is called the string isomorphism problem, and it’s clearly in NP.

Now if you call $\textup{ISO}_G(x,y)$ the set of all permutations in $G$ that map $x$ to $y$, and you call $\textup{Aut}_G(x) = \textup{ISO}_G(x,x)$, then the actual theorem Babai claims to have proved is the following.

Theorem: Given a generating set for a group $G$ of permutations of a set $X$ and a string $x$, there is a quasipolynomial time algorithm which produces a generating set of the subgroup $\textup{Aut}_G(x)$ of $G$, i.e. the string automorphisms of $x$ that lie in in $G$.

It is not completely obvious that GI reduces to the automorphism problem, but I will prove it does in the next section. Furthermore, the overview of Babai’s proof of the theorem follows an outline laid out by Eugene Luks in 1982, which involves a divide-and-conquer method for splitting the analysis of $\textup{Aut}_G(x)$ into simpler and simpler subgroups as they are found.

## Luks’s program

Eugene Luks was the first person to incorporate “serious group theory” (Babai’s words) into the study of graph isomorphism. Why would group theory help in a question about graphs? Let me explain with a lemma.

Lemma: GI is polynomial-time reducible to the problem of computing, given a graph $X$, a list of generators for the automorphism group of $G$, denoted $\textup{Aut}(X)$.

Proof. Without loss of generality suppose $X_1, X_2$ are connected graphs. If we want to decide whether $X_1, X_2$ are isomorphic, we may form the disjoint union $X = X_1 \cup X_2$. It is easy to see that $X_1$ and $X_2$ are isomorphic if and only if some $\sigma \in \textup{Aut}(X)$ swaps $X_1$ and $X_2$. Indeed, if any automorphism with this property exists, every generating set of $\textup{Aut}(G)$ must contain one.

$\square$

Similarly, the string isomorphism problem reduces to the problem of computing a generating set for $\textup{Aut}_G(x)$ using a similar reduction to the one above. As a side note, while $\textup{ISO}_G(x,y)$ can be exponentially large as a set, it is either the empty set, or a coset of $\textup{Aut}_G(x)$ by any element of $\textup{ISO}_G(x,y)$. So there are group-theoretic connections between the automorphism group of a string and the isomorphisms between two strings.

But more importantly, computing the automorphism group of a graph reduces to computing the automorphism subgroup of a particular group for a given string in the following way. Given a graph $X$ on a vertex set $V = \{ 1, 2, \dots, v \}$ write $X$ as a binary string on the set of unordered pairs $Z = \binom{V}{2}$ by mapping $(i,j) \to 1$ if and only if $i$ and $j$ are connected by an edge. The alphabet size is 2. Then $\textup{Aut}(X)$ (automorphisms of the graph) induces an action on strings as a subgroup $G_X$ of $\textup{Aut}(Z)$ (automorphisms of strings). These induced automorphisms are exactly those which preserve proper encodings of a graph. Moreover, any string automorphism in $G_X$ is an automorphism of $X$ and vice versa. Note that since $Z$ is larger than $V$ by a factor of $v^2$, the subgroup $G_X$ is much smaller than all of $\textup{Aut}(Z)$.

Moreover, $\textup{Aut}(X)$ sits inside the full symmetry group $\textup{Sym}(V)$ of $V$, the vertex set of the starting graph, and $\textup{Sym}(V)$ also induces an action $G_V$ on $Z$. The inclusion is

$\displaystyle \textup{Aut}(X) \subset \textup{Sym}(V)$

induces

$\displaystyle G_X = \textup{Aut}(\textup{Enc}(X)) \subset G_V \subset \textup{Aut}(Z)$

I.e.,

Call $G = G_V$ the induced subgroup of permutations of strings-as-graphs. Now we just have some subgroup of permutations $G$ of $\textup{Aut}(Z)$, and we want to find a generating set for $\textup{Aut}_G(x)$ (where $x$ happens to be the encoding of a graph). That is exactly the string automorphism problem. Reduction complete.

Now the basic idea to compute $\textup{Aut}_G(x)$ is to start from the assumption that $\textup{Aut}_G(x) = G$. We know it’s a subgroup, so it could be all of $G$; in terms of GI if this assumption were true it would mean the starting graph was the complete graph, but for string automorphism in general $G$ can be whatever. Then we try to refute this belief by finding additional structure in $\textup{Aut}_G(x)$, either by breaking it up into smaller pieces (say, orbits) or by constructing automorphisms in it. That additional structure allows us to break up $G$ in a way that $\textup{Aut}_G(x)$ is a subgroup of the product of the corresponding factors of $G$.

The analogy that Babai used, which goes back to graphs, is the following. If you have a graph $X$ and you want to know its automorphisms, one thing you can do is to partition the vertices by degree. You know that an automorphism has to preserve the degree of an individual vertex, so in particular you can break up the assumption that $\textup{Aut}(X) = \textup{Sym}(V)$ into the fact that $\textup{Aut}(X)$ must be a subgroup of the product of the symmetry groups of the pieces of the partition; then you recurse. In this way you’ve hugely reduced the total number of automorphisms you need to consider. When the degrees get small enough you can brute-force search for automorphisms (and there is some brute-force searching in putting the pieces back together). But of course this approach fails miserably in the first step you start with a regular graph, so one would need to look for other kinds of structure.

One example is an equitable partition, which is a partition of vertices by degree relative to vertices in other blocks of the partition. So a vertex which has degree 3 but two degree 2 neighbors would be in a different block than a vertex with degree 3 and only 1 neighbor of degree 2. Finding these equitable partitions (which can be done in polynomial time) is one of the central tools used to attack GI. As an example of why it can be very helpful: in many regimes a Erdos-Renyi random graph has asymptotically almost surely a coarsest equitable partition which consists entirely of singletons. This is despite the fact that the degree sequences themselves are tightly constrained with high probability. This means that, if you’re given two Erdos-Renyi random graphs and you want to know whether they’re isomorphic, you can just separately compute the coarsest equitable partition for each one and see if the singleton blocks match up. That is your isomorphism.

Even still, there are many worst case graphs that resist being broken up by an equitable partition. A hard example is known as the Johnson graph, which we’ll return to later.

For strings the sorts of structures to look for are even more refined than equitable partitions of graphs, because the automorphism group of a graph can be partitioned into orbits which preserve the block structure of an equitable partition. But it still turns out that Johnson graphs admit parts of the automorphism group that can’t be broken up much by orbits.

The point is that when some useful substructure is found, it will “make progress” toward the result by breaking the problem into many pieces (say, $n^{\log n}$ pieces) where each piece has size $9/10$ the size of the original. So you get a recursion in the amount of time needed which looks like $f(n) \leq n^{\log n} f(9n/10)$. If you call $q(n) = n^{\log n}$ the quasipolynomial factor, then solving the recurrence gives $f(n) \leq q(n)^{O(\log n)}$ which only adds an extra log factor in the exponent. So you keep making progress until the problem size is polylogarithmic, and then you brute force it and put the pieces back together in quasipolynomial time.

## Two main lemmas, which are theorems in their own right

This is where the details start to get difficult, in part because Babai jumped back and forth between thinking of the object as a graph and as a string. The point of this in the lecture was to illustrate both where the existing techniques for solving GI (which were in terms of finding canonical graph substructures in graphs) break down.

The central graph-theoretic picture is that of “individualizing” a vertex by breaking it off from an existing equitable partition, which then breaks the equitable partition structure so you need to do some more (polytime) work to further refine it into an equitable partition again. But the point is that you can take all the vertices in a block, pick all possible ways to individualize them by breaking them into smaller blocks. If you traverse these possibilities in a canonical order, you will eventually get down to a partition of singletons, which is your “canonical labeling” of the graph. And if you do this process with two different graphs and you get to different canonical labelings, you had to have started with non-isomorphic graphs.

The problem is that when you get to a coarsest equitable partition, you may end up with blocks of size $\sqrt{n}$, meaning you have an exponential number of individualizations to check. This is the case with Johnson graphs, and in fact if you have a Johnson graph $J(m,t)$ which has $\binom{m}{t}$ vertices and you individualize fewer than $m/10t$ if them, then you will only get down to blocks of size polynomially smaller than $\binom{m}{t}$, which is too big if you want to brute force check all individualizations of a block.

The main combinatorial lemma that Babai proves to avoid this problem is that the Johnson graphs are the only obstacle to finding efficient partitions.

Theorem (Babai 15): If $X$ is a regular graph on $m$ vertices, then by individualizing a polylog number of vertices we can find one of the three following things:

1. A canonical coloring with each color class having at most 90% of all the nodes.
2. A canonical equipartition of some subset of the vertices that has at least 90% of the nodes (i.e. a big color class from (1)).
3. A canonically embedded Johnson graph on at least 90% of the nodes.

[Edit: I think that what Babai means by a “canonical coloring” is an equitable partition of the nodes (not to be confused with an equipartition), but I am not entirely sure. I have changed the language to reflect more clearly what was in the talk as opposed to what I think I understood from doing additional reading.]

The first two are apparently the “easy” cases in the sense that they allow for simple recursion that has already been known before Babai’s work. The hard part is establishing the last case (and this is the theorem whose proof sketch he has deferred for two more weeks). But once you have such a Johnson graph your life is much better, because (for a reason I did not understand) you can recurse on a problem of size roughly the square root of the starting size.

In discussing Johnson graphs, Babai said they were a source of “unspeakable misery” for people who want to solve GI quickly. At the same time, it is a “curse and a blessing,” as once you’ve found a Johnson graph embedded in your problem you can recurse to much smaller instances. This routine to find one of these three things is called the “split-or-Johnson” routine.

The analogue for strings (I believe this is true, but I’m a bit fuzzy on this part) is to find a “canonical” $k$-ary relational structure (where $k$ is polylog in size) with some additional condition on the size of alternating subgroups of the automorphism group of the $k$-ary relational structure. Then you can “individualize” the points in the base of this relational structure and find analogous partitions and embedded Johnson schemes (a special kind of combinatorial design).

One important fact to note is that the split-or-Johnson routine breaks down at $\log^3(n)$ size, and Babai has counterexamples that say his result is tight, so getting GI in P would have to bypass this barrier with a different idea.

The second crucial lemma has to do with giant homomorphisms, and this is the method by which Babai constructs automorphisms that bound $\textup{Aut}_G(x)$ from below. As opposed to the split-or-Johnson lemma, which finds structure to bound the group from above by breaking it into simpler pieces. Fair warning: one thing I don’t yet understand is how these two routines interact in the final algorithm. My guess is they are essentially run in parallel (or alternate), but that guess is as good as wild.

Definition: A homomorphism $\varphi: G \to S_m$ is called giant if the image of $G$ is either the alternating group $A_n$ or else all of $S_m$. I.e. $\varphi$ is surjective, or almost so. Let $\textup{Stab}_G(x)$ denote the stabilizer subgroup of $x \in G$. Then $x$ is called “affected” by $\varphi$ if $\varphi|_{\textup{Stab}_g(x)}$ is not giant.

The central tool in Babai’s algorithm is the dichotomy between points that are affected and those that are not. The ability to decide this property in quasipolynomial time is what makes the divide and conquer work.

There is one deep theorem he uses that relates affected points to giant homomorphisms:

Theorem (Unaffected Stabilizer Theorem): Let $\varphi: G \to S_m$ be a giant homomorphism and $U \subset G$ the set of unaffected elements. Let $G_{(U)}$ be the pointwise stabilizer of $U$, and suppose that $m > \textup{max}(8, 2 + \log_2 n)$. Then the restriction $\varphi : G_{(U)} \to S_m$ is still giant.

Babai claimed this was a nontrivial theorem, not because the proof is particularly difficult, but because it depends on the classification of finite simple groups. He claimed it was a relatively straightforward corollary, but it appears that this does not factor into the actual GI algorithm constructively, but only as an assumption that a certain loop invariant will hold.

To recall, we started with this assumption that $\textup{Aut}_G(x)$ was the entire symmetry group we started with, which is in particular an assumption that the inclusion $\varphi \to S_m$ is giant. Now you want to refute this hypothesis, but you can’t look at all of $S_m$ because even the underlying set $m$ has too many subsets to check. But what you can do is pick a test set $A \subset [m]$ where $|A|$ is polylogarithmic in size, and test whether the restriction of $\varphi$ to the test set is giant in $\textup{Sym}(A)$. If it is giant, we call $A$ full.

Theorem (Babai 15): One can test the property of a polylogarithmic size test set being full in quasipolynomial time in m.

Babai claimed it should be surprising that fullness is a quasipolynomial time testable property, but more surprising is that regardless of whether it’s full or not, we can construct an explicit certificate of fullness or non-fullness. In the latter case, one can come up with an explicit subgroup which contains the image of the projection of the automorphism group onto the symmetry group of the test set. In addition, given two test sets $A,B$, one can efficiently compare the action between the two different test sets. And finding these non-full test sets is what allows one to construct the $k$-ary relations. So the output of this lower bound technique informs the upper bound technique of how to proceed.

The other outcome is that $A$ could be full, and coming up with a certificate of fullness is harder. The algorithm sketched below claims to do it, and it involves finding enough “independent” automorphisms to certify that the projection is giant.

Now once you try all possible test sets, which gives $\binom{m}{k}^2$ many certificates (a quasipolynomial number), one has to aggregate them into a full automorphism of $x$, which Babai assured us was a group theoretic exercise.

The algorithm to test fullness (and construct a certificate) he called the Local Certificates Algorithm. It was sketched as follows: you are given as input a set $A$ and a group $G_A \subset G$ being the setwise stabilizer of $A$ under $\psi_A : G_A \to \textup{Sym}(A)$. Now let $W$ be the group elements affected by $\psi_A$. You can be sure that at least one point is affected. Now you stabilize on $W$ and get a refined subgroup of $G_A$, which you can use to compute newly affected elements, growing $W$ in each step. By the unaffected stabilizer theorem, this preserves gianthood. Furthermore, in each step you get layers of $W$, and all of the stabilizers respect the structure of the previous layers. Babai described this as adding on “layers of a beard.”

The termination of this is either when $W$ stops growing, in which case the projection is giant and $W$ is our certificate of fullness (i.e. we get a rich family of automorphisms that are actually in our target automorphism group), or else we discover the projected ceases to be giant and $W$ is our certificate of non-fullness. Indeed, the subgroup generated by these layers is a subgroup of $\textup{Aut}_G(x)$, and the subgroup generated by the elements of a non-fullness certificate contain the automorphism group.

## Not enough details?

This was supposed to be just a high-level sketch of the algorithm, and Babai is giving two more talks elaborating on the details. Unfortunately, I won’t be able to make it to his second talk in which he’ll discuss some of the core group theoretic ideas that go into the algorithm. I will, however, make it to his third talk in which he will sketch the proof of the split-or-Johnson routine. That is in two weeks from the time of this writing, and I will update this post with any additional insights then.

Babai has not yet released a preprint, and when I asked him he said “soon, soon.” Until then 🙂

This blog post is based on my personal notes from Laszlo Babai’s lecture at the University of Chicago on November 10, 2015. At the time of this writing, Babai’s work has not been peer reviewed, and my understanding of his lectures has large gaps and may be faulty. Do not put your life in danger based on information in this post.