Testing Polynomial Equality

Problem: Determine if two polynomial expressions represent the same function. Specifically, if p(x_1, x_2, \dots, x_n) and q(x_1, x_2, \dots, x_n) are polynomials with inputs, outputs, and coefficients in a field F, where |F| is sufficiently large, then the problem is to determine whether p(\mathbf{x}) = q(\mathbf{x}) for every \mathbf{x} \in F^n, in time polynomial in the number of bits required to write down p and q.

Solution: Let d be the maximum of the total degrees of p and q. Choose a finite set S \subset F with |S| > 2d. Repeat the following process 100 times:

  1. Choose inputs z_1, z_2, \dots, z_n \in S uniformly at random.
  2. Check if p(z_1, \dots, z_n) = q(z_1, \dots, z_n).

If the two polynomials agree on every trial, accept the claim that they are equal. If they disagree on any input, reject. You will be wrong with probability at most 2^{-100}.
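The whole procedure fits in a few lines of Python. Here’s a minimal sketch, treating p and q as black-box callables over a prime field (the modulus 2^{61}-1 and the lambda examples at the end are our choices, not part of the problem statement):

```python
import random

def polys_equal(p, q, num_vars, trials=100, prime=2**61 - 1):
    """Randomized identity test, treating p and q as black boxes over F_prime.

    By Schwartz-Zippel, each trial wrongly reports agreement with probability
    at most d/prime for polynomials of total degree d, so 100 agreeing trials
    make equality overwhelmingly likely."""
    for _ in range(trials):
        zs = [random.randrange(prime) for _ in range(num_vars)]
        if p(zs) % prime != q(zs) % prime:
            return False  # a disagreeing input is a proof of inequality
    return True  # probably equal

# (x + 1)^2 and x^2 + 2x + 1 are the same polynomial; x^2 + 1 is not.
p = lambda z: (z[0] + 1) ** 2
q = lambda z: z[0] ** 2 + 2 * z[0] + 1
r = lambda z: z[0] ** 2 + 1
```

Working modulo a large prime keeps the intermediate values small; over the integers, evaluating something like (x+1)^{1000} at a random point would produce astronomically large numbers.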

Discussion: At first glance it’s unclear why this problem is hard.

If you have two representations of polynomials p, q, say expressed in algebraic notation, why can’t you just do the algebra to convert them both into the same format, and see if they’re equal?

Unfortunately, that conversion can take exponential time. For example, suppose you have a polynomial p(x) = (x+1)^{1000}. Though it only takes a few bits to write down, expressing it in a “canonical form,” often the monomial form a_0 + a_1x + \dots + a_d x^d, would require exponentially many bits relative to the size of the original representation. In general, no subexponential-time algorithm is known for transforming polynomials into a “canonical form” so that they can be compared directly.

Instead, the best we know how to do is treat the polynomials as black boxes and plug values into them.

Indeed, for single-variable polynomials it’s well known that a nonzero degree-d polynomial has at most d roots. A similar result holds for polynomials in many variables, so we can apply it to the polynomial p - q to determine whether p = q. This result is so important (and easy to prove) that it has earned the name of lemma.

The Schwartz-Zippel lemma. Let p(x_1, \dots, x_n) be a nonzero polynomial of total degree d \geq 0 over a field F. Let S be a finite subset of F and let z_1, \dots, z_n be chosen independently and uniformly at random from S. Then the probability that p(z_1, \dots, z_n) = 0 is at most d / |S|.

Proof. By induction on the number of variables n. For the case of n=1, it’s the usual fact that a single-variable polynomial can have at most d roots. Now for the inductive step, assume this is true for all polynomials with n-1 variables, and we will prove it for n variables. Write p as a polynomial in the variable x_1, whose coefficients are other polynomials:

\displaystyle p(x_1, \dots, x_n) = \sum_{k=0}^d Q_k(x_2, \dots, x_n) x_1^k

Here we’ve grouped p by the powers of x_1, so that Q_k is the coefficient of x_1^k. This is useful because we’ll be able to apply the inductive hypothesis to one of the Q_k‘s, which have fewer variables.

Indeed, we claim there must be some nonzero Q_k with k > 0. Clearly, since p is not the zero polynomial, some Q_k must be nonzero. If the only nonzero coefficient is Q_0, then p doesn’t depend on x_1 at all, and we can apply the inductive hypothesis directly. Otherwise, take the largest k for which Q_k is nonzero. The degree of Q_k is at most d-k, because the term Q_k x_1^k has total degree at most d.

By the inductive hypothesis, if we choose z_2, \dots, z_n and plug them into Q_k, we get zero with probability at most \frac{d-k}{|S|}. The crucial part is that if this coefficient is nonzero, then p(x_1, z_2, \dots, z_n) is a nonzero polynomial in x_1. This is true even if an unlucky choice of x_1 = z_1 causes the evaluation p(z_1, \dots, z_n) = 0.

To think about it a different way, imagine we’re evaluating the polynomial in phases. In the first phase, we pick z_2, \dots, z_n. (We could also pick z_1 independently but not reveal it yet, for the sake of this story.) Plugging in z_2, \dots, z_n leaves a one-variable polynomial in x_1 whose leading coefficient is Q_k(z_2, \dots, z_n). By the inductive hypothesis, this leading coefficient is zero with probability at most \frac{d-k}{|S|}. (The probability that the entire one-variable polynomial is zero is no larger, since that would require every coefficient to be zero; we’re just considering the leading one for simplicity.)

Indeed, when the leading coefficient is nonzero, the resulting polynomial in x_1 has degree k, so we can apply the single-variable case to it: the probability that it’s zero for a random choice of z_1 is at most k / |S|.

Finally, the probability that p evaluates to zero can be bounded using basic probability algebra. Let A be the event that Q_k(z_2, \dots, z_n) = 0, and B the event that p(z_1, \dots, z_n) = 0.

Then \textup{Pr}[B] = \textup{Pr}[B \textup{ and } A] + \textup{Pr}[B \textup{ and } !A] = \textup{Pr}[B \mid A] \textup{Pr}[A] + \textup{Pr}[B \mid !A] \textup{Pr}[!A].

Note the two quantities above that we don’t know are \textup{Pr}[B \mid A] and \textup{Pr}[!A], so we’ll bound them from above by 1. The remaining factors, \textup{Pr}[A] and \textup{Pr}[B \mid !A], are exactly the two probabilities we bounded by \frac{d-k}{|S|} and \frac{k}{|S|}, and so

\displaystyle  \textup{Pr}[B] \leq \frac{d-k}{|S|} + \frac{k}{|S|} = \frac{d}{|S|},

which proves the lemma.
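As a sanity check on the bound (not part of the proof), here’s a small Python experiment estimating the root-hitting probability for one specific degree-3 polynomial; the polynomial and the choice |S| = 100 are ours:

```python
import random

# p(x, y) = (x - 1)(y - 2)(x + y) is nonzero with total degree 3, so over
# S = {0, ..., 99} the lemma bounds the probability of hitting a root by 3/100.
p = lambda x, y: (x - 1) * (y - 2) * (x + y)
S = range(100)
trials = 200000
zeros = sum(1 for _ in range(trials)
            if p(random.choice(S), random.choice(S)) == 0)
print(zeros / trials)  # about 0.02, safely below the bound of 0.03
```

For this particular polynomial the true probability is about 0.02 (a root needs x = 1, y = 2, or x = y = 0), so the lemma’s bound of 0.03 is not tight here, as expected.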


While this lemma is almost trivial to prove (it’s elementary induction, and the obvious kind), it can be used to solve polynomial identity testing, as well as to find perfect matchings in graphs and to test numbers for primality.

But while the practical questions are largely solved (it’s hard to imagine a setting where you’d need faster primality testing than the existing randomized algorithms), the theory and philosophy of the result are much more interesting.

Indeed, checking two polynomials for equality has no known deterministic polynomial time algorithm. It’s one of a small class of problems, like integer factoring and the discrete logarithm, which are not known to be efficiently solvable, but are also not known to be NP-hard, so there is still hope. The existence of this randomized algorithm increases that hope (integer factorization certainly doesn’t have one!). More generally, the fact that there are so few natural problems in this class makes one wonder whether randomness is actually beneficial at all. If we define “efficient” as polynomial-time, can every problem efficiently solvable with access to random bits also be solved without such access? In computational complexity lingo, does P = BPP? Many experts think the answer is yes.


Voltage, Temperature, and Harmonic Functions

This is a guest post by my friend and colleague Samantha Davies. Samantha is a math Ph.D student at the University of Washington, and a newly minted math blogger. Go check out her blog, With High Probability.

If I said “let’s talk about temperature and voltage”, you might be interested, but few would react the same if instead I suggested an umbrella term: harmonic functions.

Harmonic functions are often introduced as foreign operators that follow a finicky set of rules, when really they’re extremely natural phenomena that everyone has seen before. Why the disconnect then? A complex analysis class won’t take the time to talk about potential energy storage, and a book on data science won’t discuss fluid dynamics. Emphasis is placed on instances of harmonic functions in one setting, instead of looking more broadly at the whole class of functions.


ams.org ©Disney/Pixar

Understanding harmonic functions on both discrete and continuous structures leads to a better appreciation of any setting where these functions are found, whether that’s animation, heat distribution, a network of political leanings, random walks, certain fluid flows, or electrical circuits.


“Harmonic” originated as a descriptor for something undergoing harmonic motion. If a spring with an attached mass is stretched or compressed and then released, the system moves up and down, and the mass undergoes harmonic motion until the system returns to its equilibrium. These repetitive movements are known as oscillations. In physics lingo, simple harmonic motion is a type of oscillation where the force that restores an object to its equilibrium is directly proportional to the displacement.



So, take the mass on a spring as an example: there’s some restoring force which works to bring the spring back to where it would naturally hang still. The system has more potential energy when it’s really stretched out from its equilibrium position, or really smooshed together from its equilibrium position (it’s more difficult to pull a spring really far out or push a spring really close together).

Equations representing harmonic motion have solutions which can be written as functions of sine and cosine, and these became known as harmonics.  Higher dimensional parallels of harmonics satisfy a certain partial differential equation called Laplace’s equation. As mathematicians are so skilled at using the same name for way too many things, today any real valued function with continuous second partial derivatives which satisfies Laplace’s equation is called a continuous harmonic function.

\Delta denotes the Laplace operator, also called the Laplacian. A continuous function u defined on some region \Omega satisfies Laplace’s equation when \Delta u=0, for all points in \Omega. My choice of the word “region” here is technical: being harmonic only makes sense in open, connected, nonempty sets.

The Laplacian is most easily described in terms of Cartesian coordinates, where it’s the sum of all u‘s unmixed second partial derivatives:

\begin{aligned} \Delta u(v)=\sum\limits_{i=1}^n \frac{\partial^2 u}{\partial x_i^2}(v). \end{aligned}

For example, u(x,y)=e^x \sin y is harmonic on all of \mathbb{R}^2 because

\begin{aligned} \Delta u(x,y)= \frac{\partial^2}{\partial x^2} \left (e^x \sin y\right ) + \frac{\partial^2}{\partial y^2} \left (e^x \sin y \right ) =e^x \sin y -e^x \sin y=0. \end{aligned}

Ironically, \cos(x) and \sin(x) are not harmonic in any region in \mathbb{R}, even though they define harmonic motion. Taking the second derivatives, \frac{d^2}{dx^2} \cos(x)=-\cos(x) and \frac{d^2}{dx^2} \sin(x)=-\sin(x) are only 0 at isolated points. In fact, harmonic functions on \mathbb{R} are exactly the linear functions.
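To make this concrete, here’s a quick numerical check using a five-point finite-difference approximation of the Laplacian (the step size h and the test point are arbitrary choices of ours):

```python
import math

def u(x, y):
    return math.exp(x) * math.sin(y)  # the example from above

def laplacian(f, x, y, h=1e-4):
    """Five-point finite-difference estimate of the Laplacian of f at (x, y)."""
    return (f(x + h, y) + f(x - h, y) + f(x, y + h) + f(x, y - h)
            - 4 * f(x, y)) / h**2

print(laplacian(u, 0.5, 1.2))                           # approximately 0: harmonic
print(laplacian(lambda a, b: a * a + b * b, 0.5, 1.2))  # approximately 4: not harmonic
```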

The discrete Laplace operator is an analogue used on discrete structures like grids or graphs. These structures are all basically a set of vertices, where some pairs of vertices have an edge connecting them. A subset of vertices must be designated as the boundary, where the function has specified values called the boundary condition. The rest of the vertices are called interior.

A function u on a graph assigns a value to each interior vertex, while satisfying the boundary condition. There’s also a function \gamma which assigns a value to each edge in a way such that for every vertex, the sum of its outgoing edges’ values is 1 (without loss of generality, otherwise normalize arbitrary nonnegative weights). Keep in mind that the edge weight function is not necessarily symmetric, as it’s not required that \gamma(v,w)=\gamma(w,v). The discrete Laplacian is still denoted \Delta, and the discrete form of Laplace’s equation is described by:

\begin{aligned} (\Delta u)(v) = \sum\limits_{w \in N(v)} \gamma(v,w)\left( u(v)-u(w)\right)=0, \end{aligned}

where N(v) denotes the neighbors of v, the vertices connected to v by an edge.

Below is an example of a harmonic function \alpha on a grid which has symmetric edge weights. The function value is specified on the black boundary vertices, while the white interior vertices take on values that satisfy the discrete form of Laplace’s equation.


(Figure: the grid’s symmetric edge weights and the function values.)

You can easily check that \alpha is harmonic at the corner vertices, but here’s the grosser computation for the 2 most central vertices on the grid (call the left one v_1 and the other v_2):

\begin{aligned} (\Delta \alpha)(v_1) &= \sum\limits_{w \in N(v_1)} \gamma(v_1,w)\left( \alpha(v_1)-\alpha(w)\right)=\frac{1}{4} \left( \frac{24}{55}-0\right)+\frac{1}{4} \left(\frac{24}{55}+4 \right)+\frac{3}{8} \left(\frac{24}{55}-\frac{64}{55}\right)+\frac{1}{8} \left(\frac{24}{55}-8\right )=0,\\ (\Delta \alpha)(v_2) &= \sum\limits_{w \in N(v_2)} \gamma(v_2,w)\left( \alpha(v_2)-\alpha(w)\right)=\frac{1}{8}  \left(\frac{64}{55}-8 \right )+\frac{3}{8} \left(\frac{64}{55}-0\right)+\frac{3}{8} \left(\frac{64}{55}-\frac{24}{55}\right)+\frac{1}{8} \left(\frac{64}{55}-0\right)=0. \end{aligned}

Properties of harmonic functions

The definition of a harmonic function is not enlightening… it’s just an equation. But, harmonic functions satisfy an array of properties that are much more helpful in eliciting intuition about their behavior. These properties also make the functions particularly friendly to work with in mathematical contexts.

Mean value property

The mean value property states that the value of a harmonic function at an interior point is the average of the function’s values around the point. Of course, the definition of “average” depends on the setting. For example, in \mathbb{R} harmonic functions are just linear functions. So given any linear function, say f(x)=3x-2, the value at a point can be found by averaging values on an interval around it: f(1)= \frac14 \int_{-1}^{3} (3x-2)\,dx=1.

In \mathbb{R}^n, if u is harmonic in a region \Omega which contains a ball centered at a, then u(a) is the average of u on the surface of the ball. So in the complex plane, or equivalently \mathbb{R}^2, if the ball is a disk of radius r_0 centered at a, then for all 0 < r < r_0:

\begin{aligned} u(a)=\frac{1}{2 \pi} \int\limits_{0}^{2 \pi} u(a+re^{i \theta}) d \theta. \end{aligned}
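Here’s a quick numerical illustration of that integral for the earlier example u(x,y)=e^x \sin y (the center point and radius below are arbitrary):

```python
import math

def u(x, y):
    return math.exp(x) * math.sin(y)  # the harmonic function from earlier

def circle_average(ax, ay, r, samples=20000):
    """Average u over the circle of radius r centered at (ax, ay)."""
    total = 0.0
    for i in range(samples):
        theta = 2 * math.pi * i / samples
        total += u(ax + r * math.cos(theta), ay + r * math.sin(theta))
    return total / samples

print(circle_average(0.3, 0.7, 0.5))  # matches the center value...
print(u(0.3, 0.7))                    # ...to high precision
```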

In the discrete case, a harmonic function’s value on an interior vertex is the weighted average of all the vertex’s neighbors’ values. If u is harmonic at interior vertex v, then the mean value property states that:

\begin{aligned} u(v)=\sum\limits_{w \in N(v)} \gamma(v,w)u(w). \end{aligned}

In the picture below, the black vertices are the boundary while the white vertices are interior, and the harmonic function \alpha has been defined on the graph. Here, \gamma(v,w)=\frac{1}{\deg(v)}, so this is an example with edge weights that are not necessarily symmetric. For instance, \gamma(v,w)=\frac14 while \gamma(w,v)=\frac13.


(source: Foundations of Data Science, p. 156)

It’s easy to verify \alpha satisfies the mean value property at all interior vertices, but here’s the check for v:

\begin{aligned} \alpha(v)=\sum\limits_{x \in N(v)} \gamma(v,x)\alpha(x)=\frac14 \cdot 5+\frac14 \cdot 5+\frac14 \cdot 5+\frac14 \cdot 1=4. \end{aligned}

For the settings considered here, the mean value property is actually equivalent to being harmonic. If u is a continuous function which satisfies the mean value property on a region \Omega, then u is harmonic in \Omega (proof in appendix). And any function \alpha on a graph which satisfies the mean value property also satisfies the discrete Laplace equation; just rearrange the mean value property to see this.

Maximum principle

The mean value property leads directly to the maximum/minimum principle for harmonic functions. This principle says that a nonconstant harmonic function on a closed and bounded region must attain its maximum and minimum on the boundary. It’s a consequence that’s easy to see in both the continuous and discrete settings: if a max/min were attained at an interior point, that interior point would also be the average of some set of surrounding points. But an average can’t be strictly greater than or less than all of the surrounding values unless the function is constant.



Above are plots of x^2+y^2 and x^2-y^2 on the set [-3,3] \times [-3,3]. x^2+y^2 is nowhere harmonic and x^2-y^2 is harmonic everywhere, which means that x^2-y^2 must satisfy the maximum/minimum principle but there’s no guarantee that x^2+y^2 does. Looking at the graph above, x^2+y^2 doesn’t satisfy the minimum principle because the minimum of x^2+y^2 on the domain is 0, which is achieved at the interior point (0,0). On the same set, the maximum of x^2-y^2 is 9, which is achieved at boundary points (3,0) and (-3,0), and the minimum of x^2-y^2 is -9, which is achieved at boundary points (0,3) and (0,-3).
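A crude numerical check of the principle for x^2-y^2 (the grid resolution is an arbitrary choice):

```python
import itertools

def f(x, y):
    return x * x - y * y  # harmonic everywhere

# sample [-3, 3] x [-3, 3] on a grid and locate a maximizer
points = [(i / 10, j / 10) for i, j in itertools.product(range(-30, 31), repeat=2)]
best = max(points, key=lambda pt: f(*pt))
print(best, f(*best))  # the maximum, 9.0, sits on the boundary of the square
```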


If a function is continuous on the boundary of a closed and bounded set and harmonic in the interior, then its values on the interior are uniquely determined by its values on the boundary. The proof is identical in the discrete and continuous cases: if u_1 and u_2 were two such functions with the same boundary conditions, then u_1-u_2 is harmonic and its boundary values are 0. By the max/min principle, all the interior values are also 0, so u_1 = u_2 on the whole set.

Solution to a Dirichlet problem

A Dirichlet problem is the problem of finding a function which solves some partial differential equation throughout a region, given its values on the region’s boundary. When considering harmonic functions in the discrete or continuous setting, the PDE to be solved is the appropriate form of Laplace’s equation.

So, given a region and a set of values to take on the boundary, does there exist a function which takes the designated values on the boundary and is harmonic in the interior of the region? The answer in the continuous setting is yes, so long as the boundary condition is sufficiently smooth. In the discrete setting, it’s much more complicated, especially on networks with infinitely many vertices. For most practical instances though, like grids or finite graphs with edge weight function \gamma(v,w)=\frac{1}{\deg(v)}, the answer is still yes.
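For finite graphs with \gamma(v,w)=\frac{1}{\deg(v)}, one way to see the “yes” is constructive: repeatedly replace each interior value with the average of its neighbors until the values stabilize. Here’s a minimal Python sketch (the path-graph example is ours, not from the text):

```python
def harmonic_extension(neighbors, boundary, iterations=2000):
    """Solve the discrete Dirichlet problem by repeated averaging.

    neighbors: dict mapping each vertex to a list of its neighbors.
    boundary: dict mapping each boundary vertex to its fixed value.
    Uses edge weights gamma(v, w) = 1/deg(v); on a finite graph this
    iteration converges to the unique harmonic extension."""
    values = {v: boundary.get(v, 0.0) for v in neighbors}
    interior = [v for v in neighbors if v not in boundary]
    for _ in range(iterations):
        for v in interior:
            values[v] = sum(values[w] for w in neighbors[v]) / len(neighbors[v])
    return values

# A path a - b - c - d with boundary values fixed at the endpoints.
nbrs = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
vals = harmonic_extension(nbrs, {'a': 0.0, 'd': 3.0})
print(vals['b'], vals['c'])  # harmonic functions on a path are linear: 1 and 2
```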


I’ll go into detail about a harmonic function on a continuous structure (steady state temperature) and on a discrete structure (voltage). These phenomena are already familiar to us, but weaving previous knowledge together with a new appreciation for harmonic functions is crucial to understanding how they, and other natural instances of harmonic functions, behave.


Suppose you have an insulated object, like a wire, plate, or cube, and apply some temperature on its boundary. The temperature on the boundary will affect the temperature of the interior, and eventually these interior temperatures stabilize to some distribution which doesn’t change as time continues. This is known as the steady state temperature distribution.

A temperature function u of spatial variables x_i and a time variable t satisfies the heat equation

\begin{aligned} \frac{\partial u}{\partial t}-\alpha \Delta u=0, \end{aligned}

where \alpha>0 is a constant which represents the material’s ability to store and conduct thermal energy (these modules for complex analysis have a nice derivation of the heat equation). The requirement that the temperature distribution is steady state yields the necessary condition that \frac{\partial u}{\partial t}=0 for all t and x_i. So steady state solutions of the heat equation must satisfy \Delta u=0, which means they are harmonic.

Intuitively, this is no surprise because any continuous function which satisfies the mean value property is harmonic. Still, temperature is a great example because it creates a familiar visual for a continuous harmonic function. Below is a picture of a steady state temperature distribution on a square metal plate with side length \pi millimeters, where the bottom edge is held at \pi ℃ and all other edges are held at 0℃.



Since in continuous settings the Dirichlet problem always has a unique solution for harmonic functions, there’s a unique temperature function which takes the designated values on the boundary and is harmonic in the interior of the region. That function is:

\begin{aligned} u(x,y)=\sum\limits_{n=0}^{\infty} \frac{4}{2n+1} \sin((2n+1) x) \left (\frac{\sinh((2n+1)(\pi-y))}{\sinh((2n+1) \pi)} \right ). \end{aligned}

(Here’s the derivation of that equation, but it’s too much diff EQ for here.) So the value of u at any interior point, like (1.5,0.5), can be calculated:

 \begin{aligned} u(1.5,0.5) = \sum\limits_{n=0}^{\infty} \frac{4}{2n+1} \sin((2n+1) 1.5) \left (\frac{\sinh((2n+1)(\pi-0.5))}{\sinh((2n+1) \pi)} \right ) \approx 2.17. \end{aligned}
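Here’s a short Python snippet evaluating a partial sum of that series (truncating at 50 terms is an arbitrary choice of ours, but it’s plenty for interior points, since the terms shrink exponentially):

```python
import math

def plate_temperature(x, y, terms=50):
    """Partial sum of the series solution on the pi-by-pi plate whose
    bottom edge is held at pi degrees C and other edges at 0 degrees C."""
    total = 0.0
    for n in range(terms):
        k = 2 * n + 1
        total += (4 / k) * math.sin(k * x) * (
            math.sinh(k * (math.pi - y)) / math.sinh(k * math.pi))
    return total

print(round(plate_temperature(1.5, 0.5), 2))  # 2.17
```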


An electrical network can easily be abstracted to a graph. Any point in the network where circuit elements meet is a vertex, and the wires (or other channels) are edges. Each wire has some amount of conductance. As the name suggests, conductance is a measure for how easy it is for electric charge to pass through the wire.

Edges in the graph are assigned nonnegative values representing the wires’ conductance relative to the total conductance emanating from the vertex. These values are the edge weights in the graph, and again they’re not necessarily symmetric. If c_{v,w} is the conductance of the edge between v and w, and c_v=\sum_{x \in N(v)}c_{v,x} is the total conductance emanating from v, then \gamma(v,w)=\frac{c_{v,w}}{c_v} while \gamma(w,v)=\frac{c_{v,w}}{c_w}.

The ground node in a network is an arbitrary node in the circuit picked to have an electrical potential energy per charge of 0 volts. So, the voltage at a node is the difference in electrical potential energy per charge between that node and the ground node.

The picture below shows an electrical network after the positive and negative terminals of a battery have been attached to two nodes. In this network, the conductance for all the wires is the same, so \gamma(v,w)=\frac{1}{\deg(v)} for all edges (v,w). When the positive and negative terminals of a battery are attached to two nodes in a circuit, a flow of electric charge, called current, is created and carried by electrons or ions.

The vertices representing the nodes attached to the battery are the boundary vertices, and here the node with the negative terminal attached to it is the ground node. After the battery is attached, voltage is induced on the rest of the nodes. The voltage taken on by each node minimizes the total energy lost as heat in the system.


source: Simons Foundation

The reader can quickly double-check that the function is harmonic at all the interior points, or equivalently that the mean value property holds. With a couple extra facts about current, it’s easy to show that the voltage at a node is always the average of its neighbors’ voltages, making voltage a harmonic function (proof in appendix).

Wrap it up

At the end of the day, understanding harmonic functions makes you a better scientist (or better animation artist, apparently). If you want to understand phenomena like potential energy, data modeling, fluid flow, etc, then you’ve got to understand the math that controls it! Check out the appendix for proofs and extra facts that fell somewhere in the intersection of “things I didn’t think you’d care about” and “things I felt guilty leaving out”.


There are tons of references with more information about discrete harmonic functions, but I’m specifically interested in those relating random walks and electrical networks. Here is a book whose 5th chapter presents some introductory material. Also this, from the Simons Foundation, and that, a paper by Benjamini and Lovász, are sources which dive deeper. The piece from the Simons Foundation further explains the relation between harmonic functions and minimizing the loss of energy in a network, like the voltage example with heat.

Again, here is an article about harmonic functions in animation.

These complex analysis modules have sections about temperature distribution and fluid flow. They’re a pretty good tool for all your complex analysis needs, including some sections on harmonic functions in complex analysis.

Harmonic \Longleftrightarrow mean value property

This is obvious in the discrete case, as the discrete Laplace equation can be rearranged to show that the function’s value at a point is just the average of its neighbors’ values. Given a continuous function u which satisfies the mean value property throughout a disk, there exists (by Poisson’s formula) a function v, harmonic in the same disk and equal to u on the boundary. By uniqueness, u=v throughout the disk, so u is harmonic in the disk. For anyone concerned about a region which isn’t a disk, the general fact is that the Dirichlet problem for harmonic functions always has a solution.

Proof that voltage is a harmonic function

Voltage satisfies the mean value property and is therefore harmonic. As before, c_{u,v} is the conductance of the edge between u and v, and c_u=\sum_{x \in N(u)}c_{u,x} is the total conductance emanating from u. Let p_{u,v}=\frac{c_{u,v}}{c_u} be the edge value on the edge from u to v in the graph.

Let i_{u,x} denote the current flowing through node u to node x, v_{u} denote the voltage at node u, c_{u,x} denote the conductance of the edge between u and x, and r_{u,x}=\frac{1}{c_{u,x}} denote the resistance of the edge between u and x. Ohm’s law relates current, resistance, and voltage by stating that

\begin{aligned} i_{u,x} = \frac{v_u-v_x}{r_{u,x}}=\left (v_u-v_x \right )c_{u,x} \end{aligned}.

Kirchhoff’s law states that the total current flowing out of a node is 0, so for any interior node u in the network

\begin{aligned} \sum\limits_{x} i_{u,x}=0. \end{aligned}

Now, we’re ready to show that voltage is a harmonic function.

\begin{aligned} \sum\limits_{x} i_{u,x}=0 \qquad &\Longleftrightarrow \qquad \sum\limits_{x} \left ( v_u-v_x \right ) c_{u,x}=0 \\ \Longleftrightarrow \qquad  \sum\limits_{x}  v_u c_{u,x}= \sum\limits_{x} v_x  c_{u,x} \qquad &\Longleftrightarrow \qquad v_u c_u=\sum\limits_{x} v_x p_{u,x}c_{u}\\ &\Longleftrightarrow \qquad v_u=\sum\limits_{x} v_x p_{u,x}. \end{aligned}
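The same equations can be solved directly with numpy for a toy network (the four-node topology and conductance values below are made up for illustration):

```python
import numpy as np

# Nodes 0..3; entry C[i, j] is the conductance of the wire between
# i and j (0 means no wire). Node 0 is held at 1 volt, node 3 is ground.
C = np.array([
    [0, 2, 1, 0],
    [2, 0, 1, 1],
    [1, 1, 0, 2],
    [0, 1, 2, 0],
], dtype=float)

boundary = {0: 1.0, 3: 0.0}
interior = [u for u in range(len(C)) if u not in boundary]

# Kirchhoff + Ohm at each interior node u: v_u * c_u - sum_x c_{u,x} v_x = 0.
A = np.zeros((len(interior), len(interior)))
b = np.zeros(len(interior))
for i, u in enumerate(interior):
    A[i, i] = C[u].sum()
    for x in range(len(C)):
        if x in boundary:
            b[i] += C[u, x] * boundary[x]
        elif x != u:
            A[i, interior.index(x)] -= C[u, x]

v = np.linalg.solve(A, b)
print(v)  # voltages at nodes 1 and 2; both lie between the boundary values
```

One can check the mean value property directly on the output: node 1’s voltage is the \gamma-weighted average of its neighbors, (2 \cdot 1 + 1 \cdot v_2 + 1 \cdot 0)/4 = 0.6.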

Guest post, “What’s up with graph Laplacians?”

I wrote a guest post for my friend Samantha Davies’s blog, With High Probability. It’s called, What’s up with graph Laplacians?

Go check it out!

Zero-Knowledge: Definitions and Theory

The next Monday, when the fathers were all back at work, we kids were playing in a field. One kid says to me, “See that bird? What kind of bird is that?” I said, “I haven’t the slightest idea what kind of a bird it is.” He says, “It’s a brown-throated thrush. Your father doesn’t teach you anything!” But it was the opposite. He had already taught me: “See that bird?” he says. “It’s a Spencer’s warbler.” (I knew he didn’t know the real name.) “Well, in Italian, it’s a Chutto Lapittida. In Portuguese, it’s a Bom da Peida. In Chinese, it’s a Chung-long-tah, and in Japanese, it’s a Katano Tekeda. You can know the name of that bird in all the languages of the world, but when you’re finished, you’ll know absolutely nothing whatever about the bird. You’ll only know about humans in different places, and what they call the bird. So let’s look at the bird and see what it’s doing—that’s what counts.” I learned very early the difference between knowing the name of something and knowing something.

In the first post in this series, we defined and implemented a simple zero-knowledge proof for graph isomorphism. In the second post, we saw a different protocol with a much heftier tagline: it gives a zero knowledge proof for any problem in the class NP. These two protocols used the same underlying framework—an interaction between a prover and a verifier—but they were actually very different.

Indeed, the graph isomorphism proof was “perfect” in two senses. First, it didn’t require any cryptographic assumptions, whereas the 3-coloring protocol did: nobody knows how to prove the Blum-Blum-Shub function is actually a one-way permutation (at the least, proving this would imply \textup{P} \neq \textup{NP}, so it’s probably hard to prove).

Second, in the graph isomorphism proof, the simulator produced transcripts of the protocol that came from the exact same distribution as the true transcripts created by the prover and verifier. This is why we called it zero-knowledge: anything the verifier can do with the output of the protocol, the simulator could do too. The verifier can’t be making use of the prover’s secret knowledge, since that information isn’t even in the simulator’s universe, but the simulator can still compute what the verifier can.

But I didn’t tell you precisely why the 3-coloring protocol is a zero-knowledge proof, and in particular why it’s different from the graph isomorphism protocol. I also hinted that I had been misleading you, dear reader, as to the full 3-coloring proof of zero-knowledge. So in this post we’ll get into those nitty-gritty definitions that make the math rigorous. We’ll give a short sketch of the proof of zero-knowledge (the full proof would take many pages, not because it’s hard but because there are a lot of annoying details). And then we’ll give an overview of the landscape of theorems and conjectures about zero knowledge.

Information vs. knowledge

You can’t understand where the following definitions come from without the crucial distinction between information and knowledge from the computer scientist’s perspective. Information concerns how many essential bits are encoded in a message, and nothing more. In particular, information is not the same as computational complexity, the amount of computational resources required to actually do something. Knowledge, on the other hand, refers to the computational abilities you gain with the information provided.

Here’s an example in layman’s terms: say I give you a zero-knowledge proof that cancer can be cured using a treatment that takes only five days. Even though I might thoroughly convince you my cure works by exhibiting patients with vanishing tumors, you’ll still struggle to find a cure. This is despite the fact that there might be more bits of information relayed in the messages sent during my “zero-knowledge proof” than the number of bits needed to describe the cure! On the other hand, every proof that 1+1=2 is a zero-knowledge proof, because it’s not computationally difficult to prove this on your own in the first place. You don’t gain any new computational powers even if I tell you flat out what the proof is.

Knowledge is Richard Feynman’s worldview, information is knowing the many names for a bird.

From this perspective information theory is abstruse, though it’s much easier to prove theoretical limits on computation in terms of information theory than in terms of computational complexity. However, the theoretical limits of computation are lower than the limits of information theory, which is the essence of what enables cryptography (from an information-theoretic standpoint, all public-key cryptography is broken!). Reframing what’s possible in terms of computation opens a treasure chest of useful algorithms.

From here we’ll explore three definitions of zero-knowledge. First, we’ll see “perfect” zero knowledge, which is believed to be too strong to be practical. We’ll give a sketch of the “right” proof that graph isomorphism has a perfect zero-knowledge proof. Then we’ll see two relaxations to “statistical” and “computational” zero knowledge. We’ll discuss their theoretical relationships to each other and the rest of the world from a high level.

Perfect zero-knowledge

Recall from our first post, the interactive protocol for graph isomorphism has a very nice property. Say you took all the messages sent back and forth between the prover and verifier, and say you think of those messages as a random outcome of a probabilistic event. Then a simulator could have produced a set of messages drawn from the same distribution, without needing to see the messages from the protocol in the first place! So anything that the verifier could compute as a result of participating in the protocol, the simulator could compute without needing the protocol in the first place. The simulator just needs to assume the truth of the claim. This type of zero-knowledge is called perfect zero-knowledge.

Let’s distill this into some notation. In order to do this, we’ll shift to the theoretical way of discussing a “problem” as a language, i.e., a set of strings. For example, for 3-coloring you fix a method for describing graphs as strings (it doesn’t really matter which method you use), and the language is the set of strings

\displaystyle \{ G : G \textup{ has a valid 3-coloring } \}

Membership in the set is the same as saying an instance of the problem “does G have a 3-coloring?” has a yes answer. Throughout the entire discussion we will fix a generic way to encode any object as a binary string, and we’ll implicitly equate G with the string representing G, and call them both G. In this post I’ll always use L for a language.

Now say that P is an algorithm for the prover (a probabilistic Turing machine of arbitrary complexity), V is the analogous polynomial-time algorithm for the verifier, and M(P,V) is the resulting concatenated string of messages from one run of a zero-knowledge proof. Because P and V may be randomized, M(P,V) is a distribution over strings \{ 0, 1\}^m, and one run of the protocol is equivalent to drawing y \sim \{0,1\}^m according to this distribution. This just means: run an instance of the protocol, which is random, and encode the output messages as a single string.

Now define a simulator as any probabilistic polynomial-time Turing machine S. We interpret the simulator as trying to reproduce the output of the protocol without having access to the protocol’s messages, but with computational limits. The prover, verifier, and simulator all take as input the claim x of length n, and the verifier and simulator run in time polynomial in n. Reminder: in order to be an interactive proof for the language L, if x \in L then the verifier must be convinced with probability 1, and if x \not \in L then the verifier must reject with probability at least 1/2.

Definition (not quite correct): An interactive proof M(P,V) for a language L is said to have perfect zero-knowledge if some simulator S exists such that for every input x \in L, the distributions of the random variables M(P,V)(x) and S(x) are equal.

This definition isn’t quite correct, but it’s correct in spirit so that’s why we used it in our first post. Now let’s see a bit more detail about what really makes a zero-knowledge proof. Let’s remind ourselves of the argument for graph isomorphism, and then see why it’s not strong enough. The protocol we described is summed up as: the prover has G_1, G_2, and sends a random permutation of G_1, called H, to the verifier. The verifier picks a random integer k which is either 1 or 2, and sends it to the prover. The prover produces an isomorphism between G_k and H, and sends the isomorphism back to the verifier, who checks to make sure the isomorphism is valid and accepts (otherwise, rejects).

In our argument, we said the simulator could simulate messages by picking k for the verifier before choosing the random permutation H, and this is the nit we need to pick. The problem is that the prover doesn’t trust the verifier, but the simulator does trust the verifier, and performs actions for the verifier that are faithful to the protocol!

What if a naughty verifier were able to glean more information by misbehaving in the protocol? Say, instead of randomly choosing k as 1 or 2, the verifier might always just pick k=1. Could such a misbehavior give the verifier more knowledge than strictly following the protocol?

Sure, maybe if the verifier did that, he wouldn’t end up with a valid proof (his acceptances and rejections may not be valid or sound), but it might be worth it if he can get a small bit of information about the hidden isomorphism. The goal is to treat the verifier like a hacker, but our simulator only simulates honest verifiers.

So a real zero-knowledge proof needs to account for a misbehaving verifier. One way to do this is to give the simulator black-box access to the verifier V, and prove that the distribution of messages produced by the simulator S(V, x) is equal to the true distribution of messages produced in the actual protocol with the same (possibly misbehaving) verifier, M(P,V)(x). Note that S can still know about the protocol that V is supposed to follow, but he can’t know exactly how V is defined.

Unfortunately, there is no known zero-knowledge proof for a nontrivial problem that satisfies this stronger criterion for perfect zero knowledge, and the graph isomorphism proof in particular doesn’t satisfy it. The way to fix it, which I’ll just say informally, is to allow the simulator to fail with probability at most 1/2. So to summarize, the simulator gets black-box access to the verifier and is guaranteed an input x \in L, and has to sample from the exact distribution of messages as the real protocol, but the simulator is allowed to say “KAPUT!” at most half the time, when it thinks the output will be bad. The distribution of simulator outputs (conditioned on no-KAPUT) should be equal to the true distribution of outputs.

Now we can sketch the real graph isomorphism proof. The simulator is defined as follows:

  1. Pick i \in \{ 1, 2 \} randomly and let H be a random permutation of G_i.
  2. Invoke the verifier V(H) as a subroutine with input H, and call the output k, which is either 1 or 2.
  3. If k \neq i, output “KAPUT!”, otherwise invoke V as a subroutine with the isomorphism H \to G_i as input.
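The three steps above can be sketched in code. This is a minimal illustration, not a real implementation: graphs are sets of vertex pairs, `verifier` is an arbitrary black box returning 1 or 2, and the vertex count `n` is passed explicitly.

```python
import random

def permute_graph(graph, perm):
    """Relabel each edge of `graph` (a set of sorted vertex pairs) by `perm`."""
    return {tuple(sorted((perm[u], perm[v]))) for (u, v) in graph}

def gi_simulator(G1, G2, n, verifier):
    """One run of the simulator. `verifier` is a black box: given a graph H,
    it returns k in {1, 2}. Returns "KAPUT" or the messages (H, perm)."""
    i = random.choice([1, 2])                 # step 1: pick i at random
    perm = list(range(n))
    random.shuffle(perm)                      # a uniformly random permutation
    H = permute_graph(G1 if i == 1 else G2, perm)
    k = verifier(H)                           # step 2: invoke the verifier
    if k != i:
        return "KAPUT"                        # step 3: fail (probability 1/2)
    return (H, perm)                          # perm is the isomorphism G_i -> H
```

Note that the simulator never needs an isomorphism between G_1 and G_2; it only needs the claim that one exists.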

Now to show the message distributions are equal. The message sent in step 2 is produced with a different process from the actual zero knowledge protocol (since the prover just takes a random isomorphic copy of G_1), but since we know G_1, G_2 are isomorphic, the distribution over H is the same in both cases.

Now when we invoke the verifier subroutine, we can’t know what the verifier will choose. But the point is that regardless of what the verifier does, there’s still a 50% chance it will pick k=i since we chose i randomly and independently of the definition of V.

If you ignore that we need to prove a few probability-theory facts (like that the choice of H really is uniformly random), then this completes the proof.


Statistical and computational zero-knowledge

While perfect zero-knowledge is the strongest form of zero-knowledge, it’s believed to be too strong. A more tractable definition is called statistical zero-knowledge, and for this we relax the requirement that the message distributions are equal to the requirement that they’re “very similar.” Informally, by “very similar” I mean that if the output strings have length m, then the sum of the differences of the probabilities is vanishingly small as a function of m.

More formally, if D_1, D_2 are two distributions over \{0,1\}^m, then we say they’re statistically close if the following distance function vanishes (asymptotically in m) faster than 1/m^c for any c > 0.

\displaystyle d(D_1, D_2) = \sum_{x \in \{0,1\}^m} \left | \Pr_{y \sim D_1}[y=x] - \Pr_{y \sim D_2}[y=x] \right |

This distance function just sums, for every outcome, the difference between the probabilities of those outcomes. In other words, being statistically close means that the two distributions can’t disagree on any particular outcome too much, nor can they disagree on most of the outcomes more than a tiny bit.

In little-o notation, statistically close means d(D_1, D_2) = o(1/m^c) for every constant c > 0. It could be like 2^{-m} or something vanishing much slower like 1/m^{\log \log m}. A function of m which has this property is called negligible, and it’s the standard choice of a security guarantee for cryptography.
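For two explicitly given (small) distributions, the distance function d(D_1, D_2) is straightforward to compute. A quick sketch, representing a distribution as a dict mapping outcomes to probabilities:

```python
def statistical_distance(d1, d2):
    """Sum, over all outcomes, of the absolute difference in probability.
    Distributions are dicts {outcome: probability}; missing keys mean 0."""
    outcomes = set(d1) | set(d2)
    return sum(abs(d1.get(x, 0.0) - d2.get(x, 0.0)) for x in outcomes)
```

Of course, for the distributions in a zero-knowledge proof the outcome space has size 2^m, so nobody computes this sum directly; it’s purely a definitional tool.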

Definition: Using the same notation as perfect zero-knowledge, an interactive proof is said to have statistical zero-knowledge if there is a simulator such that for every input x \in L, the distributions of M(P,V)(x) and S(x) are statistically close.

Interestingly, for this definition and the next (computational zero-knowledge), giving the simulator the power to output KAPUT with probability 1/2 doesn’t add any power. It turns out that any KAPUTful simulator can be transformed into a KAPUTless simulator with polynomial overhead. A sketch of how this works: run the simulator until either it does not KAPUT, or else you’ve tried n^2 times with all KAPUTs. The latter event happens with exponentially small probability, so in that case the KAPUTless simulator outputs a uniformly random string (it’s a “don’t care”). This maintains statistical closeness (and, as we’ll see, computational indistinguishability).
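The retry trick can be sketched in a few lines. This assumes, hypothetically, that the simulator signals failure by returning the string "KAPUT" and that its outputs are bit strings of the same length as the input:

```python
import random

def kaputless(simulator, x, tries):
    """Wrap a KAPUTful simulator into one that never KAPUTs: retry up to
    `tries` times, then fall back to a "don't care" random string."""
    for _ in range(tries):
        out = simulator(x)
        if out != "KAPUT":
            return out
    # All attempts KAPUTed: happens with probability at most 2^(-tries),
    # so the junk output below barely perturbs the output distribution.
    return "".join(random.choice("01") for _ in range(len(x)))
```

With `tries` set to n^2 as in the text, the fallback branch fires with probability 2^{-n^2}, which is far below negligible.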

Finally, the weakest form of zero-knowledge isn’t so much a property of the two message distributions as a property of a computationally limited algorithm trying to distinguish between them. We model this by a probabilistic polynomial-time algorithm A which accepts as input the ability to draw from a distribution, and produces as output a single bit. We interpret this bit as A‘s claim that the two distributions are “different.” A‘s goal is to answer correctly. If A can do that with non-negligible probability, then A wins and “distinguishes” between the two input distributions. So we’ll call A the distinguisher. Note that A can do things like draw a \textup{poly}(m) number of samples and do any polynomial-time computation with those samples.

Definition: An interactive proof M(P,V) is said to have computational zero-knowledge if there is a simulator S such that for every input x, and for every probabilistic polynomial-time algorithm A, the following quantity is negligible.

\displaystyle \left | \Pr[A(M(P,V)(x)) = 1] - \Pr[A(S(x)) = 1] \right |

In words, A fails to distinguish between M(P,V)(x) and S(x) with a significant (non-negligible) edge over random guessing. Another way to say it is that the distributions A(M(P,V)(x)) and A(S(x)) are statistically close. So the simulator can’t necessarily produce a statistically close distribution on messages, but the simulator can produce a distribution that fools a computationally limited observer.

In general, two distributions D_1, D_2 are called computationally indistinguishable if they don’t have a distinguisher. I.e., no polynomial-time adversary can distinguish between them with non-negligible probability, in the same way we described above.

So our landscape consists of three classes of problems nested as follows:

Perfect ZK \subset statistical ZK \subset Computational ZK

An interesting question we’ll discuss towards the end of this post is whether these classes of problems are actually different.

Back to 3-coloring

This computational distinguishing property shows up all over cryptography. The reason is simple: almost all cryptographic hardness assumptions can be rephrased as computational indistinguishability properties. The hardness of inverting one-way functions, of predicting pseudorandom generators, all of it boils down to the inability of a polynomial-time adversary to solve a problem.

From this perspective, we can already see why it should be obvious that the zero-knowledge proof for 3-coloring has computational zero knowledge: we used crypto to commit to a coloring, and revealed the secret key later. If the verifier could break the zero-knowledge aspect, then they could defeat the underlying cryptographic primitive. The formal proof of computational zero-knowledge is a drawn-out, detail-oriented quest to entrench this idea in formalism and fit the simulator definition.

Nevertheless, it’s interesting to see where precisely the assumption makes its way into the simulator. Reminder of the protocol:

  1. The prover has a 3-coloring, and publicly commits to a random permutation of that coloring, sends it to the verifier.
  2. The verifier picks an edge uniformly at random, sends it to the prover.
  3. The prover reveals the colors for that edge, sends them to the verifier.
  4. The verifier checks that the revealed colors were indeed the ones committed to, and accepts if they are different colors.

Here’s the graph 3-coloring simulator, which involves some KAPUTing:

  1. Pick n colors c_1, \dots, c_n \in \{ 1,2,3 \} uniformly at random. Commit to these as the “coloring.”
  2. Invoke the verifier with the committed “coloring” as input, and receive an edge (i,j) as output.
  3. If c_i = c_j, KAPUT, else, run the verifier as a subroutine with the true c_i, c_j.

Note that if an honest verifier picks the edge randomly, then it will be properly colored with probability 2/3, and the simulator wins.
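Here is a minimal sketch of the 3-coloring simulator. The commitment scheme is a toy stand-in (a salted hash rather than the one-way-permutation scheme from the earlier posts), and `verifier` is a black box mapping the list of commitments to an edge:

```python
import hashlib
import os
import random

def commit(color):
    """Toy salted-hash commitment; a stand-in for the real scheme."""
    nonce = os.urandom(16)
    return hashlib.sha256(nonce + bytes([color])).hexdigest(), nonce

def coloring_simulator(n, verifier):
    """One run of the simulator on an n-vertex graph. `verifier` is a black
    box taking the list of commitment digests and returning an edge (i, j)."""
    colors = [random.choice([1, 2, 3]) for _ in range(n)]   # step 1
    committed = [commit(c) for c in colors]
    i, j = verifier([digest for digest, _ in committed])    # step 2
    if colors[i] == colors[j]:
        return "KAPUT"                                      # step 3
    # reveal the two colors with their nonces, as the real prover would
    return (colors[i], committed[i][1]), (colors[j], committed[j][1])
```

Against an honest verifier that picks an edge uniformly at random, the two endpoint colors are independent and uniform, so the simulator KAPUTs with probability exactly 1/3.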

A dishonest verifier is trickier, because it gets access to the entire committed coloring. A priori there might be a devious way to select a bad edge. So let’s suppose for the sake of contradiction that the verifier has a method for choosing an improperly colored edge. While it takes a few steps to get there (and a lot more detail), in the end this means that the verifier can determine, with probability at least \frac{1}{3} + \frac{1}{n^c} for some constant c, the difference between a commitment of the bit zero and a commitment of the bit one.

In other words, we can use a verifier that breaks zero-knowledge of this protocol as a bit-commitment-scheme breaker! But the verifier is a probabilistic polynomial-time algorithm, which contradicts the security assumption of a bit-commitment scheme.

Actually, I don’t think I ever formally said what the security assumption for a bit commitment scheme is, since in the previous posts I hadn’t provided the definition of computational indistinguishability. So here it is: a bit-commitment scheme, which is defined by a one-way permutation G with a hard-core predicate f, maps b \mapsto (G(x), f(x) \oplus b) for a uniformly random x \in \{ 0,1 \}^n, and has the property that the distributions committing to zero and one are computationally indistinguishable. That is, the following two distributions:

\textup{commit}_n(0) = \{ (G(x), f(x) \oplus 0): x \in \{ 0,1 \}^n \}

\textup{commit}_n(1) = \{ (G(x), f(x) \oplus 1): x \in \{ 0,1 \}^n \}
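To make the shape of these two distributions concrete, here is a toy sketch. The stand-ins for G and f below are emphatically NOT one-way or hard-core (string reversal is trivially invertible); they only illustrate how a sample from \textup{commit}_n(b) is formed:

```python
import random

def G(x):
    """Toy stand-in for a one-way permutation on n-bit strings (NOT one-way:
    reversing a string is trivially invertible; for illustration only)."""
    return x[::-1]

def f(x):
    """Toy stand-in for a hard-core predicate of G: parity of the bits."""
    return x.count("1") % 2

def commit_sample(b, n=8):
    """Draw one sample from commit_n(b): pick a random x and output
    (G(x), f(x) xor b)."""
    x = "".join(random.choice("01") for _ in range(n))
    return (G(x), f(x) ^ b)
```

For a real scheme, distinguishing `commit_sample(0)` from `commit_sample(1)` better than negligibly would mean predicting the hard-core bit f(x) from G(x).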

So the verifier can pick a bad edge with probability at most 1/3 + \textup{negligible}, which is far less than 1/2. Then you can use the trick we showed earlier to get a good enough simulator that never KAPUTs. And that proves computational zero-knowledge.

Theorems and limitations

Here’s an interesting theorem:

Theorem: Any problem with a zero-knowledge protocol in which neither party uses randomization can be solved in randomized polynomial time (is in BPP).

So the random choices made by the prover/verifier are actually essential to the novelty of this concept.

I haven’t yet said anything about statistical zero-knowledge beyond the definition. It turns out there are a lot of interesting problems with statistical zero-knowledge proofs. I should really expand the discussion of statistical zero knowledge into its own post, but I want to wrap up this series and explore some other projects.

One such problem is pretty meta: the problem of telling whether two distributions are statistically close. In fact, this problem is complete for the class of problems with statistical zero-knowledge proofs, in the same sense that 3-coloring is NP-complete; any problem which has a statistical zero-knowledge proof can be reduced to a question about statistical closeness of two specific distributions. The class of problems with statistical zero-knowledge proofs is usually shortened to SZK (along with PZK for perfect zero-knowledge and CZK for computational zero-knowledge).

In addition to being interesting from a theoretical perspective, it’s a very practical question to study: if you can sample from a distribution efficiently, which properties of that distribution can you estimate efficiently? The completeness result says that the inherent complexity of this class of problems is captured by statistical zero-knowledge proofs.

As a theoretical object of study, SZK has massive advantages over perfect/computational zero-knowledge. In particular,

Theorem: Every interactive proof that has statistical zero-knowledge against an honest verifier can be transformed into a statistical zero-knowledge proof against an arbitrary adversary.

So the entire “gotcha” content of this post, which we had to worry about with perfect zero-knowledge, doesn’t exist for statistical zero-knowledge. It still exists for computational zero-knowledge, as far as I know, because it’s not known whether every computational zero-knowledge proof also has statistical zero-knowledge (a stronger property).

On another front, people are relatively convinced that perfect zero-knowledge is too restrictive to be useful. This comes from the following theorem due to Fortnow.

Theorem: If an NP-complete problem has a statistical zero-knowledge proof, then the polynomial hierarchy collapses.

Without going into detail about what the polynomial hierarchy is, suffice it to say that most people believe a collapse is unlikely. And because this means that SZK probably does not contain all of NP, we can ask natural followup questions like: can problems in SZK be solved by polynomial-time quantum algorithms? It’s also open whether quantum algorithms can solve NP-complete problems.

Statistical zero-knowledge can be specialized even further.

Theorem: Every problem with a statistical zero-knowledge proof has a statistical zero-knowledge proof in which the only messages the verifier sends are random coin flips.

This is a so-called “public coin” protocol, which makes zero-knowledge proofs much easier to reason about.

For further reading, see this introduction by Salil Vadhan, arguably the world’s leading expert on the topic. Particularly interesting are Sections 3-5, in which Vadhan describes in more detail how zero-knowledge has been used as an “interface” for techniques to pass between complexity theory and cryptography.

For further reading on the excruciating details of the definitions and proofs presented in this series, see Oded Goldreich’s Foundations of Cryptography (Volume I, Chapter 4). I have mixed feelings about this book because, despite the fact that I’ve found it enlightening, the book somehow manages to be simultaneously dense and scattered. It leaves no snag unsmoothed, for better or worse.

Until next time!