Encoding Schemes in FHE

In cryptography, we need a distinction between a cleartext and a plaintext. A cleartext is a message in its natural form. A plaintext is a cleartext that is represented in a specific way to prepare it for encryption in a specific scheme. The process of taking a cleartext and turning it into a plaintext is called encoding, and the reverse is called decoding.

In homomorphic encryption, the distinction matters. Cleartexts are generally all integers, though the bit width of allowed integers can be restricted (e.g., 4-bit integers). On the other hand, each homomorphic encryption (HE) scheme has its own special process for encoding and decoding, and since HEIR hopes to support all HE schemes, I set about cataloguing the different encoding schemes. This article is my notes on what they are.

If you’re not familiar with the terms Learning With Errors LWE and and its ring variant RLWE, then you may want to read up on those Wikipedia pages first. These problems are fundamental to most FHE schemes.

Bit field encoding for LWE

A bit field encoding simply places the bits of a small integer cleartext within a larger integer plaintext. An example might be a 3-bit integer cleartext placed in the top-most bits of a 32-bit integer plaintext. This is necessary because operations on FHE ciphertexts accumulate noise, which pollutes the lower-order bits of the corresponding plaintext (BGV is a special case that inverts this, see below).

Many papers in the literature will describe “placing in the top-most bits” as “applying a scaling factor,” which essentially means pick a power of 2 $\Delta$ and encode an integer $x$ as $\Delta x$. However, by using a scaling factor it’s not immediately clear if all of the top-most bits of the plaintext are safe to use.

To wit, the CGGI (aka TFHE) scheme has a slightly more specific encoding because it requires the topmost bit to be zero in order to use its fancy programmable bootstrapping feature. Don’t worry if you don’t know what it means, but just know that in this scheme the top-most bit is set aside.

This encoding is hence most generally described by specifying a starting bit and a bit width for the location of the cleartext in a plaintext integer. The code would look like

plaintext = message << (plaintext_bit_width - starting_bit - cleartext_bit_width)

There are additional steps that come into play when one wants to encode a decimal value in LWE, which can be done with fixed point representations.

As mentioned above, the main HE scheme that uses bit field LWE encodings is CGGI, but all the schemes use this encoding as part of their encoding because all schemes need to ensure there is space for noise growth during FHE operations.

Coefficient encoding for RLWE

One of the main benefits of RLWE-based FHE schemes is that you can pack lots of cleartexts into one plaintext. For this and all the other RLWE-based sections, the cleartext space is something like $(\mathbb{Z}/3\mathbb{Z})^{1024}$, vectors of small integers of some dimension. Many folks in the FHE world call $p$ the modulus of the cleartexts. And the plaintext space is something like $(\mathbb{Z}/2^{32}\mathbb{Z})[x] / (x^{1024} + 1)$, i.e., polynomials with large integer coefficients and a polynomial degree matching the cleartext space dimension. Many people call $q$ the coefficient modulus of the plaintext space.

In the coefficient encoding for RLWE, the bit-field encoding is applied to each input, and they are interpreted as coefficients of the polynomial.

This encoding scheme is also used in CGGI, in order to encrypt a lookup table as a polynomial for use in programmable bootstrapping. But it can also be used (though it is rarely used) in the BGV and BFV schemes, and rarely because both of those schemes use the polynomial multiplication to have semantic meaning. When you encode RLWE with the coefficient encoding, polynomial multiplication corresponds to a convolution of the underlying cleartexts, when most of the time those schemes prefer that multiplication corresponds to some kind of point-wise multiplication. The next encoding will handle that exactly.

Evaluation encoding for RLWE

The evaluation encoding borrows ideas from the Discrete Fourier Transform literature. See this post for a little bit more about why the DFT and polynomial multiplication are related.

The evaluation encoding encodes a vector $(v_1, \dots, v_N)$ by interpreting it as the output value of a polynomial $p(x)$ at some implicitly determined, but fixed points. These points are usually the roots of unity of $x^N + 1$ in the ring $\mathbb{Z}/q\mathbb{Z}$ (recall, the coefficients of the polynomial ring), and one computes this by picking $q$ in such a way that guarantees the multiplicative group $(\mathbb{Z}/q\mathbb{Z})^\times$ has a generator, which plays the analogous role of a $2N$-th root of unity that you would normally see in the complex numbers.

Once you have the root of unity, you can convert from the evaluation form to a coefficient form (which many schemes need for the encryption step) via an inverse number-theoretic transform (INTT). And then, of course, one must scale the coefficients using the bit field encoding to give room for noise. The coefficient form here is considered the “encoded” version of the cleartext vector.

Aside: one can perform the bit field encoding step before or after the INTT, since the bitfield encoding is equivalent to multiplying by a constant, and scaling a polynomial by a constant is equivalent to scaling its point evaluations by the same constant. Polynomial evaluation is a linear function of the coefficients.

The evaluation encoding is the most commonly used encoding used for both the BGV and BFV schemes. And then after encryption is done, one usually NTT’s back to the evaluation representation so that polynomial multiplication can be more quickly implemented as entry-wise multiplication.

Rounded canonical embeddings for RLWE

This embedding is for a family of FHE schemes related to the CKKS scheme, which focuses on approximate computation.

Here the cleartext space and plaintext spaces change slightly. The cleartext space is $\mathbb{C}^{N/2}$, and the plaintext space is again $(\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1)$ for some machine-word-sized power of two $q$. As you’ll note, the cleartext space is continuous but the plaintext space is discrete, so this necessitates some sort of approximation.

Aside: In the literature you will see the plaintext space described as just $(\mathbb{Z}[x] / (x^N + 1)$, and while this works in principle, in practice doing so requires multiprecision integer computations, and ends up being slower than the alternative, which is to use a residue number system before encoding, and treat the plaintext space as $(\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1)$. I’ll say more about RNS encoding in the next section.

The encoding is easier to understand by first describing the decoding step. Given a polynomial $f \in (\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1)$, there is a map called the canonical embedding $\varphi: (\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1) \to \mathbb{C}^N$ that evaluates $f$ at the odd powers of a primitive $2N$-th root of unity. I.e., letting $\omega = e^{2\pi i / 2N}$, we have

\[ \varphi(f) = (f(\omega), f(\omega^3), f(\omega^5), \dots, f(\omega^{2N-1})) \]

Aside: My algebraic number theory is limited (not much beyond a standard graduate course covering Galois theory), but this paper has some more background. My understanding is that we’re viewing the input polynomials as actually sitting inside the number field $\mathbb{Q}[x] / (x^N + 1)$ (or else $q$ is a prime and the original polynomial ring is a field), and the canonical embedding is a specialization of a more general theorem that says that for any subfield $K \subset \mathbb{C}$, the Galois group $K/\mathbb{Q}$ is exactly the set of injective homomorphisms $K \to \mathbb{C}$. I don’t recall exactly why these polynomial quotient rings count as subfields of $\mathbb{C}$, and I think it is not completely trivial (see, e.g., this stack exchange question).

As specialized to this setting, the canonical embedding is a scaled isometry for the 2-norm in both spaces. See this paper for a lot more detail on that. This is a critical aspect of the analysis for FHE, since operations in the ciphertext space add perturbations (noise) in the plaintext space, and it must be the case that those perturbations decode to similar perturbations so that one can use bounds on noise growth in the plaintext space to ensure the corresponding cleartexts stay within some desired precision.

Because polynomials commute with complex conjugation ($f(\overline{z}) = \overline{f(z)}$), and roots of unity satisfy $\overline{\omega^k} = \omega^{-k}$, this canonical embedding is duplicating information. We can throw out the second half of the roots of unity and retain the same structure (the scaling in the isometry changes as well). The result is that the canonical embedding is defined $\varphi: (\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1) \to \mathbb{C}^{N/2}$ via

\[ \varphi(f) = (f(\omega), f(\omega^3), \dots, f(\omega^{N-1})) \]

Since we’re again using the bit-field encoding to scale up the inputs for noise, the decoding is then defined by applying the canonical embedding, and then applying bit-field decoding (scaling down).

This decoding process embeds the discrete polynomial space inside $\mathbb{C}^{N/2}$ as a lattice, but input cleartexts need not lie on that lattice. And so we get to the encoding step, which involves rounding to a point on the lattice, then inverting the canonical embedding, then applying the bit-field encoding to scale up for noise.

Using commutativity, one can more specifically implement this by first inverting the canonical embedding (which again uses an FFT-like operation), the result of which is in $\mathbb{C}[x] / (x^N + 1)$, then apply the bit-field encoding to scale up, then round the coefficients to be in $\mathbb{Z}[x] / (x^N + 1)$. As mentioned above, if you want the coefficients to be machine-word-sized integers, you’ll have to design this all to ensure the outputs are sufficiently small, and then treat the output as $\mathbb{Z}/q\mathbb{Z}[x] / (x^N + 1)$. Or else use a RNS mechanism.

Residue Number System Pre-processing

In all of the above schemes, the cleartext spaces can be too small for practical use. In the CGGI scheme, for example, a typical cleartext space is only 3 or 4 bits. Some FHE schemes manage this by representing everything in terms of boolean circuits, and pack inputs to various boolean gates in those bits. That is what I’ve mainly focused on, but it has the downside of increasing the number of FHE operations, requiring deeper circuits and more noise management operations, which are slow. Other approaches try to use the numerical structure of the ciphertexts more deliberately, and Sunzi’s Theorem (colloquially known as the Chinese Remainder Theorem) comes to the rescue here.

There will be two “cleartext” spaces floating around here, one for the “original” message, which I’ll call the “original” cleartext space, and one for the Sunzi’s-theorem-decomposed message, which I’ll call the “RNS” cleartext space (RNS for residue number system).

The original cleartext space size $M$ must be a product of primes or co-prime integers $M = m_1 \cdot \dots \cdot m_r$, with each $m_i$ being small enough to be compatible with the desired FHE’s encoding. E.g., for a bit-field encoding, $M$ might be large, but each $m_i$ would have to be at most a 4-bit prime (which severely limits how much we can decompose).

Then, we represent a single original cleartext message $x \in \mathbb{Z}/M\mathbb{Z}$ via its residues mod each $m_i$. I.e., $x$ becomes $r$ different cleartexts $(x \mod m_1, x \mod m_2, \dots, x \mod m_r)$ in the RNS cleartext space. From there we can either encode all the cleartexts in a single plaintext—the various RLWE encodings support this so long as $r < N$ (or $N/2$ for the canonical embedding))—or else encode them as difference plaintexts. In the latter case, the executing program needs to ensure the plaintexts are jointly processed. E.g., any operation that happens to one must happen to all, to ensure that the residues stay in sync and can be reconstructed at the end.

And finally, after decoding we use the standard reconstruction algorithm from Sunzi’s theorem to rebuild the original cleartext from the decoded RNS cleartexts.

I’d like to write a bit more about RNS decompositions and Sunzi’s theorem in a future article, because it is critical to how many FHE schemes operate, and influences a lot of their designs. For example, I glazed over how inverting the canonical embedding works in detail, and it is related to Sunzi’s theorem in a deep way. So more on that in the future.

Negacyclic Polynomial Multiplication

In this article I’ll cover three techniques to compute special types of polynomial products that show up in lattice cryptography and fully homomorphic encryption. Namely, the negacyclic polynomial product, which is the product of two polynomials in the quotient ring $\mathbb{Z}[x] / (x^N + 1)$. As a precursor to the negacyclic product, we’ll cover the simpler cyclic product.

All of the Python code written for this article is on GitHub.

The DFT and Cyclic Polynomial Multiplication

A recent program gallery piece showed how single-variable polynomial multiplication could be implemented using the Discrete Fourier Transform (DFT). This boils down to two observations:

  1. The product of two polynomials $f, g$ can be computed via the convolution of the coefficients of $f$ and $g$.
  2. The Convolution Theorem, which says that the Fourier transform of a convolution of two signals $f, g$ is the point-wise product of the Fourier transforms of the two signals. (The same holds for the DFT)

This provides a much faster polynomial product operation than one could implement using the naïve polynomial multiplication algorithm (though see the last section for an implementation anyway). The DFT can be used to speed up large integer multiplication as well.

A caveat with normal polynomial multiplication is that one needs to pad the input coefficient lists with enough zeros so that the convolution doesn’t “wrap around.” That padding results in the output having length at least as large as the sum of the degrees of $f$ and $g$ (see the program gallery piece for more details).

If you don’t pad the polynomials, instead you get what’s called a cyclic polynomial product. More concretely, if the two input polynomials $f, g$ are represented by coefficient lists $(f_0, f_1, \dots, f_{N-1}), (g_0, g_1, \dots, g_{N-1})$ of length $N$ (implying the inputs are degree at most $N-1$, i.e., the lists may end in a tail of zeros), then the Fourier Transform technique computes

\[ f(x) \cdot g(x) \mod (x^N – 1) \]

This modulus is in the sense of a quotient ring $\mathbb{Z}[x] / (x^N – 1)$, where $(x^N – 1)$ denotes the ring ideal generated by $x^N-1$, i.e., all polynomials that are evenly divisible by $x^N – 1$. A particularly important interpretation of this quotient ring is achieved by interpreting the ideal generator $x^N – 1$ as an equation $x^N – 1 = 0$, also known as $x^N = 1$. To get the canonical ring element corresponding to any polynomial $h(x) \in \mathbb{Z}[x]$, you “set” $x^N = 1$ and reduce the polynomial until there are no more terms with degree bigger than $N-1$. For example, if $N=5$ then $x^{10} + x^6 – x^4 + x + 2 = -x^4 + 2x + 3$ (the $x^{10}$ becomes 1, and $x^6 = x$).

To prove the DFT product computes a product in this particular ring, note how the convolution theorem produces the following formula, where $\textup{fprod}(f, g)$ denotes the process of taking the Fourier transform of the two coefficient lists, multiplying them entrywise, and taking a (properly normalized) inverse FFT, and $\textup{fprod}(f, g)(j)$ is the $j$-th coefficient of the output polynomial:

\[ \textup{fprod}(f, g)(j) = \sum_{k=0}^{N-1} f_k g_{j-k \textup{ mod } N} \]

In words, the output polynomial coefficient $j$ equals the sum of all products of pairs of coefficients whose indices sum to $j$ when considered “wrapping around” $N$. Fixing $j=1$ as an example, $\textup{fprod}(f, g)(1) = f_0 g_1 + f_1g_0 + f_2 g_{N-1} + f_3 g_{N-2} + \dots$. This demonstrates the “set $x^N = 1$” interpretation above: the term $f_2 g_{N-1}$ corresponds to the product $f_2x^2 \cdot g_{N-1}x^{N-1}$, which contributes to the $x^1$ term of the polynomial product if and only if $x^{2 + N-1} = x$, if and only if $x^N = 1$.

To achieve this in code, we simply use the version of the code from the program gallery piece, but fix the size of the arrays given to numpy.fft.fft in advance. We will also, for simplicity, assume the $N$ one wishes to use is a power of 2. The resulting code is significantly simpler than the original program gallery code (we omit zero-padding to length $N$ for brevity).

import numpy
from numpy.fft import fft, ifft

def cyclic_polymul(p1, p2, N):
    """Multiply two integer polynomials modulo (x^N - 1).

    p1 and p2 are arrays of coefficients in degree-increasing order.
    """
    assert len(p1) == len(p2) == N
    product = fft(p1) * fft(p2)
    inverted = ifft(product)
    return numpy.round(numpy.real(inverted)).astype(numpy.int32)

As a side note, there’s nothing that stops this from working with polynomials that have real or complex coefficients, but so long as we use small magnitude integer coefficients and round at the end, I don’t have to worry about precision issues (hat tip to Brad Lucier for suggesting an excellent paper by Colin Percival, “Rapid multiplication modulo the sum and difference of highly composite numbers“, which covers these precision issues in detail).

Negacyclic polynomials, DFT with duplication

Now the kind of polynomial quotient ring that shows up in cryptography is critically not $\mathbb{Z}[x]/(x^N-1)$, because that ring has enough easy-to-reason-about structure that it can’t hide secrets. Instead, cryptographers use the ring $\mathbb{Z}[x]/(x^N+1)$ (the minus becomes a plus), which is believed to be more secure for cryptography—although I don’t have a great intuitive grasp on why.

The interpretation is similar here as before, except we “set” $x^N = -1$ instead of $x^N = 1$ in our reductions. Repeating the above example, if $N=5$ then $x^{10} + x^6 – x^4 + x + 2 = -x^4 + 3$ (the $x^{10}$ becomes $(-1)^2 = 1$, and $x^6 = -x$). It’s called negacyclic because as a term $x^k$ passes $k \geq N$, it cycles back to $x^0 = 1$, but with a sign flip.

The negacyclic polynomial multiplication can’t use the DFT without some special hacks. The first and simplest hack is to double the input lists with a negation. That is, starting from $f(x) \in \mathbb{Z}[x]/(x^N+1)$, we can define $f^*(x) = f(x) – x^Nf(x)$ in a different ring $\mathbb{Z}[x]/(x^{2N} – 1)$ (and similarly for $g^*$ and $g$).

Before seeing how this causes the DFT to (almost) compute a negacyclic polynomial product, some math wizardry. The ring $\mathbb{Z}[x]/(x^{2N} – 1)$ is special because it contains our negacyclic ring as a subring. Indeed, because the polynomial $x^{2N} – 1$ factors as $(x^N-1)(x^N+1)$, and because these two factors are coprime in $\mathbb{Z}[x]/(x^{2N} – 1)$, the Chinese remainder theorem (aka Sun-tzu’s theorem) generalizes to polynomial rings and says that any polynomial in $\mathbb{Z}[x]/(x^{2N} – 1)$ is uniquely determined by its remainders when divided by $(x^N-1)$ and $(x^N+1)$. Another way to say it is that the ring $\mathbb{Z}[x]/(x^{2N} – 1)$ factors as a direct product of the two rings $\mathbb{Z}[x]/(x^{N} – 1)$ and $\mathbb{Z}[x]/(x^{N} + 1)$.

Now mapping a polynomial $f(x)$ from the bigger ring $(x^{2N} – 1)$ to the smaller ring $(x^{N}+1)$ involves taking a remainder of $f(x)$ when dividing by $x^{N}+1$ (“setting” $x^N = -1$ and reducing). There are many possible preimage mappings, depending on what your goal is. In this case, we actually intentionally choose a non preimage mapping, because in general to compute a preimage requires solving a system of congruences in the larger polynomial ring. So instead we choose $f(x) \mapsto f^*(x) = f(x) – x^Nf(x) = -f(x)(x^N – 1)$, which maps back down to $2f(x)$ in $\mathbb{Z}[x]/(x^{N} + 1)$. This preimage mapping has a particularly nice structure, in that you build it by repeating the polynomial’s coefficients twice and flipping the sign of the second half. It’s easy to see that the product $f^*(x) g^*(x)$ maps down to $4f(x)g(x)$.

So if we properly account for these extra constant factors floating around, our strategy to perform negacyclic polynomial multiplication is to map $f$ and $g$ up to the larger ring as described, compute their cyclic product (modulo $x^{2N} – 1$) using the FFT, and then the result should be a degree $2N-1$ polynomial which can be reduced with one more modular reduction step to the right degree $N-1$ negacyclic product, i.e., setting $x^N = -1$, which materializes as taking the second half of the coefficients, flipping their signs, and adding them to the corresponding coefficients in the first half.

The code for this is:

def negacyclic_polymul_preimage_and_map_back(p1, p2):
    p1_preprocessed = numpy.concatenate([p1, -p1])
    p2_preprocessed = numpy.concatenate([p2, -p2])
    product = fft(p1_preprocessed) * fft(p2_preprocessed)
    inverted = ifft(product)
    rounded = numpy.round(numpy.real(inverted)).astype(p1.dtype)
    return (rounded[: p1.shape[0]] - rounded[p1.shape[0] :]) // 4

However, this chosen mapping hides another clever trick. The product of the two preimages has enough structure that we can “read” the result off without doing the full “set $x^N = -1$” reduction step. Mapping $f$ and $g$ up to $f^*, g^*$ and taking their product modulo $(x^{2N} – 1)$ gives

\[ \begin{aligned} f^*g^* &= -f(x^N-1) \cdot -g(x^N – 1) \\ &= fg (x^N-1)^2 \\ &= fg(x^{2N} – 2x^N + 1) \\ &= fg(2 – 2x^N) \\ &= 2(fg – x^Nfg) \end{aligned} \]

This has the same syntactical format as the original mapping $f \mapsto f – x^Nf$, with an extra factor of 2, and so its coefficients also have the form “repeat the coefficients and flip the sign of the second half” (times two). We can then do the “inverse mapping” by reading only the first half of the coefficients and dividing by 2.

def negacyclic_polymul_use_special_preimage(p1, p2):
    p1_preprocessed = numpy.concatenate([p1, -p1])
    p2_preprocessed = numpy.concatenate([p2, -p2])
    product = fft(p1_preprocessed) * fft(p2_preprocessed)
    inverted = ifft(product)
    rounded = numpy.round(0.5 * numpy.real(inverted)).astype(p1.dtype)
    return rounded[: p1.shape[0]]

Our chosen mapping $f \mapsto f-x^Nf$ is not particularly special, except that it uses a small number of pre and post-processing operations. For example, if you instead used the mapping $f \mapsto 2f + x^Nf$ (which would map back to $f$ exactly), then the FFT product would result in $5fg + 4x^Nfg$ in the larger ring. You can still read off the coefficients as before, but you’d have to divide by 5 instead of 2 (which, the superstitious would say, is harder). It seems that “double and negate” followed by “halve and take first half” is the least amount of pre/post processing possible.

Negacyclic polynomials with a “twist”

The previous section identified a nice mapping (or embedding) of the input polynomials into a larger ring. But studying that shows some symmetric structure in the FFT output. I.e., the coefficients of $f$ and $g$ are repeated twice, with some scaling factors. It also involves taking an FFT of two $2N$-dimensional vectors when we start from two $N$-dimensional vectors.

This sort of situation should make you think that we can do this more efficiently, either by using a smaller size FFT or by packing some data into the complex part of the input, and indeed we can do both.

[Aside: it’s well known that if all the entries of an FFT input are real, then the result also has symmetry that can be exploted for efficiency by reframing the problem as a size-N/2 FFT in some cases, and just removing half the FFT algorithm’s steps in other cases, see Wikipedia for more]

This technique was explained in Fast multiplication and its applications (pdf link) by Daniel Bernstein, a prominent cryptographer who specializes in cryptography performance, and whose work appears in widely-used standards like TLS, OpenSSH, and he designed a commonly used elliptic curve for cryptography.

[Aside: Bernstein cites this technique as using something called the “Tangent FFT (pdf link).” This is a drop-in FFT replacement he invented that is faster than previous best (split-radix FFT), and Bernstein uses it mainly to give a precise expression for the number of operations required to do the multiplication end to end. We will continue to use the numpy FFT implementation, since in this article I’m just focusing on how to express negacyclic multiplication in terms of the FFT. Also worth noting both the Tangent FFT and “Fast multiplication” papers frame their techniques—including FFT algorithm implementations!—in terms of polynomial ring factorizations and mappings. Be still, my beating cardioid.]

In terms of polynomial mappings, we start from the ring $\mathbb{R}[x] / (x^N + 1)$, where $N$ is a power of 2. We then pick a reversible mapping from $\mathbb{R}[x]/(x^N + 1) \to \mathbb{C}[x]/(x^{N/2} – 1)$ (note the field change from real to complex), apply the FFT to the image of the mapping, and reverse appropriately it at the end.

One such mapping takes two steps, first mapping $\mathbb{R}[x]/(x^N + 1) \to \mathbb{C}[x]/(x^{N/2} – i)$ and then from $\mathbb{C}[x]/(x^{N/2} – i) \to \mathbb{C}[x]/(x^{N/2} – 1)$. The first mapping is as easy as the last section, because $(x^N + 1) = (x^{N/2} + i) (x^{N/2} – i)$, and so we can just set $x^{N/2} = i$ and reduce the polynomial. This as the effect of making the second half of the polynomial’s coefficients become the complex part of the first half of the coefficients.

The second mapping is more nuanced, because we’re not just reducing via factorization. And we can’t just map $i \mapsto 1$ generically, because that would reduce complex numbers down to real values. Instead, we observe that (momentarily using an arbitrary degree $k$ instead of $N/2$), for any polynomial $f \in \mathbb{C}[x]$, the remainder of $f \mod x^k-i$ uniquely determines the remainder of $f \mod x^k – 1$ via the change of variables $x \mapsto \omega_{4k} x$, where $\omega_{4k}$ is a $4k$-th primitive root of unity $\omega_{4k} = e^{\frac{2 \pi i}{4k}}$. Spelling this out in more detail: if $f(x) \in \mathbb{C}[x]$ has remainder $f(x) = g(x) + h(x)(x^k – i)$ for some polynomial $h(x)$, then

\[ \begin{aligned} f(\omega_{4k}x) &= g(\omega_{4k}x) + h(\omega_{4k}x)((\omega_{4k}x)^{k} – i) \\ &= g(\omega_{4k}x) + h(\omega_{4k}x)(e^{\frac{\pi i}{2}} x^k – i) \\ &= g(\omega_{4k}x) + i h(\omega_{4k}x)(x^k – 1) \\ &= g(\omega_{4k}x) \mod (x^k – 1) \end{aligned} \]

Translating this back to $k=N/2$, the mapping from $\mathbb{C}[x]/(x^{N/2} – i) \to \mathbb{C}[x]/(x^{N/2} – 1)$ is $f(x) \mapsto f(\omega_{2N}x)$. And if $f = f_0 + f_1x + \dots + f_{N/2 – 1}x^{N/2 – 1}$, then the mapping involves multiplying each coefficient $f_k$ by $\omega_{2N}^k$.

When you view polynomials as if they were a simple vector of their coefficients, then this operation $f(x) \mapsto f(\omega_{k}x)$ looks like $(a_0, a_1, \dots, a_n) \mapsto (a_0, \omega_{k} a_1, \dots, \omega_k^n a_n)$. Bernstein calls the operation a twist of $\mathbb{C}^n$, which I mused about in this Mathstodon thread.

What’s most important here is that each of these transformations are invertible. The first because the top half coefficients end up in the complex parts of the polynomial, and the second because the mapping $f(x) \mapsto f(\omega_{2N}^{-1}x)$ is an inverse. Together, this makes the preprocessing and postprocessing exact inverses of each other. The code is then

def negacyclic_polymul_complex_twist(p1, p2):
    n = p2.shape[0]
    primitive_root = primitive_nth_root(2 * n)
    root_powers = primitive_root ** numpy.arange(n // 2)

    p1_preprocessed = (p1[: n // 2] + 1j * p1[n // 2 :]) * root_powers
    p2_preprocessed = (p2[: n // 2] + 1j * p2[n // 2 :]) * root_powers

    p1_ft = fft(p1_preprocessed)
    p2_ft = fft(p2_preprocessed)
    prod = p1_ft * p2_ft
    ifft_prod = ifft(prod)
    ifft_rotated = ifft_prod * primitive_root ** numpy.arange(0, -n // 2, -1)

    return numpy.round(
        numpy.concatenate([numpy.real(ifft_rotated), numpy.imag(ifft_rotated)])
    ).astype(p1.dtype)

And so, at the cost of a bit more pre- and postprocessing, we can negacyclically multiply two degree $N-1$ polynomials using an FFT of length $N/2$. In theory, no information is wasted and this is optimal.

And finally, a simple matrix multiplication

The last technique I wanted to share is not based on the FFT, but it’s another method for doing negacyclic polynomial multiplication that has come in handy in situations where I am unable to use FFTs. I call it the Toeplitz method, because one of the polynomials is converted to a Toeplitz matrix. Sometimes I hear it referred to as a circulant matrix technique, but due to the negacyclic sign flip, I don’t think it’s a fully accurate term.

The idea is to put the coefficients of one polynomial $f(x) = f_0 + f_1x + \dots + f_{N-1}x^{N-1}$ into a matrix as follows:

\[ \begin{pmatrix} f_0 & -f_{N-1} & \dots & -f_1 \\ f_1 & f_0 & \dots & -f_2 \\ \vdots & \vdots & \ddots & \vdots \\ f_{N-1} & f_{N-2} & \dots & f_0 \end{pmatrix} \]

The polynomial coefficients are written down in the first column unchanged, then in each subsequent column, the coefficients are cyclically shifted down one, and the term that wraps around the top has its sign flipped. When the second polynomial is treated as a vector of its coefficients, say, $g(x) = g_0 + g_1x + \dots + g_{N-1}x^{N-1}$, then the matrix-vector product computes their negacyclic product (as a vector of coefficients):

\[ \begin{pmatrix} f_0 & -f_{N-1} & \dots & -f_1 \\ f_1 & f_0 & \dots & -f_2 \\ \vdots & \vdots & \ddots & \vdots \\ f_{N-1} & f_{N-2} & \dots & f_0 \end{pmatrix} \begin{pmatrix} g_0 \\ g_1 \\ \vdots \\ g_{N-1} \end{pmatrix} \]

This works because each row $j$ corresponds to one output term $x^j$, and the cyclic shift for that row accounts for the degree-wrapping, with the sign flip accounting for the negacyclic part. (If there were no sign attached, this method could be used to compute a cyclic polynomial product).

The Python code for this is

def cylic_matrix(c: numpy.array) -> numpy.ndarray:
    """Generates a cyclic matrix with each row of the input shifted.

    For input: [1, 2, 3], generates the following matrix:

        [[1 2 3]
         [2 3 1]
         [3 1 2]]
    """
    c = numpy.asarray(c).ravel()
    a, b = numpy.ogrid[0 : len(c), 0 : -len(c) : -1]
    indx = a + b
    return c[indx]


def negacyclic_polymul_toeplitz(p1, p2):
    n = len(p1)

    # Generates a sign matrix with 1s below the diagonal and -1 above.
    up_tri = numpy.tril(numpy.ones((n, n), dtype=int), 0)
    low_tri = numpy.triu(numpy.ones((n, n), dtype=int), 1) * -1
    sign_matrix = up_tri + low_tri

    cyclic_matrix = cylic_matrix(p1)
    toeplitz_p1 = sign_matrix * cyclic_matrix
    return numpy.matmul(toeplitz_p1, p2)

Obviously on most hardware this would be less efficient than an FFT-based method (and there is some relationship between circulant matrices and Fourier Transforms, see Wikipedia). But in some cases—when the polynomials are small, or one of the two polynomials is static, or a particular hardware choice doesn’t handle FFTs with high-precision floats very well, or you want to take advantage of natural parallelism in the matrix-vector product—this method can be useful. It’s also simpler to reason about.

Until next time!

Polynomial Multiplication Using the FFT

Problem: Compute the product of two polynomials efficiently.

Solution:

import numpy
from numpy.fft import fft, ifft


def poly_mul(p1, p2):
    """Multiply two polynomials.

    p1 and p2 are arrays of coefficients in degree-increasing order.
    """
    deg1 = p1.shape[0] - 1
    deg2 = p1.shape[0] - 1
    # Would be 2*(deg1 + deg2) + 1, but the next-power-of-2 handles the +1
    total_num_pts = 2 * (deg1 + deg2)
    next_power_of_2 = 1 << (total_num_pts - 1).bit_length()

    ff_p1 = fft(numpy.pad(p1, (0, next_power_of_2 - p1.shape[0])))
    ff_p2 = fft(numpy.pad(p2, (0, next_power_of_2 - p2.shape[0])))
    product = ff_p1 * ff_p2
    inverted = ifft(product)
    rounded = numpy.round(numpy.real(inverted)).astype(numpy.int32)
    return numpy.trim_zeros(rounded, trim='b')

Discussion: The Fourier Transform has a lot of applications to science, and I’ve covered it on this blog before, see the Signal Processing section of Main Content. But it also has applications to fast computational mathematics.

The naive algorithm for multiplying two polynomials is the “grade-school” algorithm most readers will already be familiar with (see e.g., this page), but for large polynomials that algorithm is slow. It requires $O(n^2)$ arithmetic operations to multiply two polynomials of degree $n$.

This short tip shows a different approach, which is based on the idea of polynomial interpolation. As a side note, I show the basic theory of polynomial interpolation in chapter 2 of my book, A Programmer’s Introduction to Mathematics, along with an application to cryptography called “Secret Sharing.”

The core idea is that given $n+1$ distinct evaluations of a polynomial $p(x)$ (i.e., points $(x, p(x))$ with different $x$ inputs), you can reconstruct the coefficients of $p(x)$ exactly. And if you have two such point sets for two different polynomials $p(x), q(x)$, a valid point set of the product $(pq)(x)$ is the product of the points that have the same $x$ inputs.

\[ \begin{aligned} p(x) &= \{ (x_0, p(x_0)), (x_1, p(x_1)), \dots, (x_n, p(x_n)) \} \\ q(x) &= \{ (x_0, q(x_0)), (x_1, q(x_1)), \dots, (x_n, q(x_n)) \} \\ (pq)(x) &= \{ (x_0, p(x_0)q(x_0)), (x_1, p(x_1)q(x_1)), \dots, (x_n, p(x_n)q(x_n)) \} \end{aligned} \]

The above uses $=$ loosely to represent that the polynomial $p$ can be represented by the point set on the right hand side.

So given two polynomials $p(x), q(x)$ in their coefficient forms, one can first convert them to their point forms, multiply the points, and then reconstruct the resulting product.

The problem is that the two conversions, both to and from the coefficient form, are inefficient for arbitrary choices of points $x_0, \dots, x_n$. The trick comes from choosing special points, in such a way that the intermediate values computed in the conversion steps can be reused. This is where the Fourier Transform comes in: choose $x_0 = \omega_{N}$, the complex-N-th root of unity, and $x_k = \omega_N^k$ as its exponents. $N$ is required to be large enough so that $\omega_N$’s exponents have at least $2n+1$ distinct values required for interpolating a degree-at-most-$2n$ polynomial, and because we’re doing the Fourier Transform, it will naturally be “the next largest power of 2” bigger than the degree of the product polynomial.

Then one has to observe that, by its very formula, the Fourier Transform is exactly the evaluation of a polynomial at the powers of the $N$-th root of unity! In formulas: if $a = (a_0, \dots, a_{n-1})$ is a list of real numbers define $p_a(x) = a_0 + a_1x + \dots + a_{n-1}x^{n-1}$. Then $\mathscr{F}(a)(k)$, the Fourier Transform of $a$ at index $k$, is equal to $p_a(\omega_n^k)$. These notes by Denis Pankratov have more details showing that the Fourier Transform formula is a polynomial evaluation (see Section 3), and this YouTube video by Reducible also has a nice exposition. This interpretation of the FT as polynomial evaluation seems to inspire quite a few additional techniques for computing the Fourier Transform that I plan to write about in the future.

The last step is to reconstruct the product polynomial from the product of the two point sets, but because the Fourier Transform is an invertible function (and linear, too), the inverse Fourier Transform does exactly that: given a list of the $n$ evaluations of a polynomial at $\omega_n^k, k=0, \dots, n-1$, return the coefficients of the polynomial.

This all fits together into the code above:

  1. Pad the input coefficient lists with zeros, so that the lists are a power of 2 and at least 1 more than the degree of the output product polynomial.
  2. Compute the FFT of the two padded polynomials.
  3. Multiply the result pointwise.
  4. Compute the IFFT of the product.
  5. Round the resulting (complex) values back to integers.

Hey, wait a minute! What about precision issues?

They are a problem when you have large numbers or large polynomials, because the intermediate values in steps 2-4 can lose precision due to the floating point math involved. This short note of Richard Fateman discusses some of those issues, and two paths forward include: deal with it somehow, or use an integer-exact analogue called the Number Theoretic Transform (which itself has issues I’ll discuss in a future, longer article).

Postscript: I’m not sure how widely this technique is used. It appears the NTL library uses the polynomial version of Karatsuba multiplication instead (though it implements FFT elsewhere). However, I know for sure that much software involved in doing fully homomorphic encryption rely on the FFT for performance reasons, and the ones that don’t instead use the NTT.

Searching for RH Counterexamples — Search Strategies

We’re glibly searching for counterexamples to the Riemann Hypothesis, to trick you into learning about software engineering principles. In the first two articles we configured a testing framework and showed how to hide implementation choices behind an interface. Next, we’ll improve the algorithm’s core routine. As before, I’ll link to specific git commits in the final code repository to show how the project evolves.

Superabundant numbers

A superabundant number $ n$ is one which has “maximal relative divisor sums” in the following sense: for all $ m < n$,

$ \displaystyle \frac{\sigma(m)}{m} < \frac{\sigma(n)}{n}$

where $ \sigma(n)$ is the sum of the divisors of $ n$.

Erdős and Alaoglu proved in 1944 (“On highly composite and similar numbers“) that superabundant numbers have a specific prime decomposition, in which all initial primes occur with non-increasing exponents

$ \displaystyle n = \prod_{i=1}^k (p_i)^{a_i},$

where $ p_i$ is the i-th prime, and $ a_1 \geq a_2 \geq \dots \geq a_k \geq 1$. With two exceptions ($ n=4, 36$), $ a_k = 1$.

Here’s a rough justification for why superabundant numbers should have a decomposition like this. If you want a number with many divisors (compared to the size of the number), you want to pack as many combinations of small primes into the decomposition of your number as possible. Using all 2’s leads to not enough combinations—only $ m+1$ divisors for $ 2^m$—but using 2′ and 3’s you get $ (r+1)(s+1)$ for $ 2^r3^s$. Using more 3’s trades off a larger number $ n$ for the benefit of a larger $ \sigma(n)$ (up to $ r=s$). The balance between getting more distinct factor combinations and a larger $ n$ favors packing the primes in there.

Though numbers of this form are not necessarily superabundant, this gives us an enumeration strategy better than trying all numbers. Enumerate over tuples corresponding to the exponents of the prime decomposition (non-increasing lists of integers), and save those primes to make it easier to compute the divisor sum.

Non-increasing lists of integers can be enumerated in the order of their sum, and for each sum $ N$, the set of non-increasing lists of integers summing to $ N$ is called the partitions of $ N$. There is a simple algorithm to compute them, implemented in this commit. Note this does not enumerate them in order of the magnitude of the number $ \prod_{i=1}^k (p_i)^{a_i}$.

The implementation for the prime-factorization-based divisor sum computation is in this commit. In addition, to show some alternative methods of testing, we used the hypothesis library to autogenerate tests. It chooses a random (limited size) prime factorization, and compares the prime-factorization-based algorithm to the naive algorithm. There’s a bit of setup code involved, but as a result we get dozens of tests and more confidence it’s right.

Search Strategies

We now have two search strategies over the space of natural numbers, though one is obviously better. We may come up with a third, so it makes sense to separate the search strategy from the main application by an interface. Generally, if you have a hard-coded implementation, and you realize that you need to change it in a significant way, that’s a good opportunity to extract it and hide it behind an interface.

A good interface choice is a bit tricky here, however. In the original implementation, we could say, “process the batch of numbers (search for counterexamples) between 1 and 2 million.” When that batch is saved to the database, we would start on the next batch, and all the batches would be the same size, so (ignoring that computing $ \sigma(n)$ the old way takes longer as $ n$ grows) each batch required roughly the same time to run.

The new search strategy doesn’t have a sensible way to do this. You can’t say “start processing from K” because we don’t know how to easily get from K to the parameter of the enumeration corresponding to K (if one exists). This is partly because our enumeration isn’t monotonic increasing ($ 2^1 3^1 5^1 = 30$ comes before $ 2^4 = 16$). And partly because even if we did have a scheme, it would almost certainly require us to compute a prime factorization, which is slow. It would be better if we could save the data from the latest step of the enumeration, and load it up when starting the next batch of the search.

This scheme suggests a nicely generic interface for stopping and restarting a search from a particular spot. The definition of a “spot,” and how to start searching from that spot, are what’s hidden by the interface. Here’s a first pass.

SearchState = TypeVar('SearchState')

class SearchStrategy(ABC):
    @abstractmethod
    def starting_from(self, search_state: SearchState) -> SearchStrategy:
        '''Reset the search strategy to search from a given state.'''
        pass

    @abstractmethod
    def search_state(self) -> SearchState:
        '''Get an object describing the current state of the enumeration.'''
        pass

    @abstractmethod
    def next_batch(self, batch_size: int) -> List[RiemannDivisorSum]:
        '''Process the next batch of Riemann Divisor Sums'''
        pass

Note that SearchState is defined as a generic type variable because we cannot say anything about its structure yet. The implementation class is responsible for defining what constitutes a search state, and getting the search strategy back to the correct step of the enumeration given the search state as input. Later I realized we do need some structure on the SearchState—the ability to serialize it for storage in the database—so we elevated it to an interface later.

Also note that we are making SearchStrategy own the job of computing the Riemann divisor sums. This is because the enumeration details and the algorithm to compute the divisor sums are now coupled. For the exhaustive search strategy it was “integers n, naively loop over smaller divisors.” In the new strategy it’s “prime factorizations, prime-factorization-based divisor sum.” We could decouple this, but there is little reason to now because the implementations are still in 1-1 correspondence.

This commit implements the old search strategy in terms of this interface, and this commit implements the new search strategy. In the latter, I use pytest.parameterize to test against the interface and parameterize over the implementations.

The last needed bit is the ability to store and recover the search state in between executions of the main program. This requires a second database table. The minimal thing we could do is just store and update a single row for each search strategy, providing the search state as of the last time the program was run and stopped. This would do, but in my opinion an append-only log is a better design for such a table. That is, each batch computed will have a record containing the timestamp the batch started and finished, along with the starting and ending search state. We can use the largest timestamp for a given search strategy to pick up where we left off across program runs.

One can imagine this being the basis for an application like folding@home or the BOINC family of projects, where a database stores chunks of a larger computation (ranges of a search space), clients can request chunks to complete, and they are assembled into a complete database. In this case we might want to associate the chunk metadata with the computed results (say, via a foreign key). That would require a bit of work from what we have now, but note that the interfaces would remain reusable for this. For now, we will just incorporate the basic table approach. It is completed in this pull request, and tying it into the main search routine is done in this commit.

However, when running it with the superabundant search strategy, we immediately run into a problem. Superabundant numbers grow too fast, and within a few small batches of size 100 we quickly exceed the 64 bits available to numba and sqlite to store the relevant data.

>>> fac = partition_to_prime_factorization(partitions_of_n(16)[167])
>>> fac2 = [p**d for (p, d) in fac]
>>> fac2
[16, 81, 625, 2401, 11, 13, 17, 19, 23, 29, 31, 37]
>>> math.log2(reduce(lambda x,y: x*y, fac2))
65.89743638933722

Running populate_database.py results in the error

$ python -m riemann.populate_database db.sqlite3 SuperabundantSearchStrategy 100
Searching with strategy SuperabundantSearchStrategy
Starting from search state SuperabundantEnumerationIndex(level=1, index_in_level=0)
Computed [1,0, 10,4] in 0:00:03.618798
Computed [10,4, 12,6] in 0:00:00.031451
Computed [12,6, 13,29] in 0:00:00.031518
Computed [13,29, 14,28] in 0:00:00.041464
Computed [14,28, 14,128] in 0:00:00.041674
Computed [14,128, 15,93] in 0:00:00.034419
...
OverflowError: Python int too large to convert to SQLite INTEGER

We’ll see what we can do about this in a future article, but meanwhile we do get some additional divisor sums for these large numbers, and 10080 is still the best.

sqlite> select n, witness_value 
from RiemannDivisorSums 
where witness_value > 1.7 and n > 5040 
order by witness_value desc 
limit 10;

10080|1.7558143389253
55440|1.75124651488749
27720|1.74253672381383
7560|1.73991651920276
15120|1.73855867428903
160626866400|1.73744669257158
321253732800|1.73706925385011
110880|1.73484901030336
6983776800|1.73417642212953
720720|1.73306535623807