Negacyclic Polynomial Multiplication || Math ∩ Programming

In this article I’ll cover three techniques to compute special types of polynomial products that show up in lattice cryptography and fully homomorphic encryption. Namely, the negacyclic polynomial product, which is the product of two polynomials in the quotient ring $\mathbb{Z}[x] / (x^N + 1)$. As a precursor to the negacyclic product, we’ll cover the simpler cyclic product.

All of the Python code written for this article is on GitHub.

The DFT and Cyclic Polynomial Multiplication

A recent program gallery piece showed how single-variable polynomial multiplication could be implemented using the Discrete Fourier Transform (DFT). This boils down to two observations:

The product of two polynomials $f, g$ can be computed via the convolution of the coefficients of $f$ and $g$.
The Convolution Theorem, which says that the Fourier transform of a convolution of two signals $f, g$ is the point-wise product of the Fourier transforms of the two signals. (The same holds for the DFT)

This provides a much faster polynomial product operation than one could implement using the naïve polynomial multiplication algorithm (though see the last section for an implementation anyway). The DFT can be used to speed up large integer multiplication as well.

A caveat with normal polynomial multiplication is that one needs to pad the input coefficient lists with enough zeros so that the convolution doesn’t “wrap around.” That padding results in the output having length at least as large as the sum of the degrees of $f$ and $g$ (see the program gallery piece for more details).

If you don’t pad the polynomials, instead you get what’s called a cyclic polynomial product. More concretely, if the two input polynomials $f, g$ are represented by coefficient lists $(f_0, f_1, \dots, f_{N-1}), (g_0, g_1, \dots, g_{N-1})$ of length $N$ (implying the inputs are degree at most $N-1$, i.e., the lists may end in a tail of zeros), then the Fourier Transform technique computes

\[ f(x) \cdot g(x) \mod (x^N – 1) \]

This modulus is in the sense of a quotient ring $\mathbb{Z}[x] / (x^N – 1)$, where $(x^N – 1)$ denotes the ring ideal generated by $x^N-1$, i.e., all polynomials that are evenly divisible by $x^N – 1$. A particularly important interpretation of this quotient ring is achieved by interpreting the ideal generator $x^N – 1$ as an equation $x^N – 1 = 0$, also known as $x^N = 1$. To get the canonical ring element corresponding to any polynomial $h(x) \in \mathbb{Z}[x]$, you “set” $x^N = 1$ and reduce the polynomial until there are no more terms with degree bigger than $N-1$. For example, if $N=5$ then $x^{10} + x^6 – x^4 + x + 2 = -x^4 + 2x + 3$ (the $x^{10}$ becomes 1, and $x^6 = x$).

To prove the DFT product computes a product in this particular ring, note how the convolution theorem produces the following formula, where $\textup{fprod}(f, g)$ denotes the process of taking the Fourier transform of the two coefficient lists, multiplying them entrywise, and taking a (properly normalized) inverse FFT, and $\textup{fprod}(f, g)(j)$ is the $j$-th coefficient of the output polynomial:

\[ \textup{fprod}(f, g)(j) = \sum_{k=0}^{N-1} f_k g_{j-k \textup{ mod } N} \]

In words, the output polynomial coefficient $j$ equals the sum of all products of pairs of coefficients whose indices sum to $j$ when considered “wrapping around” $N$. Fixing $j=1$ as an example, $\textup{fprod}(f, g)(1) = f_0 g_1 + f_1g_0 + f_2 g_{N-1} + f_3 g_{N-2} + \dots$. This demonstrates the “set $x^N = 1$” interpretation above: the term $f_2 g_{N-1}$ corresponds to the product $f_2x^2 \cdot g_{N-1}x^{N-1}$, which contributes to the $x^1$ term of the polynomial product if and only if $x^{2 + N-1} = x$, if and only if $x^N = 1$.

To achieve this in code, we simply use the version of the code from the program gallery piece, but fix the size of the arrays given to numpy.fft.fft in advance. We will also, for simplicity, assume the $N$ one wishes to use is a power of 2. The resulting code is significantly simpler than the original program gallery code (we omit zero-padding to length $N$ for brevity).

import numpy
from numpy.fft import fft, ifft

def cyclic_polymul(p1, p2, N):
    """Multiply two integer polynomials modulo (x^N - 1).

    p1 and p2 are arrays of coefficients in degree-increasing order.
    """
    assert len(p1) == len(p2) == N
    product = fft(p1) * fft(p2)
    inverted = ifft(product)
    return numpy.round(numpy.real(inverted)).astype(numpy.int32)

As a side note, there’s nothing that stops this from working with polynomials that have real or complex coefficients, but so long as we use small magnitude integer coefficients and round at the end, I don’t have to worry about precision issues (hat tip to Brad Lucier for suggesting an excellent paper by Colin Percival, “Rapid multiplication modulo the sum and difference of highly composite numbers“, which covers these precision issues in detail).

Negacyclic polynomials, DFT with duplication

Now the kind of polynomial quotient ring that shows up in cryptography is critically not $\mathbb{Z}[x]/(x^N-1)$, because that ring has enough easy-to-reason-about structure that it can’t hide secrets. Instead, cryptographers use the ring $\mathbb{Z}[x]/(x^N+1)$ (the minus becomes a plus), which is believed to be more secure for cryptography—although I don’t have a great intuitive grasp on why.

The interpretation is similar here as before, except we “set” $x^N = -1$ instead of $x^N = 1$ in our reductions. Repeating the above example, if $N=5$ then $x^{10} + x^6 – x^4 + x + 2 = -x^4 + 3$ (the $x^{10}$ becomes $(-1)^2 = 1$, and $x^6 = -x$). It’s called negacyclic because as a term $x^k$ passes $k \geq N$, it cycles back to $x^0 = 1$, but with a sign flip.

The negacyclic polynomial multiplication can’t use the DFT without some special hacks. The first and simplest hack is to double the input lists with a negation. That is, starting from $f(x) \in \mathbb{Z}[x]/(x^N+1)$, we can define $f^*(x) = f(x) – x^Nf(x)$ in a different ring $\mathbb{Z}[x]/(x^{2N} – 1)$ (and similarly for $g^*$ and $g$).

Before seeing how this causes the DFT to (almost) compute a negacyclic polynomial product, some math wizardry. The ring $\mathbb{Z}[x]/(x^{2N} – 1)$ is special because it contains our negacyclic ring as a subring. Indeed, because the polynomial $x^{2N} – 1$ factors as $(x^N-1)(x^N+1)$, and because these two factors are coprime in $\mathbb{Z}[x]/(x^{2N} – 1)$, the Chinese remainder theorem (aka Sun-tzu’s theorem) generalizes to polynomial rings and says that any polynomial in $\mathbb{Z}[x]/(x^{2N} – 1)$ is uniquely determined by its remainders when divided by $(x^N-1)$ and $(x^N+1)$. Another way to say it is that the ring $\mathbb{Z}[x]/(x^{2N} – 1)$ factors as a direct product of the two rings $\mathbb{Z}[x]/(x^{N} – 1)$ and $\mathbb{Z}[x]/(x^{N} + 1)$.

Now mapping a polynomial $f(x)$ from the bigger ring $(x^{2N} – 1)$ to the smaller ring $(x^{N}+1)$ involves taking a remainder of $f(x)$ when dividing by $x^{N}+1$ (“setting” $x^N = -1$ and reducing). There are many possible preimage mappings, depending on what your goal is. In this case, we actually intentionally choose a non preimage mapping, because in general to compute a preimage requires solving a system of congruences in the larger polynomial ring. So instead we choose $f(x) \mapsto f^*(x) = f(x) – x^Nf(x) = -f(x)(x^N – 1)$, which maps back down to $2f(x)$ in $\mathbb{Z}[x]/(x^{N} + 1)$. This preimage mapping has a particularly nice structure, in that you build it by repeating the polynomial’s coefficients twice and flipping the sign of the second half. It’s easy to see that the product $f^*(x) g^*(x)$ maps down to $4f(x)g(x)$.

So if we properly account for these extra constant factors floating around, our strategy to perform negacyclic polynomial multiplication is to map $f$ and $g$ up to the larger ring as described, compute their cyclic product (modulo $x^{2N} – 1$) using the FFT, and then the result should be a degree $2N-1$ polynomial which can be reduced with one more modular reduction step to the right degree $N-1$ negacyclic product, i.e., setting $x^N = -1$, which materializes as taking the second half of the coefficients, flipping their signs, and adding them to the corresponding coefficients in the first half.

The code for this is:

def negacyclic_polymul_preimage_and_map_back(p1, p2):
    p1_preprocessed = numpy.concatenate([p1, -p1])
    p2_preprocessed = numpy.concatenate([p2, -p2])
    product = fft(p1_preprocessed) * fft(p2_preprocessed)
    inverted = ifft(product)
    rounded = numpy.round(numpy.real(inverted)).astype(p1.dtype)
    return (rounded[: p1.shape[0]] - rounded[p1.shape[0] :]) // 4

However, this chosen mapping hides another clever trick. The product of the two preimages has enough structure that we can “read” the result off without doing the full “set $x^N = -1$” reduction step. Mapping $f$ and $g$ up to $f^*, g^*$ and taking their product modulo $(x^{2N} – 1)$ gives

\[ \begin{aligned} f^\*g^\* &= -f(x^N-1) \cdot -g(x^N – 1) \\\ &= fg (x^N-1)^2 \\\ &= fg(x^{2N} – 2x^N + 1) \\\ &= fg(2 – 2x^N) \\\ &= 2(fg – x^Nfg) \end{aligned} \]

This has the same syntactical format as the original mapping $f \mapsto f – x^Nf$, with an extra factor of 2, and so its coefficients also have the form “repeat the coefficients and flip the sign of the second half” (times two). We can then do the “inverse mapping” by reading only the first half of the coefficients and dividing by 2.

def negacyclic_polymul_use_special_preimage(p1, p2):
    p1_preprocessed = numpy.concatenate([p1, -p1])
    p2_preprocessed = numpy.concatenate([p2, -p2])
    product = fft(p1_preprocessed) * fft(p2_preprocessed)
    inverted = ifft(product)
    rounded = numpy.round(0.5 * numpy.real(inverted)).astype(p1.dtype)
    return rounded[: p1.shape[0]]

Our chosen mapping $f \mapsto f-x^Nf$ is not particularly special, except that it uses a small number of pre and post-processing operations. For example, if you instead used the mapping $f \mapsto 2f + x^Nf$ (which would map back to $f$ exactly), then the FFT product would result in $5fg + 4x^Nfg$ in the larger ring. You can still read off the coefficients as before, but you’d have to divide by 5 instead of 2 (which, the superstitious would say, is harder). It seems that “double and negate” followed by “halve and take first half” is the least amount of pre/post processing possible.

Negacyclic polynomials with a “twist”

The previous section identified a nice mapping (or embedding) of the input polynomials into a larger ring. But studying that shows some symmetric structure in the FFT output. I.e., the coefficients of $f$ and $g$ are repeated twice, with some scaling factors. It also involves taking an FFT of two $2N$-dimensional vectors when we start from two $N$-dimensional vectors.

This sort of situation should make you think that we can do this more efficiently, either by using a smaller size FFT or by packing some data into the complex part of the input, and indeed we can do both.

[Aside: it’s well known that if all the entries of an FFT input are real, then the result also has symmetry that can be exploted for efficiency by reframing the problem as a size-N/2 FFT in some cases, and just removing half the FFT algorithm’s steps in other cases, see Wikipedia for more]

This technique was explained in Fast multiplication and its applications (pdf link) by Daniel Bernstein, a prominent cryptographer who specializes in cryptography performance, and whose work appears in widely-used standards like TLS, OpenSSH, and he designed a commonly used elliptic curve for cryptography.

[Aside: Bernstein cites this technique as using something called the “Tangent FFT (pdf link).” This is a drop-in FFT replacement he invented that is faster than previous best (split-radix FFT), and Bernstein uses it mainly to give a precise expression for the number of operations required to do the multiplication end to end. We will continue to use the numpy FFT implementation, since in this article I’m just focusing on how to express negacyclic multiplication in terms of the FFT. Also worth noting both the Tangent FFT and “Fast multiplication” papers frame their techniques—including FFT algorithm implementations!—in terms of polynomial ring factorizations and mappings. Be still, my beating cardioid.]

In terms of polynomial mappings, we start from the ring $\mathbb{R}[x] / (x^N + 1)$, where $N$ is a power of 2. We then pick a reversible mapping from $\mathbb{R}[x]/(x^N + 1) \to \mathbb{C}[x]/(x^{N/2} – 1)$ (note the field change from real to complex), apply the FFT to the image of the mapping, and reverse appropriately it at the end.

One such mapping takes two steps, first mapping $\mathbb{R}[x]/(x^N + 1) \to \mathbb{C}[x]/(x^{N/2} – i)$ and then from $\mathbb{C}[x]/(x^{N/2} – i) \to \mathbb{C}[x]/(x^{N/2} – 1)$. The first mapping is as easy as the last section, because $(x^N + 1) = (x^{N/2} + i) (x^{N/2} – i)$, and so we can just set $x^{N/2} = i$ and reduce the polynomial. This as the effect of making the second half of the polynomial’s coefficients become the complex part of the first half of the coefficients.

The second mapping is more nuanced, because we’re not just reducing via factorization. And we can’t just map $i \mapsto 1$ generically, because that would reduce complex numbers down to real values. Instead, we observe that (momentarily using an arbitrary degree $k$ instead of $N/2$), for any polynomial $f \in \mathbb{C}[x]$, the remainder of $f \mod x^k-i$ uniquely determines the remainder of $f \mod x^k – 1$ via the change of variables $x \mapsto \omega_{4k} x$, where $\omega_{4k}$ is a $4k$-th primitive root of unity $\omega_{4k} = e^{\frac{2 \pi i}{4k}}$. Spelling this out in more detail: if $f(x) \in \mathbb{C}[x]$ has remainder $f(x) = g(x) + h(x)(x^k – i)$ for some polynomial $h(x)$, then

\[ \begin{aligned} f(\omega_{4k}x) &= g(\omega_{4k}x) + h(\omega_{4k}x)((\omega_{4k}x)^{k} – i) \\\ &= g(\omega_{4k}x) + h(\omega_{4k}x)(e^{\frac{\pi i}{2}} x^k – i) \\\ &= g(\omega_{4k}x) + i h(\omega_{4k}x)(x^k – 1) \\\ &= g(\omega_{4k}x) \mod (x^k – 1) \end{aligned} \]

Translating this back to $k=N/2$, the mapping from $\mathbb{C}[x]/(x^{N/2} – i) \to \mathbb{C}[x]/(x^{N/2} – 1)$ is $f(x) \mapsto f(\omega_{2N}x)$. And if $f = f_0 + f_1x + \dots + f_{N/2 – 1}x^{N/2 – 1}$, then the mapping involves multiplying each coefficient $f_k$ by $\omega_{2N}^k$.

When you view polynomials as if they were a simple vector of their coefficients, then this operation $f(x) \mapsto f(\omega_{k}x)$ looks like $(a_0, a_1, \dots, a_n) \mapsto (a_0, \omega_{k} a_1, \dots, \omega_k^n a_n)$. Bernstein calls the operation a twist of $\mathbb{C}^n$, which I mused about in this Mathstodon thread.

What’s most important here is that each of these transformations are invertible. The first because the top half coefficients end up in the complex parts of the polynomial, and the second because the mapping $f(x) \mapsto f(\omega_{2N}^{-1}x)$ is an inverse. Together, this makes the preprocessing and postprocessing exact inverses of each other. The code is then

def negacyclic_polymul_complex_twist(p1, p2):
    n = p2.shape[0]
    primitive_root = primitive_nth_root(2 * n)
    root_powers = primitive_root ** numpy.arange(n // 2)

    p1_preprocessed = (p1[: n // 2] + 1j * p1[n // 2 :]) * root_powers
    p2_preprocessed = (p2[: n // 2] + 1j * p2[n // 2 :]) * root_powers

    p1_ft = fft(p1_preprocessed)
    p2_ft = fft(p2_preprocessed)
    prod = p1_ft * p2_ft
    ifft_prod = ifft(prod)
    ifft_rotated = ifft_prod * primitive_root ** numpy.arange(0, -n // 2, -1)

    return numpy.round(
        numpy.concatenate([numpy.real(ifft_rotated), numpy.imag(ifft_rotated)])
    ).astype(p1.dtype)

And so, at the cost of a bit more pre- and postprocessing, we can negacyclically multiply two degree $N-1$ polynomials using an FFT of length $N/2$. In theory, no information is wasted and this is optimal.

And finally, a simple matrix multiplication

The last technique I wanted to share is not based on the FFT, but it’s another method for doing negacyclic polynomial multiplication that has come in handy in situations where I am unable to use FFTs. I call it the Toeplitz method, because one of the polynomials is converted to a Toeplitz matrix. Sometimes I hear it referred to as a circulant matrix technique, but due to the negacyclic sign flip, I don’t think it’s a fully accurate term.

The idea is to put the coefficients of one polynomial $f(x) = f_0 + f_1x + \dots + f_{N-1}x^{N-1}$ into a matrix as follows:

\[ \begin{pmatrix} f_0 & -f_{N-1} & \dots & -f_1 \\\ f_1 & f_0 & \dots & -f_2 \\\ \vdots & \vdots & \ddots & \vdots \\\ f_{N-1} & f_{N-2} & \dots & f_0 \end{pmatrix} \]

The polynomial coefficients are written down in the first column unchanged, then in each subsequent column, the coefficients are cyclically shifted down one, and the term that wraps around the top has its sign flipped. When the second polynomial is treated as a vector of its coefficients, say, $g(x) = g_0 + g_1x + \dots + g_{N-1}x^{N-1}$, then the matrix-vector product computes their negacyclic product (as a vector of coefficients):

\[ \begin{pmatrix} f_0 & -f_{N-1} & \dots & -f_1 \\\ f_1 & f_0 & \dots & -f_2 \\\ \vdots & \vdots & \ddots & \vdots \\\ f_{N-1} & f_{N-2} & \dots & f_0 \end{pmatrix} \begin{pmatrix} g_0 \\\ g_1 \\\ \vdots \\\ g_{N-1} \end{pmatrix} \]

This works because each row $j$ corresponds to one output term $x^j$, and the cyclic shift for that row accounts for the degree-wrapping, with the sign flip accounting for the negacyclic part. (If there were no sign attached, this method could be used to compute a cyclic polynomial product).

The Python code for this is

def cylic_matrix(c: numpy.array) -> numpy.ndarray:
    """Generates a cyclic matrix with each row of the input shifted.

    For input: [1, 2, 3], generates the following matrix:

        [[1 2 3]
         [2 3 1]
         [3 1 2]]
    """
    c = numpy.asarray(c).ravel()
    a, b = numpy.ogrid[0 : len(c), 0 : -len(c) : -1]
    indx = a + b
    return c[indx]


def negacyclic_polymul_toeplitz(p1, p2):
    n = len(p1)

    # Generates a sign matrix with 1s below the diagonal and -1 above.
    up_tri = numpy.tril(numpy.ones((n, n), dtype=int), 0)
    low_tri = numpy.triu(numpy.ones((n, n), dtype=int), 1) * -1
    sign_matrix = up_tri + low_tri

    cyclic_matrix = cylic_matrix(p1)
    toeplitz_p1 = sign_matrix * cyclic_matrix
    return numpy.matmul(toeplitz_p1, p2)

Obviously on most hardware this would be less efficient than an FFT-based method (and there is some relationship between circulant matrices and Fourier Transforms, see Wikipedia). But in some cases—when the polynomials are small, or one of the two polynomials is static, or a particular hardware choice doesn’t handle FFTs with high-precision floats very well, or you want to take advantage of natural parallelism in the matrix-vector product—this method can be useful. It’s also simpler to reason about.

Until next time!

Want to respond? Send me an email, post a webmention, or find me elsewhere on the internet.

DOI: https://doi.org/10.59350/gxhag-cdd26