Encoding Schemes in FHE

In cryptography, we need a distinction between a cleartext and a plaintext. A cleartext is a message in its natural form. A plaintext is a cleartext that is represented in a specific way to prepare it for encryption in a specific scheme. The process of taking a cleartext and turning it into a plaintext is called encoding, and the reverse is called decoding.

In homomorphic encryption, the distinction matters. Cleartexts are generally all integers, though the bit width of allowed integers can be restricted (e.g., 4-bit integers). On the other hand, each homomorphic encryption (HE) scheme has its own special process for encoding and decoding, and since HEIR hopes to support all HE schemes, I set about cataloguing the different encoding schemes. This article is my notes on what they are.

If you’re not familiar with the terms Learning With Errors LWE and and its ring variant RLWE, then you may want to read up on those Wikipedia pages first. These problems are fundamental to most FHE schemes.

Bit field encoding for LWE

A bit field encoding simply places the bits of a small integer cleartext within a larger integer plaintext. An example might be a 3-bit integer cleartext placed in the top-most bits of a 32-bit integer plaintext. This is necessary because operations on FHE ciphertexts accumulate noise, which pollutes the lower-order bits of the corresponding plaintext (BGV is a special case that inverts this, see below).

Many papers in the literature will describe “placing in the top-most bits” as “applying a scaling factor,” which essentially means pick a power of 2 $\Delta$ and encode an integer $x$ as $\Delta x$. However, by using a scaling factor it’s not immediately clear if all of the top-most bits of the plaintext are safe to use.

To wit, the CGGI (aka TFHE) scheme has a slightly more specific encoding because it requires the topmost bit to be zero in order to use its fancy programmable bootstrapping feature. Don’t worry if you don’t know what it means, but just know that in this scheme the top-most bit is set aside.

This encoding is hence most generally described by specifying a starting bit and a bit width for the location of the cleartext in a plaintext integer. The code would look like

plaintext = message << (plaintext_bit_width - starting_bit - cleartext_bit_width)

There are additional steps that come into play when one wants to encode a decimal value in LWE, which can be done with fixed point representations.

As mentioned above, the main HE scheme that uses bit field LWE encodings is CGGI, but all the schemes use this encoding as part of their encoding because all schemes need to ensure there is space for noise growth during FHE operations.

Coefficient encoding for RLWE

One of the main benefits of RLWE-based FHE schemes is that you can pack lots of cleartexts into one plaintext. For this and all the other RLWE-based sections, the cleartext space is something like $(\mathbb{Z}/3\mathbb{Z})^{1024}$, vectors of small integers of some dimension. Many folks in the FHE world call $p$ the modulus of the cleartexts. And the plaintext space is something like $(\mathbb{Z}/2^{32}\mathbb{Z})[x] / (x^{1024} + 1)$, i.e., polynomials with large integer coefficients and a polynomial degree matching the cleartext space dimension. Many people call $q$ the coefficient modulus of the plaintext space.

In the coefficient encoding for RLWE, the bit-field encoding is applied to each input, and they are interpreted as coefficients of the polynomial.

This encoding scheme is also used in CGGI, in order to encrypt a lookup table as a polynomial for use in programmable bootstrapping. But it can also be used (though it is rarely used) in the BGV and BFV schemes, and rarely because both of those schemes use the polynomial multiplication to have semantic meaning. When you encode RLWE with the coefficient encoding, polynomial multiplication corresponds to a convolution of the underlying cleartexts, when most of the time those schemes prefer that multiplication corresponds to some kind of point-wise multiplication. The next encoding will handle that exactly.

Evaluation encoding for RLWE

The evaluation encoding borrows ideas from the Discrete Fourier Transform literature. See this post for a little bit more about why the DFT and polynomial multiplication are related.

The evaluation encoding encodes a vector $(v_1, \dots, v_N)$ by interpreting it as the output value of a polynomial $p(x)$ at some implicitly determined, but fixed points. These points are usually the roots of unity of $x^N + 1$ in the ring $\mathbb{Z}/q\mathbb{Z}$ (recall, the coefficients of the polynomial ring), and one computes this by picking $q$ in such a way that guarantees the multiplicative group $(\mathbb{Z}/q\mathbb{Z})^\times$ has a generator, which plays the analogous role of a $2N$-th root of unity that you would normally see in the complex numbers.

Once you have the root of unity, you can convert from the evaluation form to a coefficient form (which many schemes need for the encryption step) via an inverse number-theoretic transform (INTT). And then, of course, one must scale the coefficients using the bit field encoding to give room for noise. The coefficient form here is considered the “encoded” version of the cleartext vector.

Aside: one can perform the bit field encoding step before or after the INTT, since the bitfield encoding is equivalent to multiplying by a constant, and scaling a polynomial by a constant is equivalent to scaling its point evaluations by the same constant. Polynomial evaluation is a linear function of the coefficients.

The evaluation encoding is the most commonly used encoding used for both the BGV and BFV schemes. And then after encryption is done, one usually NTT’s back to the evaluation representation so that polynomial multiplication can be more quickly implemented as entry-wise multiplication.

Rounded canonical embeddings for RLWE

This embedding is for a family of FHE schemes related to the CKKS scheme, which focuses on approximate computation.

Here the cleartext space and plaintext spaces change slightly. The cleartext space is $\mathbb{C}^{N/2}$, and the plaintext space is again $(\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1)$ for some machine-word-sized power of two $q$. As you’ll note, the cleartext space is continuous but the plaintext space is discrete, so this necessitates some sort of approximation.

Aside: In the literature you will see the plaintext space described as just $(\mathbb{Z}[x] / (x^N + 1)$, and while this works in principle, in practice doing so requires multiprecision integer computations, and ends up being slower than the alternative, which is to use a residue number system before encoding, and treat the plaintext space as $(\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1)$. I’ll say more about RNS encoding in the next section.

The encoding is easier to understand by first describing the decoding step. Given a polynomial $f \in (\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1)$, there is a map called the canonical embedding $\varphi: (\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1) \to \mathbb{C}^N$ that evaluates $f$ at the odd powers of a primitive $2N$-th root of unity. I.e., letting $\omega = e^{2\pi i / 2N}$, we have

\[ \varphi(f) = (f(\omega), f(\omega^3), f(\omega^5), \dots, f(\omega^{2N-1})) \]

Aside: My algebraic number theory is limited (not much beyond a standard graduate course covering Galois theory), but this paper has some more background. My understanding is that we’re viewing the input polynomials as actually sitting inside the number field $\mathbb{Q}[x] / (x^N + 1)$ (or else $q$ is a prime and the original polynomial ring is a field), and the canonical embedding is a specialization of a more general theorem that says that for any subfield $K \subset \mathbb{C}$, the Galois group $K/\mathbb{Q}$ is exactly the set of injective homomorphisms $K \to \mathbb{C}$. I don’t recall exactly why these polynomial quotient rings count as subfields of $\mathbb{C}$, and I think it is not completely trivial (see, e.g., this stack exchange question).

As specialized to this setting, the canonical embedding is a scaled isometry for the 2-norm in both spaces. See this paper for a lot more detail on that. This is a critical aspect of the analysis for FHE, since operations in the ciphertext space add perturbations (noise) in the plaintext space, and it must be the case that those perturbations decode to similar perturbations so that one can use bounds on noise growth in the plaintext space to ensure the corresponding cleartexts stay within some desired precision.

Because polynomials commute with complex conjugation ($f(\overline{z}) = \overline{f(z)}$), and roots of unity satisfy $\overline{\omega^k} = \omega^{-k}$, this canonical embedding is duplicating information. We can throw out the second half of the roots of unity and retain the same structure (the scaling in the isometry changes as well). The result is that the canonical embedding is defined $\varphi: (\mathbb{Z}/q\mathbb{Z})[x] / (x^N + 1) \to \mathbb{C}^{N/2}$ via

\[ \varphi(f) = (f(\omega), f(\omega^3), \dots, f(\omega^{N-1})) \]

Since we’re again using the bit-field encoding to scale up the inputs for noise, the decoding is then defined by applying the canonical embedding, and then applying bit-field decoding (scaling down).

This decoding process embeds the discrete polynomial space inside $\mathbb{C}^{N/2}$ as a lattice, but input cleartexts need not lie on that lattice. And so we get to the encoding step, which involves rounding to a point on the lattice, then inverting the canonical embedding, then applying the bit-field encoding to scale up for noise.

Using commutativity, one can more specifically implement this by first inverting the canonical embedding (which again uses an FFT-like operation), the result of which is in $\mathbb{C}[x] / (x^N + 1)$, then apply the bit-field encoding to scale up, then round the coefficients to be in $\mathbb{Z}[x] / (x^N + 1)$. As mentioned above, if you want the coefficients to be machine-word-sized integers, you’ll have to design this all to ensure the outputs are sufficiently small, and then treat the output as $\mathbb{Z}/q\mathbb{Z}[x] / (x^N + 1)$. Or else use a RNS mechanism.

Residue Number System Pre-processing

In all of the above schemes, the cleartext spaces can be too small for practical use. In the CGGI scheme, for example, a typical cleartext space is only 3 or 4 bits. Some FHE schemes manage this by representing everything in terms of boolean circuits, and pack inputs to various boolean gates in those bits. That is what I’ve mainly focused on, but it has the downside of increasing the number of FHE operations, requiring deeper circuits and more noise management operations, which are slow. Other approaches try to use the numerical structure of the ciphertexts more deliberately, and Sunzi’s Theorem (colloquially known as the Chinese Remainder Theorem) comes to the rescue here.

There will be two “cleartext” spaces floating around here, one for the “original” message, which I’ll call the “original” cleartext space, and one for the Sunzi’s-theorem-decomposed message, which I’ll call the “RNS” cleartext space (RNS for residue number system).

The original cleartext space size $M$ must be a product of primes or co-prime integers $M = m_1 \cdot \dots \cdot m_r$, with each $m_i$ being small enough to be compatible with the desired FHE’s encoding. E.g., for a bit-field encoding, $M$ might be large, but each $m_i$ would have to be at most a 4-bit prime (which severely limits how much we can decompose).

Then, we represent a single original cleartext message $x \in \mathbb{Z}/M\mathbb{Z}$ via its residues mod each $m_i$. I.e., $x$ becomes $r$ different cleartexts $(x \mod m_1, x \mod m_2, \dots, x \mod m_r)$ in the RNS cleartext space. From there we can either encode all the cleartexts in a single plaintext—the various RLWE encodings support this so long as $r < N$ (or $N/2$ for the canonical embedding))—or else encode them as difference plaintexts. In the latter case, the executing program needs to ensure the plaintexts are jointly processed. E.g., any operation that happens to one must happen to all, to ensure that the residues stay in sync and can be reconstructed at the end.

And finally, after decoding we use the standard reconstruction algorithm from Sunzi’s theorem to rebuild the original cleartext from the decoded RNS cleartexts.

I’d like to write a bit more about RNS decompositions and Sunzi’s theorem in a future article, because it is critical to how many FHE schemes operate, and influences a lot of their designs. For example, I glazed over how inverting the canonical embedding works in detail, and it is related to Sunzi’s theorem in a deep way. So more on that in the future.

The Gadget Decomposition in FHE

Lately I’ve been studying Fully Homomorphic Encryption, which is the miraculous ability to perform arbitrary computations on encrypted data without learning any information about the underlying message. It’s the most comprehensive private computing solution that can exist (and it does exist!).

The first FHE scheme by Craig Gentry was based on ideal lattices and was considered very complex (I never took the time to learn how it worked). Some later schemes (GSW = Gentry-Sahai-Waters) are based on matrix multiplication, and are conceptually much simpler. Even more recent FHE schemes build on GSW or use it as a core subroutine.

All of these schemes inject random noise into the ciphertext, and each homomorphic operation increases noise. Once the noise gets too big, you can no longer decrypt the message, and so every now and then you must apply a process called “bootstrapping” that reduces noise. It also tends to be the performance bottleneck of any FHE scheme, and this bottleneck is why FHE is not considered practical yet.

To help reduce noise growth, many FHE schemes like GSW use a technical construction dubbed the gadget decomposition. Despite the terribly vague name, it’s a crucial limitation on noise growth. When it shows up in a paper, it’s usually remarked as “well known in the literature,” and the details you’d need to implement it are omitted. It’s one of those topics.

So I’ll provide some details. The code from this post is on GitHub.

Binary digit decomposition

To create an FHE scheme, you need to apply two homomorphic operations to ciphertexts: addition and multiplication. Most FHE schemes admit one of the two operations trivially. If the ciphertexts are numbers as in RSA, you multiply them as numbers and that multiplies the underlying messages, but addition is not known to be possible. If ciphertexts are vectors as in the “Learning With Errors” scheme (LWE)—the basis of many FHE schemes—you add them as vectors and that adds the underlying messages. (Here the “Error” in LWE is synonymous with “random noise”, I will use the term “noise”) In LWE and most FHE schemes, a ciphertext hides the underlying message by adding random noise, and addition of two ciphertexts adds the corresponding noise. After too many unmitigated additions, the noise will grow so large it obstructs the message. So you stop computing, or you apply a bootstrapping operation to reduce the noise.

Most FHE schemes also allow you to multiply a ciphertext by an unencrypted constant $ A$, but then the noise scales by a factor of $ A$, which is undesirable if $ A$ is large. So you either need to limit the coefficients of your linear combinations by some upper bound, or use a version of the gadget decomposition.

The simplest version of the gadget decomposition works like this. Instead of encrypting a message $ m \in \mathbb{Z}$, you would encrypt $ m, 2m, 4m, …, 2^{k-1} m$ for some choice of $ k$, and then to multiply $ A < 2^k$ you write the binary digits of $ A = \sum_{i=0}^{k-1} a_i 2^i$ and you compute $ \sum_{i=0}^{k-1} a_i \textup{Enc}(2^i m)$. If the noise in each encryption is $ E$, and summing ciphertexts sums noise, then this trick reduces the noise growth from $ O(AE)$ to $ O(kE) = O(\log(A)E)$, at the cost of tracking $ k$ ciphertexts. (Calling the noise $ E$ is a bit of an abuse—in reality the error is sampled from a random distribution—but hopefully you see my point).

Some folks call the mapping $ \textup{PowersOf2}(m) = m \cdot (2^0, 2^1, 2^2, \dots, 2^{k-1})$, and for the sake of this article let’s call the operation of writing a number $ A$ in terms of its binary digits $ \textup{Bin}(A) = (a_0, \dots, a_{k-1})$ (note, the first digit is the least-significant bit, i.e., it’s a little-endian representation). Then PowersOf2 and Bin expand an integer product into a dot product, while shifting powers of 2 from one side to the other.

$ \displaystyle A \cdot m = \langle \textup{Bin}(A), \textup{PowersOf2}(m) \rangle$

This inspired the following “proof by meme” that I can’t resist including.

Working out an example, if the message is $ m=7$ and $ A = 100, k=7$, then $ \textup{PowersOf2}(7) = (7, 14, 28, 56, 112, 224, 448, 896)$ and $ \textup{Bin}(A) = (0,0,1,0,0,1,1,0)$ (again, little-endian), and the dot product is

$ \displaystyle 28 \cdot 1 + 224 \cdot 1 + 448 \cdot 1 = 700 = 7 \cdot 2^2 + 7 \cdot 2^5 + 7 \cdot 2^6$

A generalized gadget construction

One can generalize the binary digit decomposition to different bases, or to vectors of messages instead of a single message, or to include a subset of the digits for varying approximations. I’ve been puzzling over an FHE scheme that does all three. In my search for clarity I came across a nice paper of Genise, Micciancio, and Polyakov called “Building an Efficient Lattice Gadget Toolkit: Subgaussian Sampling and More“, in which they state a nice general definition.

Definition: For any finite additive group $ A$, an $ A$-gadget of size $ w$ and quality $ \beta$ is a vector $ \mathbf{g} \in A^w$ such that any group element $ u \in A$ can be written as an integer combination $ u = \sum_{i=1}^w g_i x_i$ where $ \mathbf{x} = (x_1, \dots , x_w)$ has norm at most $ \beta$.

The main groups considered in my case are $ A = (\mathbb{Z}/q\mathbb{Z})^n$, where $ q$ is usually $ 2^{32}$ or $ 2^{64}$, i.e., unsigned int sizes on computers for which we get free modulus operations. In this case, a $ (\mathbb{Z}/q\mathbb{Z})^n$-gadget is a matrix $ G \in (\mathbb{Z}/q\mathbb{Z})^{n \times w}$, and the representation $ x \in \mathbb{Z}^w$ of $ u \in (\mathbb{Z}/q\mathbb{Z})^n$ satisfies $ Gx = u$.

Here $ n$ and $ q$ are fixed, and $ w, \beta$ are traded off to make the chosen gadget scheme more efficient (smaller $ w$) or better at reducing noise (smaller $ \beta$). An example of how this could work is shown in the next section by generalizing the binary digit decomposition to an arbitrary base $ B$. This allows you to use fewer digits to represent the number $ A$, but each digit may be as large as $ B$ and so the quality is $ \beta = O(B\sqrt{w})$.

One commonly-used construction is to convert an $ A$-gadget to an $ A^n$-gadget using the Kronecker product. Let $ g \in A^w$ be an $ A$-gadget of quality $ \beta$. Then the following matrix is an $ A^n$-gadget of size $ nw$ and quality $ \sqrt{n} \beta$:

$ \displaystyle G = I_n \otimes \mathbf{g}^\top = \begin{pmatrix} g_1 & \dots & g_w & & & & & & & \\ & & & g_1 & \dots & g_w & & & & \\ & & & & & & \ddots & & & \\ & & & & & & & g_1 & \dots & g_w \end{pmatrix}$

Blank spaces represent zeros, for clarity.

An example with $ A = (\mathbb{Z}/16\mathbb{Z})$. The $ A$-gadget is $ \mathbf{g} = (1,2,4,8)$. This has size $ 4 = \log(q)$ and quality $ \beta = 2 = \sqrt{1+1+1+1}$. Then for an $ A^3$-gadget, we construct

Now given a vector $ (15, 4, 7) \in \mathbb{A}^3$ we write it as follows, where each little-endian representation is concatenated into a single vector.

$ \displaystyle \mathbf{x} = \begin{pmatrix} 1\\1\\1\\1\\0\\0\\1\\0\\1\\1\\1\\0 \end{pmatrix}$

And finally,

To use the definition more rigorously, if we had to write the matrix above as a gadget “vector”, it would be in column order from left to right, $ \mathbf{g} = ((1,0,0), (2,0,0), \dots, (0,0,8)) \in A^{wn}$. Since the vector $ \mathbf{x}$ can be at worst all 1’s, its norm is at most $ \sqrt{12} = \sqrt{nw} = \sqrt{n} \beta = 2 \sqrt{3}$, as claimed above.

A signed representation in base B

As we’ve seen, the gadget decomposition trades reducing noise for a larger ciphertext size. With integers modulo $ q = 2^{32}$, this can be fine-tuned a bit more by using a larger base. Instead of PowersOf2 we could define PowersOfB, where $ B = 2^b$, such that $ B$ divides $ 2^{32}$. For example, with $ b = 8, B = 256$, we would only need to track 4 ciphertexts. And the gadget decomposition of the number we’re multiplying by would be the little-endian digits of its base-$ B$ representation. The cost here is that the maximum entry of the decomposed representation is 255.

We can fine tune this a little bit more by using a signed base-$ B$ representation. To my knowledge this is not the same thing as what computer programmers normally refer to as a signed integer, nor does it have anything to do with the two’s complement representation of negative numbers. Rather, instead of the normal base-$ B$ digits $ n_i \in \{ 0, 1, \dots, B-1 \}$ for a number $ N = \sum_{i=0}^k n_i B^i$, the signed representation chooses $ n_i \in \{ -B/2, -B/2 + 1, \dots, -1, 0, 1, \dots, B/2 – 1 \}$.

Computing the digits is slightly more involved, and it works by shifting large coefficients by $ -B/2$, and “absorbing” the impact of that shift into the next more significant digit. E.g., if $ B = 256$ and $ N = 2^{11} – 1$ (all 1s up to the 10th digit), then the unsigned little-endian base-$ B$ representation of $ N$ is $ (255, 7) = 255 + 7 \cdot 256$. The corresponding signed base-$ B$ representation subtracts $ B$ from the first digit, and adds 1 to the second digit, resulting in $ (-1, 8) = -1 + 8 \cdot 256$. This works in general because of the following “add zero” identity, where $ p$ and $ q$ are two successive unsigned digits in the unsigned base-$ B$ representation of a number.

$ \displaystyle \begin{aligned} pB^{k-1} + qB^k &= pB^{k-1} – B^k + qB^k + B^k \\ &= (p-B)B^{k-1} + (q+1)B^k \end{aligned}$

Then if $ q+1 \geq B/2$, you’d repeat and carry the 1 to the next higher coefficient.

The result of all this is that the maximum absolute value of a coefficient of the signed representation is halved from the unsigned representation, which reduces the noise growth at the cost of a slightly more complex representation (from an implementation standpoint). Another side effect is that the largest representable number is less than $ 2^{32}-1$. If you try to apply this algorithm to such a large number, the largest digit would need to be shifted, but there is no successor to carry to. Rather, if there are $ k$ digits in the unsigned base-$ B$ representation, the maximum number representable in the signed version has all digits set to $ B/2 – 1$. In our example with $ B=256$ and 32 bits, the largest digit is 127. The formula for the max representable integer is $ \sum_{i=0}^{k-1} (B/2 – 1) B^i = (B/2 – 1)\frac{B^k – 1}{B-1}$.

max_digit = base // 2 - 1
max_representable = (max_digit 
    * (base ** (num_bits // base_log) - 1) // (base - 1)

A simple python implementation computes the signed representation, with code copied below, in which $ B=2^b$ is the base, and $ b = \log_2(B)$ is base_log.

def signed_decomposition(
  x: int, base_log: int, total_num_bits=32) -> List[int]:
    result = []
    base = 1 << base_log
    digit_mask = (1 << base_log) - 1
    base_over_2_threshold = 1 << (base_log - 1)
    carry = 0

    for i in range(total_num_bits // base_log):
        unsigned_digit = (x >> (i * base_log)) & digit_mask
        if carry:
            unsigned_digit += carry
            carry = 0

        signed_digit = unsigned_digit
        if signed_digit >= base_over_2_threshold:
            signed_digit -= base
            carry = 1

    return result

In a future article I demonstrate the gadget decomposition in action in a practical setting called key switching, which allows one to convert an LWE ciphertext encrypted with key $ s_1$ into an LWE ciphertext encrypted with a different key $ s_2$. This operation increases noise, and so the gadget decomposition is used to reduce noise growth. Key switching is used in FHE because some operations (like bootstrapping) have the side effect of switching the encryption key.

Until then!

Searching for RH Counterexamples — Unbounded Integers

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search strategies

In the last article, we improved our naive search from “try all positive integers” to enumerate a subset of integers (superabundant numbers), which RH counterexamples are guaranteed to be among. These numbers grow large, fast, and we quickly reached the limit of what 64 bit integers can store.

Unbounded integer arithmetic is possible on computers, but it requires a special software implementation. In brief, you represent numbers in base-N for some large N (say, $ 2^{32}$), and then use a 32-bit integer for each digit. Arithmetic on such quantities emulates a ripple-carry adder, which naturally requires linear time in the number of digits of each operand. Artem Golubin has a nice explanation of how Python does it internally.

So Python can handle unbounded integer arithmetic, but neither numba nor our database engine do. Those both crash when exceeding 64-bit integers This is a problem because we won’t be able to store the results of our search without being able to put it in a database. This leaves us with a classic software engineering problem. What’s the path forward?

Exploring Alternatives

The impulse answer is to do as little as possible to make the damn thing work. In a situation where the software you’re writing is a prototype, and you expect it to be rewritten from scratch in the future, this is an acceptable attitude. That said, experienced engineers would caution you that, all too often, such “prototypes” are copy-pasted to become janky mission-critical systems for years.

In pretending this is the “real thing,” let’s do what real engineers would do and scope out some alternatives before diving in. The two aspects are our database and the use of numba for performance.

Let’s start with the database. A quick and dirty option: store all numbers as text strings in the database. There’s no limit on the size of the number in that case. The benefit: we don’t need to use a different database engine, and most of our code stays the same. The cost: we can’t use numeric operations in database queries, which would make further analysis and fetching awkward. In particular, we can’t even apply sorting operations, since text strings are sorted lexicographically (e.g., 100, 25) while numbers are sorted by magnitude (25, 100). Note, we applied this “numbers as text” idea to the problem of serializing the search state, and it was hacky there, too.

A second option is to find a database engine with direct support for unbounded-integer arithmetic. The benefit: fast database queries and the confidence that it will support future use cases well. The cost: if our existing sqlite-based interface doesn’t work with the new database engine, we’d have to write another implementation of our database interface.

For numba, we have at least three options. First, fall back to native python arithmetic, which is slow. Second, implement arbitrary-precision arithmetic in Python in a way that numba can compile it. Third, find (or implement) a C-implementation of arbitrary precision integer arithmetic, provide Python bindings, and optionally see if it can work with (or replace) numba. As I write this I haven’t yet tried any of these options. My intuition tells me the best way to go would be to find “proper” support for arbitrary precision integers.

For the database, I recall that the Postgres database engine supports various extensions, for example this extension that adds support for geographic objects. Postgres’s extension framework demonstrates an important software engineering principle that many of the best projects follow: “closed for modification, open for extension.” That is, Postgres is designed so that others can contribute new features to Postgres without requiring the Postgres team to do anything special—specifically, they don’t have to change Postgres to accommodate it. The name for this sometimes goes by extensions, or plug-ins, hooks, or (at a lower level) callbacks. Github Actions is a good example of this.

Geographic objects are almost certainly more complicated than arbitrary precision integers, so chances are good a Postgres extension exists for the latter. Incorporating it would involve migrating to Postgres, finding and installing that extension, and then converting the C library representation above to whatever representation Postgres accepts in a query.

A good route will also ensure that we need not change our tests too much, since all we’re doing here is modifying implementations. We’ll see how well that holds up.

gmp and pgmp

After some digging, I found GMP (GNU Multiple Precision), a C library written by Torbjörn Granlund. It has a Python bindings library called gmpy that allows Python to use an “mpz” (“Multiple Precision $ \mathbb{Z}$”) type as a drop-in replacement for Python integers. And I found a PostgreSQL extension called pgmp. The standard Python library for Postgres is psycopg2, which was written by the same person who wrote pgmp, Daniele Varrazzo.

To start, I ran a timing test of gmpy, which proves to be as fast as numba. This pull request has the details.

It took a small bit of kicking to get pgmp to install, but then I made a test database that uses the new column type mpz and stores the value $ 2^{513}$.

postgres=# create database pgmp_test;
postgres=# \connect pgmp_test;
You are now connected to database "pgmp_test" as user "jeremy".
pgmp_test=# CREATE EXTENSION pgmp;
pgmp_test=# create table test_table (id int4, value mpz);
pgmp_test=# insert into test_table
pgmp_test-# values (1, 2::mpz ^ 513);
pgmp_test=# select * from test_table;
 id |                                                                            value                                                                            
  1 | 26815615859885194199148049996411692254958731641184786755447122887443528060147093953603748596333806855380063716372972101707507765623893139892867298012168192
(1 row)

Now I’m pretty confident this approach will work.

This pull request includes the necessary commits to add a postgres implementation of our database interface, add tests (which is a minor nuisance).

Then this pull request converts the main divisor computation functions to use gmpy, and this final commit converts the main program to use the postgres database.

This exposed one bug, that I wasn’t converting the new mpz types properly in the postgres sql query. This commit fixes it, and this commit adds a regression test to catch that specific error going forward.

Results and next steps

With all that work, I ran the counterexample search for a few hours.

When I stopped it, it had checked all possibly-superabundant numbers whose prime factorizations have at most 75 prime factors, including multiplicity. Since all possible counterexamples to the RH must be superabundant, and all superabundant numbers have the aforementioned special prime factorization, we can say it more simply. I ruled out all positive integers whose prime factorization has at most 75 factors.

The top 10 are:

divisor=# select n, witness_value 
from RiemannDivisorSums 
where witness_value > 1.7 and n > 5040 
order by witness_value desc 
limit 10;
                                                                         n                                                                          |   witness_value    
 7837096340441581730115353880089927210115664131849557062713735873563599357074419807246597145310377220030504976899588686851652680862494751024960000  | 1.7679071291526642
 49445402778811241199465955079431717413978953513246416799455746836363402883750282695562127099750014006501608687063651021146073696293342277760000    |  1.767864530684858
 24722701389405620599732977539715858706989476756623208399727873418181701441875141347781063549875007003250804343531825510573036848146671138880000    |  1.767645098171234
 157972532839652527793820942745788234549453525601426251755449670403716942120607931934703281468849885004797471843653837128262216282087355520000      | 1.7676163327497005
 2149800120817880052150693699105726844086041457097670295628510732015800125380447073720092482597826695934852551611463087875916247664927925120000     |  1.767592584103948
 340743319149633988265884951308257704787637570949980741857118951024504319872800861184634658491755531305674129430416899428332725254891076131520000   |  1.767582883432923
 23511289021324745190346061640269781630346992395548671188141207620690798071223259421739791435931131660091514930698766060554958042587484253074880000 | 1.7674462177172812
 507950266365442211555694349664913937458049921547994378634886400011951582381375986928306371282475514484879330686989829994412271003496320000         | 1.7674395010995763
 78986266419826263896910471372894117274726762800713125877724835201858471060303965967351640734424942502398735921826918564131108141043677760000       | 1.7674104158678667
 6868370993028370773644388815034271067367544591366358771976072626248562700895997040639273107341299348034672688854514657750531142699450240000        | 1.7674059308384011

This is new. We’ve found quite a few numbers that have a better witness value than $ n = 10080$ which achieves ~1.7558. The best is


which achieves ~1.7679. Recall the 1.781 threshold needed to be a RH counterexample. We’re about 50% of the way toward disproving RH. How much more work could it take?

But seriously, what’s next with this project? For one, even though we have some monstrous numbers and their divisor sums and witness values, it’s hard to see the patterns in them through a SQL queries. It would be nicer to make some pretty plots.

I could also take a step back and see what could be improved from a software engineering standpoint. For one, not all parts of the application are tested, and tests aren’t automatically run when I make changes. This enabled the bug above where I didn’t properly convert mpz types before passing them to SQL upsert statements. For two, while I have been using type annotations in some places, they aren’t checked, and the switch to mpz has almost certainly made many of the type hints incorrect. I could fix that and set up a test that type checks.

Finally, in the interest of completeness, I could set up a front end that displays some summary of the data, and then deploy the whole application so that it has a continuously-running background search, along with a website that shows how far along the search is. Based on how long the SQL statement to find the top 10 witness values took, this would also likely require some caching, which fits snugly in the class of typical software engineering problems.

Let me know what you’re interested in.

Searching for RH Counterexamples — Adding a Database

In the last article we set up pytest for a simple application that computes divisor sums $ \sigma(n)$ and tries to disprove the Riemann Hypothesis. In this post we’ll show how to extend the application as we add a database dependency. The database stores the computed sums so we can analyze them after our application finishes.

As in the previous post, I’ll link to specific git commits in the final code repository to show how the project evolves. You can browse or checkout the repository at each commit to see how it works.

Interface before implementation

The approach we’ll take is one that highlights the principle of good testing and good software design: separate components by thin interfaces so that the implementations of those interfaces can change later without needing to update lots of client code.

We’ll take this to the extreme by implementing and testing the logic for our application before we ever decide what sort of database we plan to use! In other words, the choice of database will be our last choice, making it inherently flexible to change. That is, first we iron out a minimal interface that our application needs, and then choose the right database based on those needs. This is useful because software engineers often don’t understand how the choice of a dependency (especially a database dependency) will work out long term, particularly as a prototype starts to scale and hit application-specific bottlenecks. Couple this with the industry’s trend of chasing hot new fads, and eventually you realize no choice is sacred. Interface separation is the software engineer’s only defense, and their most potent tool for flexibility. As a side note, Tom Gamon summarizes this attitude well in a recent article, borrowing the analogy from a 1975 investment essay The Winner’s Game by Charles Ellis. Some of his other articles reinforce the idea that important decisions should be made as late as possible, since that is the only time you know enough to make those decisions well.

Our application has two parts so far: adding new divisor sums to the database, and loading divisor sums for analysis. Since we’ll be adding to this database over time, it may also be prudent to summarize the contents of the database, e.g. to say what’s the largest computed integer. This suggests the following first-pass interface, implemented in this commit.

class DivisorDb(ABC):
    def load() -> List[RiemannDivisorSum]:
        '''Load the entire database.'''

    def upsert(data: List[RiemannDivisorSum]) -> None:
        '''Insert or update data.'''

    def summarize() -> SummaryStats:
        '''Summarize the contents of the database.'''

RiemannDivisorSum and SummaryStats are dataclasses. These are special classes that are intended to have restricted behavior: storing data and providing simple derivations on that data. For us this provides a stabler interface because the contents of the return values can change over time without interrupting other code. For example, we might want to eventually store the set of divisors alongside their sum. Compare this to returning a list or tuple, which is brittle when used with things like tuple assignment.

The other interesting tidbit about the commit is the use of abstract base classes (“ABC”, an awful name choice). Python has limited support for declaring an “interface” as many other languages do. The pythonic convention was always to use its “duck-typing” feature, which meant to just call whatever methods you want on an object, and then any object that supports has those methods can be used in that spot. The mantra was, “if it walks like a duck and talks like a duck, then it’s a duck.” However, there was no way to say “a duck is any object that has a waddle and quack method, and those are the only allowed duck functions.” As a result, I often saw folks tie their code to one particular duck implementation. That said, there were some mildly cumbersome third party libraries that enabled interface declarations. Better, recent versions of Python introduced the abstract base class as a means to enforce interfaces, and structural subtyping (typing.Protocol) to interact with type hints when subtyping directly is not feasible (e.g., when the source is in different codebases).

Moving on, we can implement an in-memory database that can be used for testing. This is done in this commit. One crucial aspect of these tests is that they do not rely on the knowledge that the in-memory database is secretly a dictionary. That is, the tests use only the DivisorDb interface and never inspect the underlying dict. This allows the same tests to run against all implementations, e.g., using pytest.parameterize. Also note it’s not thread safe or atomic, but for us this doesn’t really matter.

Injecting the Interface

With our first-pass database interface and implementation, we can write the part of the application that populates the database with data. A simple serial algorithm that computes divisor sums in batches of 100k until the user hits Ctrl-C is done in this commit.

def populate_db(db: DivisorDb, batch_size: int = 100000) -> None:
    '''Populate the db in batches.'''
    starting_n = (db.summarize().largest_computed_n or 5040) + 1
    while True:
        ending_n = starting_n + batch_size
        db.upsert(compute_riemann_divisor_sums(starting_n, ending_n))
        starting_n = ending_n + 1

I only tested this code manually. The reason is that line 13 (highlighted in the abridged snippet above) is the only significant behavior not already covered by the InMemoryDivisorDb tests. (Of course, that line had a bug later fixed in this commit). I’m also expecting to change it soon, and spending time testing vs implementing features is a tradeoff that should not always fall on the side of testing.

Next let’s swap in a SQL database. We’ll add sqlite3, which comes prepackaged with python, so needs no dependency management. The implementation in this commit uses the same interface as the in-memory database, but the implementation is full of SQL queries. With this, we can upgrade our tests to run identically on both implementations. The commit looks large, but really I just indented all the existing tests, and added the pytest parameterize annotation to the class definition (and corresponding method arguments). This avoids adding a parameterize annotation to every individual test function—which wouldn’t be all that bad, but each new test would require the writer to remember to include the annotation, and this way systematically requires the extra method argument.

And finally, we can switch the database population script to use the SQL database implementation. This is done in this commit. Notice how simple it is, and how it doesn’t require any extra testing.

After running it a few times and getting a database with about 20 million rows, we can apply the simplest possible analysis: showing the top few witness values.

sqlite> select n, witness_value from RiemannDivisorSums where witness_value > 1.7 order by witness_value desc limit 100;

We can also confirm John’s claim that “the winners are all multiples of 2520,” as the best non-multiple-of-2520 up to 20 million is 18480, whose witness value is only about 1.69.

This multiple-of-2520 pattern is probably because 2520 is a highly composite number, i.e., it has more divisors than all smaller numbers, so its sum-of-divisors will tend to be large. Digging in a bit further, it seems the smallest counterexample, if it exists, is necessarily a superabundant number. Such numbers have a nice structure described here that suggests a search strategy better than trying every number.

Next time, we can introduce the concept of a search strategy as a new component to the application, and experiment with different search strategies. Other paths forward include building a front-end component, and deploying the system on a server so that the database can be populated continuously.