Searching for RH Counterexamples — Deploying with Docker

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search Strategies
  4. Unbounded Integers

In this article we’ll deploy the application on a server, so that it can search for RH counterexamples even when I close my laptop.

Servers and containers

When deploying applications to servers, reproducibility is crucial. You don’t want your application to depend on the details of the computer it’s running on. This is a higher-level version of the same principle behind Python virtual environments, but it applies to collections of programs, possibly written in different languages and running on different computers. In our case, we have a postgres database, the pgmp extension, the populate_database program, and plans for a web server.

The principle of an application not depending on the system it’s running on is called hermeticity (the noun form of hermetic, meaning air-tight). Hermeticity is good for the following reasons. When a server crashes, you don’t have to remember what you did to set it up. Instead you run a build/install that works on any machine. Newcomers also don’t have to guess what unknown aspects of a running server are sensitive. Another benefit is that you can test it on your local machine identically to how it will run in production. It also allows you to easily migrate from one cloud provider to another, which allows you to defer expensive commitments until you have more information about your application’s needs. Finally, if you have multiple applications running on the same server, you don’t want to have their needs conflict with each other, which can happen easily if two applications have dependencies that transitively depend on different versions of the same software. This is called “dependency hell.” In all of these, you protect yourself from becoming dependent on arbitrary choices you made before you knew better.

One industry-strength approach to hermeticity is to use containers. A container is a virtual machine devoted to running a single program, with explicitly-defined exposure to the outside world. We will set up three containers: one to run the database, one for the search application, and (later) one for a web server. We’ll start by deploying them all on the same machine, but could also deploy them to different machines. Docker is a popular containerization system. Before going on, I should stress that Docker, while I like it, is not sacred by any means. In a decade Docker may disappear, but the principle of hermeticity and the need for reproducible deployments will persist.

Docker allows you to describe your container by first starting from an existing (trusted) container, such as one that has an operating system and postgres already installed, and extend it for your application. This includes installing dependencies, fetching the application code from git, copying files into the container from the host system, exposing the container’s network ports, and launching the application. You save the commands that accomplish that in a Dockerfile with some special syntax. To deploy it, you copy the Dockerfile to the server (say, via git) and run docker commands to launch the container. You only have to get the Dockerfile right once, you can test it locally, and then it will work on any server just the same. The only caveat I’ve seen here is that if you migrate to a server with a different processor architecture, the install script (in our case, pip install numba) may fail to find a pre-compiled binary for the target architecture, and it may fall back to compiling from source, which can add additional requirements or force you to change which OS your container is derived from.

This reduces our “set up a new server” script to just a few operations: (1) install docker (2) fetch the repository (3) launch the docker containers from their respective Dockerfiles. In my experience, writing a Dockerfile is no small task, but figuring out how to install stuff is awful in all cases, and doing it for Docker gives you an artifact tracing the steps, and a reasonable expectation of not having to do it again.

Thankfully, you dear readers can skip my head-banging and see the Dockerfiles after I figured it out.

The Postgres Dockerfile

This commit adds a Dockerfile for the database, and makes some small changes to the project to allow it to run. It has only 15 lines, but it took me a few hours to figure out. The process was similar to installing confusing software on your own machine: try to install, see some error like "missing postgres.h“, go hunt around on the internet to figure out what you have to install to get past the error, and repeat.

Let’s go through each line of the Dockerfile.

FROM postgres:12

The first line defines the container image that this container starts from, which is officially maintained by the Postgres team. Looking at their Dockerfile, it starts from debian:buster-slim, which is a Debian Linux instance that is “slimmed” down to be suitable for docker containers, meaning it has few packages pre-installed. Most importantly, “Debian” tells us what package manager to use (apt-get) in our Dockerfile.

It’s also worth noting at this point that, when docker builds the container, each command in a docker file results in a new image. An image is a serialized copy of all the data in a docker container, so that it can be started or extended easily. And if you change a line halfway through your Dockerfile, docker only has to rebuild images from that step onward. You can publish images on the web, and other docker users can use them as a base. This is like forking a project on Github, and is exactly what happens when Docker executes FROM postgres:12.

ENV POSTGRES_USER docker
ENV POSTGRES_PASSWORD docker
ENV POSTGRES_DB divisor

These lines declare configuration for the database that the base postgres image will create when the container is started. The variable names are described in the “Environment Variables” section of the Postgres image’s documentation. The ENV command tells docker to instantiate environment variables (like the PATH variable in a terminal shell), that running programs can access. I’m insecurely showing the password and username here because the server the docker containers will run on won’t yet expose anything to the outside world. Later in this post you will see how to pass an environment variable from the docker command line when the container is run, and you would use something close to that to set configuration secrets securely.

RUN apt-get update \
        && apt-get install -y pgxnclient build-essential libgmp3-dev postgresql-server-dev-12 libmpc-dev

The RUN command allows you to run any shell command you’d like, in this case a command to update apt and install the dependencies needed to build the pgmp extension. This includes gcc and make via build-essential, and the gmp-specific libraries.

RUN apt-get install -y python3.7 python3-setuptools python3-pip python-pip python3.7-dev \
        && pip3 install wheel \
        && pip install six

Next we do something a bit strange. We install python3.7 and pip (because we will need to pip3 install our project’s requirements.txt), but also python2’s pip. Here’s what’s going on. The pgmp postgres extension needs to be built from source, and it has a dependency on python2.7 and the python2-six library. So the first RUN line here installs all the python-related tools we need.

RUN pgxn install pgmp

Then we install the pgmp extension.

COPY . /divisor
WORKDIR "/divisor"

These next two lines copy the current directory on the host machine to the container’s file system, and sets the working directory for all future commands to that directory. Note that whenever the contents of our project change, docker needs to rebuild the image from this step because any subsequent steps like pip install -r requirements.txt might have a different outcome.

RUN python3 -m pip install --upgrade pip
RUN pip3 install -r requirements.txt

Next we upgrade pip (which is oddly required for the numba dependency, though I can’t re-find the Github issue where I discovered this) and install the python dependencies for the project. The only reason this is required is because we included the database schema setup in the python script riemann/postgres_batabase.py. So this makes the container a bit more complicated than absolutely necessary. It can be improved later if need be.

ENV PGUSER=docker
ENV PGPASSWORD=docker
ENV PGDATABASE=divisor

These next lines are environment variables used by the psycopg2 python library to infer how to connect to postgres if no database spec is passed in. It would be nice if this was shared with the postgres environment variables, but duplicating it is no problem.

COPY setup_schema.sh /docker-entrypoint-initdb.d/

The last line copies a script to a special directory specified by the base postgres Dockerfile. The base dockerfile specifies that any scripts in this directory will be run when the container is started up. In our case, we just call the (idempotent) command to create the database. In a normal container we might specify a command to run when the container is started (our search container, defined next, will do this), but the postgres base image handles this for us by starting the postgres database and exposing the right ports.

Finally we can build and run the container

docker build -t divisordb -f divisordb.Dockerfile .
#  ... lots of output ...

docker run -d -p 5432:5432 --name divisordb divisordb:latest

After the docker build command—which will take a while—you will be able to see the built images by running docker images, and the final image will have a special tag divisordb. The run command additionally tells docker to run the container as a daemon (a.k.a. in the background) with -d and to -p to publish port 5432 on the host machine and map it to 5432 on the container. This allows external programs and programs on other computers to talk to the container by hitting 0.0.0.0:5432. It also allows other containers to talk to this container, but as we’ll see shortly that requires a bit more work, because inside a container 0.0.0.0 means the container, not the host machine.

Finally, one can run the following code on the host machine to check that the database is accepting connections.

pg_isready --host 0.0.0.0 --username docker --port 5432 --dbname divisor

If you want to get into the database to run queries, you can run psql with the same flags as pg_isready, or manually enter the container with docker exec -it divisordb bash and run psql from there.

psql --host 0.0.0.0 --username docker --port 5432 --dbname divisor
Password for user docker: docker
divisor=# \d
              List of relations
 Schema |        Name        | Type  | Owner  
--------+--------------------+-------+--------
 public | riemanndivisorsums | table | docker
 public | searchmetadata     | table | docker
(2 rows)

Look at that. You wanted to disprove the Riemann Hypothesis, and here you are running docker containers.

The Search Container

Next we’ll add a container for the main search application. Before we do this, it will help to make the main entry point to the program a little bit simpler. This commit modifies populate_database.py‘s main routine to use argparse and some sensible defaults. Now we can run the application with just python -m riemann.populate_database.

Then the Dockerfile for the search part is defined in this commit. I’ve copied it below. It’s much simpler than the database, but somehow took just as long for me to build as the database Dockerfile, because I originally chose a base image called “alpine” that is (unknown to me at the time) really bad for Python if your dependencies have compiled C code, like numba does.

FROM python:3.7-slim-buster

RUN apt-get update \
        && apt-get install -y build-essential libgmp3-dev libmpc-dev

COPY . /divisor
WORKDIR "/divisor"

RUN pip3 install -r requirements.txt

ENV PGUSER=docker
ENV PGPASSWORD=docker
ENV PGDATABASE=divisor

ENTRYPOINT ["python3", "-m", "riemann.populate_database"]

The base image is again Debian, with Python3.7 pre-installed.

Then we can build it and (almost) run it

docker build -t divisorsearch -f divisorsearch.Dockerfile .
docker run -d --name divisorsearch --env PGHOST="$PGHOST" divisorsearch:latest 

What’s missing here is the PGHOST environment variable, which psycopg2 uses to find the database. The problem is, inside the container “localhost” and 0.0.0.0 are interpreted by the operating system to mean the container itself, not the host machine. To get around this problem, docker maintains IP addresses for each docker container, and uses those to route network requests between containers. The docker inspect command exposes information about this. Here’s a sample of the output

$ docker inspect divisordb
[
    {
        "Id": "f731a78bde50be3de1d77ae1cff6d23c7fe21d4dbe6a82b31332c3ef3f6bbbb4",
        "Path": "docker-entrypoint.sh",
        "Args": [
            "postgres"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            ...
        },
        ...
        "NetworkSettings": {
            ...
            "Ports": {
                "5432/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "5432"
                    }
                ]
            },
            ...
            "IPAddress": "172.17.0.2",
            ...
        }
    }
]

The part that matters for us is the ip address, and the following extracts it to the environment variable PGHOST.

export PGHOST=$(docker inspect -f "{{ .NetworkSettings.IPAddress }}" divisordb)

Once the two containers are running—see docker ps for the running containers, docker ps -a to see any containers that were killed due to an error, and docker logs to see the container’s logged output—you can check the database to see it’s being populated.

divisor=# select * from SearchMetadata order by start_time desc limit 10;
         start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state 
----------------------------+----------------------------+-------------------------------+-----------------------+---------------------
 2020-12-27 03:10:01.256996 | 2020-12-27 03:10:03.594773 | SuperabundantEnumerationIndex | 29,1541               | 31,1372
 2020-12-27 03:09:59.160157 | 2020-12-27 03:10:01.253247 | SuperabundantEnumerationIndex | 26,705                | 29,1541
 2020-12-27 03:09:52.035991 | 2020-12-27 03:09:59.156464 | SuperabundantEnumerationIndex | 1,0                   | 26,705

Ship it!

I have an AWS account, so let’s use Amazon for this. Rather than try the newfangled beanstalks or lightsails or whatever AWS-specific frameworks they’re trying to sell, for now I’ll provision a single Ubuntu EC2 server and run everything on it. I picked a t2.micro for testing (which is free). There’s a bit of setup to configure and launch the server—such as picking the server image, downloading an ssh key, and finding the IP address. I’ll skip those details since they are not (yet) relevant to the engineering process.

Once I have my server, I can ssh in, install docker, git clone the project, and run the deploy script.

# install docker, see get.docker.com
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu

# log out and log back in

git clone https://github.com/j2kun/riemann-divisor-sum && cd riemann-divisor-sum
bash deploy.sh

And it works!

Sadly, within an hour the divisorsearch container crashes because the instance runs out of RAM and CPU. Upgrading to a t2.medium (4 GiB RAM), it goes for about 2 hours before exhausting RAM. We could profile it and find the memory hotspots, but instead let’s apply a theorem due to billionaire mathematician Jim Simons: throw money at the problem. Upgrading to a r5.large (16 GiB RAM), and it runs comfortably all day.

Four days later, logging back into the VM and and I notice things are sluggish, even though the docker instance isn’t exhausting the total available RAM or CPU. docker stats also shows low CPU usage on divisorsearch. The database shows that it has only got up to 75 divisors, which is just as far as it got when I ran it (not in Docker) on my laptop for a few hours in the last article.

Something is amiss, and we’ll explore what happened next time.

Notes

A few notes on improvements that didn’t make it into this article.

In our deployment, we rebuild the docker containers each time, even when nothing changes. What one could do instead is store the built images in what’s called a container registry, and pull them instead of re-building them on every deploy. This would only save us a few minutes of waiting, but is generally good practice.

We could also use docker compose and a corresponding configuration file to coordinate launching a collection of containers that have dependencies on each other. For our case, the divisorsearch container depended on the divisordb container, and our startup script added a sleep 5 to ensure the latter was running before starting the former. docker compose would automatically handle that, as well as the configuration for naming, resource limits, etc. With only two containers it’s not that much more convenient, given that docker compose is an extra layer of indirection to learn that hides the lower-level commands.

In this article we deployed a single database container and a single “search” container. Most of the time the database container is sitting idle while the search container does its magic. If we wanted to scale up, an obvious way would be to have multiple workers. But it would require some decent feature work. A sketch: reorganize the SearchMetadata table so that it contains a state attribute, like “not started”, “started”, or “finished,” then add functionality so that a worker (atomically) asks for the oldest “not started” block and updates the row’s state to “started.” When a worker finishes a block, it updates the database and marks the block as finished. If no “not started” blocks are found, the worker proceeds to create some number of new “not started” blocks. There are details to be ironed out around race conditions between multiple workers, but Postgres is designed to make such things straightforward.

Finally, we could reduce the database size by keeping track of a summary of a search block instead of storing all the data in the block. For example, we could record the n and witness_value corresponding to the largest witness_value in a block, instead of saving every n and every witness_value. In order for this to be usable—i.e., for us to be able to say “we checked all possible n < M and found no counterexamples”—we’d want to provide a means to verify the approach, say, by randomly verifying the claimed maxima of a random subset of blocks. However, doing this also precludes the ability to analyze patterns that look at all the data. So it’s a tradeoff.

Optimization Models for Subset Cover

In a recent newsletter article I complained about how researchers mislead about the applicability of their work. I gave SAT solvers as an example. People provided interesting examples in response, but what was new to me was the concept of SMT (Satisfiability Modulo Theories), an extension to SAT. SMT seems to have more practical uses than vanilla SAT (see the newsletter for details).

I wanted to take some time to explore SMT solvers, and I landed on Z3, an open-source SMT solver from Microsoft. In particular, I wanted to compare it to ILP (Integer Linear Programing) solvers, which I know relatively well. I picked a problem that I thought would work better for SAT-ish solvers than for ILPs: subset covering (explained in the next section). If ILP still wins against Z3, then that would be not so great for the claim that SMT is a production strength solver.

All the code used for this post is on Github.

Subset covering

A subset covering is a kind of combinatorial design, which can be explained in terms of magic rings.

An adventurer stumbles upon a chest full of magic rings. Each ring has a magical property, but some pairs of rings, when worn together on the same hand, produce a combined special magical effect distinct to that pair.

The adventurer would like to try all pairs of rings to catalogue the magical interactions. With only five fingers, how can we minimize the time spent trying on rings?

Mathematically, the rings can be described as a set X of size n. We want to choose a family F of subsets of X, with each subset having size 5 (five fingers), such that each subset of X of size 2 (pairs of rings) is contained in some subset of F. And we want F to be as small as possible.

Subset covering is not a “production worthy” problem. Rather, I could imagine it’s useful in some production settings, but I haven’t heard of one where it is actually used. I can imagine, for instance, that a cluster of machines has some bug occurring seemingly at random for some point-to-point RPCs, and in tracking down the problem, you want to deploy a test change to subsets of servers to observe the bug occurring. Something like an experiment design problem.

If you generalize the “5” in “5 fingers” to an arbitrary positive integer k, and the “2” in “2 rings” to l < k, then we have the general subset covering problem. Define M(n, k, l) to be the minimal number of subsets of size k needed to cover all subsets of size l. This problem was studied by Erdős, with a conjecture subsequently proved by Vojtěch Rödl, that asymptotically M(n,k,l) grows like \binom{n}{l} / \binom{k}{l}. Additional work by Joel Spencer showed that a greedy algorithm is essentially optimal.

However, all of the constructive algorithms in these proofs involve enumerating all \binom{n}{k} subsets of X. This wouldn’t scale very well. You can alternatively try a “random” method, incurring a typically \log(r) factor of additional sets required to cover a 1 - 1/r fraction of the needed subsets. This is practical, but imperfect.

To the best of my knowledge, there is no exact algorithm, that both achieves the minimum and is efficient in avoiding constructing all \binom{n}{k} subsets. So let’s try using an SMT solver. I’ll be using the Python library for Z3.

Baseline: brute force Z3

For a baseline, let’s start with a simple Z3 model that enumerates all the possible subsets that could be chosen. This leads to an exceedingly simple model to compare the complex models against.

Define boolean variables \textup{Choice}_S which is true if and only if the subset S is chosen (I call this a “choice set”). Define boolean variables \textup{Hit}_T which is true if the subset T (I call this a “hit set”) is contained in a chosen choice set. Then the subset cover problem can be defined by two sets of implications.

First, if \textup{Choice}_S is true, then so must all \textup{Hit}_T for T \subset S. E.g., for S = \{ 1, 2, 3 \} and l=2, we get

\displaystyle \begin{aligned} \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(1,2)} \\ \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(1,3)} \\ \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(2,3)} \end{aligned}

In Python this looks like the following (note this program has some previously created lookups and data structures containing the variables)

for choice_set in choice_sets:
  for hit_set_key in combinations(choice_set.elements, hit_set_size):
    hit_set = hit_set_lookup[hit_set_key]
    implications.append(
      z3.Implies(choice_set.variable, hit_set.variable))

Second, if \textup{Hit}_T is true, it must be that some \textup{Choice}_S is true for some S containing T as a subset. For example,

\displaystyle \begin{aligned} \textup{Hit}_{(1,2)} &\implies \\ & \textup{Choice}_{(1,2,3,4)} \textup{ OR} \\ & \textup{Choice}_{(1,2,3,5)} \textup{ OR} \\ & \textup{Choice}_{(1,2,4,5)} \textup{ OR } \cdots \\ \end{aligned}

In code,

for hit_set in hit_sets.values():
  relevant_choice_set_vars = [
    choice_set.variable
    for choice_set in hit_set_to_choice_set_lookup[hit_set]
  ]
  implications.append(
    z3.Implies(
      hit_set.variable, 
      z3.Or(*relevant_choice_set_vars)))

Next, in this experiment we’re allowing the caller to specify the number of choice sets to try, and the solver should either return SAT or UNSAT. From that, we can use a binary search to find the optimal number of sets to pick. Thus, we have to limit the number of \textup{Choice}_S that are allowed to be true and false. Z3 supports boolean cardinality constraints, apparently with a special solver to handle problems that have them. Otherwise, the process of encoding cardinality constraints as SAT formulas is not trivial (and the subject of active research). But the code is simple enough:

args = [cs.variable for cs in choice_sets] + [parameters.num_choice_sets]
choice_sets_at_most = z3.AtMost(*args)
choice_sets_at_least = z3.AtLeast(*args)

Finally, we must assert that every \textup{Hit}_T is true.

solver = z3.Solver()
for hit_set in hit_sets.values():
  solver.add(hit_set.variable)

for impl in implications:
  solver.add(impl)

solver.add(choice_sets_at_most)
solver.add(choice_sets_at_least)

Running it for n=7, k=3, l=2, and seven choice sets (which is optimal), we get

>>> SubsetCoverZ3BruteForce().solve(
  SubsetCoverParameters(
    num_elements=7,
    choice_set_size=3,
    hit_set_size=2,
    num_choice_sets=7))
[(0, 1, 3), (0, 2, 4), (0, 5, 6), (1, 2, 6), (1, 4, 5), (2, 3, 5), (3, 4, 6)]
SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.018305063247680664)

Interestingly, Z3 refuses to solve marginally larger instances. For instance, I tried the following and Z3 times out around n=12, k=6 (about 8k choice sets):

from math import comb

for n in range(8, 16):
  k = int(n / 2)
  l = 3
  max_num_sets = int(2 * comb(n, l) / comb(k, l))
  params = SubsetCoverParameters(
    num_elements=n,
    choice_set_size=k,
    hit_set_size=l,                                   
    num_choice_sets=max_num_sets)

    print_table(
      params, 
      SubsetCoverZ3BruteForce().solve(params), 
      header=(n==8))

After taking a long time to generate the larger models, Z3 exceeds my 15 minute time limit, suggesting exponential growth:

status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED   0.0271              8             4                3             28
SolveStatus.SOLVED   0.0346              9             4                3             42
SolveStatus.SOLVED   0.0735              10            5                3             24
SolveStatus.SOLVED   0.1725              11            5                3             33
SolveStatus.SOLVED   386.7376            12            6                3             22
SolveStatus.UNKNOWN  900.1419            13            6                3             28
SolveStatus.UNKNOWN  900.0160            14            7                3             20
SolveStatus.UNKNOWN  900.0794            15            7                3             26

An ILP model

Next we’ll see an ILP model for the sample problem. Note there are two reasons I expect the ILP model to fall short. First, the best solver I have access to is SCIP, which, despite being quite good is, in my experience, about an order of magnitude slower than commercial alternatives like Gurobi. Second, I think this sort of problem seems to not be very well suited to ILPs. It would take quite a bit longer to explain why (maybe another post, if you’re interested), but in short well-formed ILPs have easily found feasible solutions (this one does not), and the LP-relaxation of the problem should be as tight as possible. I don’t think my formulation is very tight, but it’s possible there is a better formulation.

Anyway, the primary difference in my ILP model from brute force is that the number of choice sets is fixed in advance, and the members of the choice sets are model variables. This allows us to avoid enumerating all choice sets in the model.

In particular, \textup{Member}_{S,i} \in \{ 0, 1 \} is a binary variable that is 1 if and only if element i is assigned to be in set S. And \textup{IsHit}_{T, S} \in \{0, 1\} is 1 if and only if the hit set T is a subset of S. Here “S” is an index over the subsets, rather than the set itself, because we don’t know what elements are in S while building the model.

For the constraints, each choice set S must have size k:

\displaystyle \sum_{i \in X} \textup{Member}_{S, i} = k

Each hit set T must be hit by at least one choice set:

\displaystyle \sum_{S} \textup{IsHit}_{T, S} \geq 1

Now the tricky constraint. If a hit set T is hit by a specific choice set S (i.e., \textup{IsHit}_{T, S} = 1) then all the elements in T must also be members of S.

\displaystyle \sum_{i \in T} \textup{Member}_{S, i} \geq l \cdot \textup{IsHit}_{T, S}

This one works by the fact that the left-hand side (LHS) is bounded from below by 0 and bounded from above by l = |T|. Then \textup{IsHit}_{T, S} acts as a switch. If it is 0, then the constraint is vacuous since the LHS is always non-negative. If \textup{IsHit}_{T, S} = 1, then the right-hand side (RHS) is l = |T| and this forces all variables on the LHS to be 1 to achieve it.

Because we fixed the number of choice sets as a parameter, the objective is 1, and all we’re doing is looking for a feasible solution. The full code is here.

On the same simple example as the brute force

>>> SubsetCoverILP().solve(
  SubsetCoverParameters(
    num_elements=7,
    choice_set_size=3,
    hit_set_size=2,
    num_choice_sets=7))
[(0, 1, 3), (0, 2, 6), (0, 4, 5), (1, 2, 4), (1, 5, 6), (2, 3, 5), (3, 4, 6)]
SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.1065816879272461)

It finds the same solution in about 10x the runtime as the brute force Z3 model, though still well under one second.

On the “scaling” example, it fares much worse. With a timeout of 15 minutes, it solves n=8, decently fast, n=9,12 slowly, and times out on the rest.

status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED   1.9969              8             4                3             28
SolveStatus.SOLVED   306.4089            9             4                3             42
SolveStatus.UNKNOWN  899.8842            10            5                3             24
SolveStatus.UNKNOWN  899.4849            11            5                3             33
SolveStatus.SOLVED   406.9502            12            6                3             22
SolveStatus.UNKNOWN  902.7807            13            6                3             28
SolveStatus.UNKNOWN  900.0826            14            7                3             20
SolveStatus.UNKNOWN  900.0731            15            7                3             26

A Z3 Boolean Cardinality Model

The next model uses Z3. It keeps the concept of Member and Hit variables, but they are boolean instead of integer. It also replaces the linear constraints with implications. The constraint that forces a Hit set’s variable to be true when some Choice set contains all its elements is (for each S, T)

\displaystyle \left ( \bigwedge_{i \in T} \textup{Member}_{S, i} \right ) \implies \textup{IsHit}_T

Conversely, A Hit set’s variable being true implies its members are in some choice set.

\displaystyle \textup{IsHit}_T \implies \bigvee_{S} \bigwedge_{i \in T} \textup{Member}_{S, i}

Finally, we again use boolean cardinality constraints AtMost and AtLeast so that each choice set has the right size.

The results are much better than the ILP: it solves all of the instances in under 3 seconds

status              solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED  0.0874              8             4                3             28
SolveStatus.SOLVED  0.1861              9             4                3             42
SolveStatus.SOLVED  0.1393              10            5                3             24
SolveStatus.SOLVED  0.2845              11            5                3             33
SolveStatus.SOLVED  0.2032              12            6                3             22
SolveStatus.SOLVED  1.3661              13            6                3             28
SolveStatus.SOLVED  0.8639              14            7                3             20
SolveStatus.SOLVED  2.4877              15            7                3             26

A Z3 integer model

Z3 supports implications on integer equation equalities, so we can try a model that leverages this by essentially converting the boolean model to one where the variables are 0-1 integers, and the constraints are implications on equality of integer formulas (all of the form “variable = 1”).

I expect this to perform worse than the boolean model, even though the formulation is almost identical. The details of the model are here, and it’s so similar to the boolean model above that it needs no extra explanation.

The runtime is much worse, but surprisingly it still does better than the ILP model.

status              solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED  2.1129              8             4                3             28
SolveStatus.SOLVED  14.8728             9             4                3             42
SolveStatus.SOLVED  7.6247              10            5                3             24
SolveStatus.SOLVED  25.0607             11            5                3             33
SolveStatus.SOLVED  30.5626             12            6                3             22
SolveStatus.SOLVED  63.2780             13            6                3             28
SolveStatus.SOLVED  57.0777             14            7                3             20
SolveStatus.SOLVED  394.5060            15            7                3             26

Harder instances

So far all the instances we’ve been giving the solvers are “easy” in a sense. In particular, we’ve guaranteed there’s a feasible solution, and it’s easy to find. We’re giving roughly twice as many sets as are needed. There are two ways to make this problem harder. One is to test on unsatisfiable instances, which can be harder because the solver has to prove it can’t work. Another is to test on satisfiable instances that are hard to find, such as those satisfiable instances where the true optimal number of choice sets is given as the input parameter. The hardest unsatisfiable instances are also the ones where the number of choice sets allowed is one less than optimal.

Let’s test those situations. Since M(7, 3, 2) = 7, we can try with 7 choice sets and 6 choice sets.

For 7 choice sets (the optimal value), all the solvers do relatively well

method                    status  solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SubsetCoverILP            SOLVED  0.0843              7             3                2             7
SubsetCoverZ3Integer      SOLVED  0.0938              7             3                2             7
SubsetCoverZ3BruteForce   SOLVED  0.0197              7             3                2             7
SubsetCoverZ3Cardinality  SOLVED  0.0208              7             3                2             7

For 6, the ILP struggles to prove it’s infeasible, and the others do comparatively much better (at least 17x better).

method                    status      solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SubsetCoverILP            INFEASIBLE  120.8593            7             3                2             6
SubsetCoverZ3Integer      INFEASIBLE  3.0792              7             3                2             6
SubsetCoverZ3BruteForce   INFEASIBLE  0.3384              7             3                2             6
SubsetCoverZ3Cardinality  INFEASIBLE  7.5781              7             3                2             6

This seems like hard evidence that Z3 is better than ILPs for this problem (and it is), but note that the same test on n=8 fails for all models. They can all quickly prove 8 < M(8, 3, 2) \leq 11, but time out after twenty minutes when trying to determine if M(8, 3, 2) \in \{ 9, 10 \}. Note that k=3, l=2 is the least complex choice for the other parameters, so it seems like there’s not much hope to find M(n, k, l) for any seriously large parameters, like, say, k=6.

Thoughts

These experiments suggest what SMT solvers can offer above and beyond ILP solvers. Disjunctions and implications are notoriously hard to model in an ILP. You often need to add additional special variables, or do tricks like the one I did that only work in some situations and which can mess with the efficiency of the solver. With SMT, implications are trivial to model, and natively supported by the solver.

Aside from reading everything I could find on Z3, there seems to be little advice on modeling to help the solver run faster. There is a ton of literature for this in ILP solvers, but if any readers see obvious problems with my SMT models, please chime in! I’d love to hear from you. Even without that, I am pretty impressed by how fast the solves finish for this subset cover problem (which this experiment has shown me is apparently a very hard problem).

However, there’s an elephant in the room. These models are all satisfiability/feasibility checks on a given solution. What is not tested here is optimization, in the sense of having the number of choice sets used be minimized directly by the solver. In a few experiments on even simpler models, z3 optimization is quite slow. And while I know how I’d model the ILP version of the optimization problem, given that it’s quite slow to find a feasible instance when the optimal number of sets is given as a parameter, it seems unlikely that it will be fast when asked to optimize. I will have to try that another time to be sure.

Also, I’d like to test the ILP models on Gurobi, but I don’t have a personal license. There’s also the possibility that I can come up with a much better ILP formulation, say, with a tighter LP relaxation. But these will have to wait for another time.

In the end, this experiment has given me some more food for thought, and concrete first-hand experience, on the use of SMT solvers.

Searching for RH Counterexamples — Adding a Database

In the last article we set up pytest for a simple application that computes divisor sums \sigma(n) and tries to disprove the Riemann Hypothesis. In this post we’ll show how to extend the application as we add a database dependency. The database stores the computed sums so we can analyze them after our application finishes.

As in the previous post, I’ll link to specific git commits in the final code repository to show how the project evolves. You can browse or checkout the repository at each commit to see how it works.

Interface before implementation

The approach we’ll take is one that highlights the principle of good testing and good software design: separate components by thin interfaces so that the implementations of those interfaces can change later without needing to update lots of client code.

We’ll take this to the extreme by implementing and testing the logic for our application before we ever decide what sort of database we plan to use! In other words, the choice of database will be our last choice, making it inherently flexible to change. That is, first we iron out a minimal interface that our application needs, and then choose the right database based on those needs. This is useful because software engineers often don’t understand how the choice of a dependency (especially a database dependency) will work out long term, particularly as a prototype starts to scale and hit application-specific bottlenecks. Couple this with the industry’s trend of chasing hot new fads, and eventually you realize no choice is sacred. Interface separation is the software engineer’s only defense, and their most potent tool for flexibility. As a side note, Tom Gamon summarizes this attitude well in a recent article, borrowing the analogy from a 1975 investment essay The Winner’s Game by Charles Ellis. Some of his other articles reinforce the idea that important decisions should be made as late as possible, since that is the only time you know enough to make those decisions well.

Our application has two parts so far: adding new divisor sums to the database, and loading divisor sums for analysis. Since we’ll be adding to this database over time, it may also be prudent to summarize the contents of the database, e.g. to say what’s the largest computed integer. This suggests the following first-pass interface, implemented in this commit.

class DivisorDb(ABC):
    @abstractmethod
    def load() -> List[RiemannDivisorSum]:
        '''Load the entire database.'''
        pass

    @abstractmethod
    def upsert(data: List[RiemannDivisorSum]) -> None:
        '''Insert or update data.'''
        pass

    @abstractmethod
    def summarize() -> SummaryStats:
        '''Summarize the contents of the database.'''
        pass

RiemannDivisorSum and SummaryStats are dataclasses. These are special classes that are intended to have restricted behavior: storing data and providing simple derivations on that data. For us this provides a stabler interface because the contents of the return values can change over time without interrupting other code. For example, we might want to eventually store the set of divisors alongside their sum. Compare this to returning a list or tuple, which is brittle when used with things like tuple assignment.

The other interesting tidbit about the commit is the use of abstract base classes (“ABC”, an awful name choice). Python has limited support for declaring an “interface” as many other languages do. The pythonic convention was always to use its “duck-typing” feature, which meant to just call whatever methods you want on an object, and then any object that supports has those methods can be used in that spot. The mantra was, “if it walks like a duck and talks like a duck, then it’s a duck.” However, there was no way to say “a duck is any object that has a waddle and quack method, and those are the only allowed duck functions.” As a result, I often saw folks tie their code to one particular duck implementation. That said, there were some mildly cumbersome third party libraries that enabled interface declarations. Better, recent versions of Python introduced the abstract base class as a means to enforce interfaces, and structural subtyping (typing.Protocol) to interact with type hints when subtyping directly is not feasible (e.g., when the source is in different codebases).

Moving on, we can implement an in-memory database that can be used for testing. This is done in this commit. One crucial aspect of these tests is that they do not rely on the knowledge that the in-memory database is secretly a dictionary. That is, the tests use only the DivisorDb interface and never inspect the underlying dict. This allows the same tests to run against all implementations, e.g., using pytest.parameterize. Also note it’s not thread safe or atomic, but for us this doesn’t really matter.

Injecting the Interface

With our first-pass database interface and implementation, we can write the part of the application that populates the database with data. A simple serial algorithm that computes divisor sums in batches of 100k until the user hits Ctrl-C is done in this commit.

def populate_db(db: DivisorDb, batch_size: int = 100000) -> None:
    '''Populate the db in batches.'''
    starting_n = (db.summarize().largest_computed_n or 5040) + 1
    while True:
        ending_n = starting_n + batch_size
        db.upsert(compute_riemann_divisor_sums(starting_n, ending_n))
        starting_n = ending_n + 1

I only tested this code manually. The reason is that line 13 (highlighted in the abridged snippet above) is the only significant behavior not already covered by the InMemoryDivisorDb tests. (Of course, that line had a bug later fixed in this commit). I’m also expecting to change it soon, and spending time testing vs implementing features is a tradeoff that should not always fall on the side of testing.

Next let’s swap in a SQL database. We’ll add sqlite3, which comes prepackaged with python, so needs no dependency management. The implementation in this commit uses the same interface as the in-memory database, but the implementation is full of SQL queries. With this, we can upgrade our tests to run identically on both implementations. The commit looks large, but really I just indented all the existing tests, and added the pytest parameterize annotation to the class definition (and corresponding method arguments). This avoids adding a parameterize annotation to every individual test function—which wouldn’t be all that bad, but each new test would require the writer to remember to include the annotation, and this way systematically requires the extra method argument.

And finally, we can switch the database population script to use the SQL database implementation. This is done in this commit. Notice how simple it is, and how it doesn’t require any extra testing.

After running it a few times and getting a database with about 20 million rows, we can apply the simplest possible analysis: showing the top few witness values.

sqlite> select n, witness_value from RiemannDivisorSums where witness_value > 1.7 order by witness_value desc limit 100;
10080|1.7558143389253
55440|1.75124651488749
27720|1.74253672381383
7560|1.73991651920276
15120|1.73855867428903
110880|1.73484901030336
720720|1.73306535623807
1441440|1.72774021157846
166320|1.7269287425473
2162160|1.72557022852613
4324320|1.72354665986337
65520|1.71788900114772
3603600|1.71646721405987
332640|1.71609697536058
10810800|1.71607328780293
7207200|1.71577914933961
30240|1.71395368739173
20160|1.71381061514181
25200|1.71248203640096
83160|1.71210965310318
360360|1.71187211014506
277200|1.71124375582698
2882880|1.7106690212765
12252240|1.70971873843453
12600|1.70953565488377
8648640|1.70941081706371
32760|1.708296575835
221760|1.70824623791406
14414400|1.70288499724944
131040|1.70269370474016
554400|1.70259313608473
1081080|1.70080265951221

We can also confirm John’s claim that “the winners are all multiples of 2520,” as the best non-multiple-of-2520 up to 20 million is 18480, whose witness value is only about 1.69.

This multiple-of-2520 pattern is probably because 2520 is a highly composite number, i.e., it has more divisors than all smaller numbers, so its sum-of-divisors will tend to be large. Digging in a bit further, it seems the smallest counterexample, if it exists, is necessarily a superabundant number. Such numbers have a nice structure described here that suggests a search strategy better than trying every number.

Next time, we can introduce the concept of a search strategy as a new component to the application, and experiment with different search strategies. Other paths forward include building a front-end component, and deploying the system on a server so that the database can be populated continuously.

Searching for RH Counterexamples — Setting up Pytest

Some mathy-programmy people tell me they want to test their code, but struggle to get set up with a testing framework. I suspect it’s due to a mix of:

  • There are too many choices with a blank slate.
  • Making slightly wrong choices early on causes things to fail in unexpected ways.

I suspect the same concerns apply to general project organization and architecture. Because Python is popular for mathy-programmies, I’ll build a Python project that shows how I organize my projects and and test my code, and how that shapes the design and evolution of my software. I will use Python 3.8 and pytest, and you can find the final code on Github.

For this project, we’ll take advice from John Baez and explore a question that glibly aims to disprove the Riemann Hypothesis:

A CHALLENGE:

Let σ(n) be the sum of divisors of n. There are infinitely many n with σ(n)/(n ln(ln(n)) > 1.781. Can you find one? If you can find n > 5040 with σ(n)/(n ln(ln(n)) > 1.782, you’ll have disproved the Riemann Hypothesis.

I don’t expect you can disprove the Riemann Hypothesis this way, but I’d like to see numbers that make σ(n)/(n ln(ln(n)) big. It seems the winners are all multiples of 2520, so try those. The best one between 5040 and a million is n = 10080, which only gives 1.755814.

https://twitter.com/johncarlosbaez/status/1149700802371608576

Initializing the Project

One of the hardest parts of software is setting up your coding environment. If you use an integrated development environment (IDE), project setup is bespoke to each IDE. I dislike this approach, because what you learn when using the IDE is not useful outside the IDE. When I first learned to program (Java), I was shackled to Eclipse for years because I didn’t know how to compile and run Java programs without it. Instead, we’ll do everything from scratch, using only the terminal/shell and standard Python tools. I will also ignore random extra steps and minutiae I’ve built up over the years to deal with minor issues. If you’re interested in that and why I do them, leave a comment and I might follow up with a second article.

This article assumes you are familiar with the basics of Python syntax, and know how to open a terminal and enter basic commands (like ls, cd, mkdir, rm). Along the way, I will link to specific git commits that show the changes, so that you can see how the project unfolds with each twist and turn.

I’ll start by creating a fresh Python project that does nothing. We set up the base directory riemann-divisor-sum, initialize git, create a readme, and track it in git (git add + git commit).

mkdir riemann-divisor-sum
cd riemann-divisor-sum
git init .
echo "# Divisor Sums for the Riemann Hypothesis" > README.md
git add README.md
git commit -m "add empty README.md"

Next I create a Github project at https://github.com/j2kun/riemann-divisor-sum (the name riemann-divisor-sum does not need to be the same, but I think it’s good), and push the project up to Github.

git remote add origin git@github.com:j2kun/riemann-divisor-sum.git
# instead of "master", my default branch is really "main"
git push -u origin master   

Note, if you’re a new Github user, the “default branch name” when creating a new project may be “master.” I like “main” because it’s shorter, clearer, and nicer. If you want to change your default branch name, you can update to git version 2.28 and add the following to your ~/.gitconfig file.

[init]
    defaultBranch = main

Here is what the project looks like on Github as of this single commit.

Pytest

Next I’ll install the pytest library which will run our project’s tests. First I’ll show what a failing test looks like, by setting up a trivial program with an un-implemented function, and a corresponding test. For ultimate simplicity, we’ll use Python’s built-in assert for the test lines. Here’s the commit.

# in the terminal
mkdir riemann
mkdir tests


# create riemann/divisor.py containing:
'''Compute the sum of divisors of a number.'''

def divisor_sum(n: int) -> int:
    raise ValueError("Not implemented.")


# create tests/divisor_test.py containing:
from riemann.divisor import divisor_sum

def test_sum_of_divisors_of_72():
    assert 195 == divisor_sum(72)

Next we install and configure Pytest. At this point, since we’re introducing a dependency, we need a project-specific place to store that dependency. All dependencies related to a project should be explicitly declared and isolated. This page helps explain why. Python’s standard tool is the virtual environment. When you “activate” the virtual environment, it temporarily (for the duration of the shell session or until you run deactivate) points all Python tools and libraries to the virtual environment.

virtualenv -p python3.8 venv
source venv/bin/activate

# shows the location of the overridden python binary path
which python
# outputs: /Users/jeremy/riemann-divisor-sum/venv/bin/python

Now we can use pip as normal and it will install to venv. To declare and isolate the dependency, we write the output of pip freeze to a file called requirements.txt, and it can be reinstalled using pip install -r requirements.txt. Try deleting your venv directory, recreating it, and reinstalling the dependencies this way.

pip install pytest
pip freeze > requirements.txt
git add requirements.txt
git commit -m "requirements: add pytest"

# example to wipe and reinstall
# deactivate
# rm -rf venv
# virtualenv -p python3.8 venv
# source venv/bin/activate
# pip install -r requirements.txt

As an aside, at this step you may notice git mentions venv is an untracked directory. You can ignore this, or add venv to a .gitignore file to tell git to ignore it, as in this commit. We will also have to configure pytest to ignore venv shortly.

When we run pytest (with no arguments) from the base directory, we see our first error:

    from riemann.divisor import divisor_sum
E   ModuleNotFoundError: No module named 'riemann'

Module import issues are a common stumbling block for new Python users. In order to make a directory into a Python module, it needs an __init__.py file, even if it’s empty. Any code in this file will be run the first time the module is imported in a Python runtime. We add one to both the code and test directories in this commit.

When we run pytest (with no arguments), it recursively searches the directory tree looking for files like *_test.py and test_*.py loads them, and treats every method inside those files that are prefixed with “test” as a test. Non-“test” methods can be defined and used as helpers to set up complex tests. Pytest then runs the tests, and reports the failures. For me this looks like

Our first test failure.

Our implementation is intentionally wrong for demonstration purposes. When a test passes, pytest will report it quietly as a “.” by default. See these docs for more info on different ways to run the pytest binary and configure its output report.

In this basic pytest setup, you can put test files wherever you want, name the files and test methods appropriately, and use assert to implement the tests themselves. As long as your modules are set up properly, as long as imports are absolute (see this page for gory details on absolute vs. relative imports), and as long as you run pytest from the base directory, pytest will find the tests and run them.

Since pytest searches all directories for tests, this includes venv and __pycache__, which magically appears when you create python modules (I add __pycache__ to gitignore). Sometimes package developers will include test code, and pytest will then run those tests, which often fail or clutter the output. A virtual environment also gets large as you install big dependencies (like numpy, scipy, pandas), so this makes pytest slow to search for tests to run. To alleviate, the --norecursedirs command line flag tells pytest to skip directories. Since it’s tedious to type --norecursedirs='venv __pycache__' every time you run pytest, you can make this the default behavior by storing the option in a configuration file recognized by pytest, such as setup.cfg. I did it in this commit.

Some other command line options that I use all the time:

  • pytest test/dir to test only files in that directory, or pytest test/dir/test_file.py to test only tests in that file.
  • pytest -k STR to only run tests whose name contains “STR”
  • pytest -s to see see any logs or print statements inside tested code
  • pytest -s to allow the pdb/ipdb debugger to function and step through a failing test.

Building up the project

Now let’s build up the project. My general flow is as follows:

  1. Decide what work to do next.
  2. Sketch out the interface for that work.
  3. Write some basic (failing, usually lightweight) tests that will pass when the work is done.
  4. Do the work.
  5. Add more nuanced tests if needed, based on what is learned during the work.
  6. Repeat until the work is done.

This strategy is sometimes called “the design recipe,” and I first heard about it from my undergraduate programming professor John Clements at Cal Poly, via the book “How to Design Programs.” Even if I don’t always use it, I find it’s a useful mental framework for getting things done.

For this project, I want to search through positive integers, and for each one I want to compute a divisor sum, do some other arithmetic, and compare that against some other number. I suspect divisor sum computations will be the hard/interesting part, but to start I will code up a slow/naive implementation with some working tests, confirm my understanding of the end-to-end problem, and then improve the pieces as needed.

In this commit, I implement the naive divisor sum code and tests. Note the commit also shows how to tell pytest to test for a raised exception. In this commit I implement the main search routine and confirm John’s claim about n=10080 (thanks for the test case!).

These tests already showcase a few testing best practices:

  • Test only one behavior at a time. Each test has exactly one assertion in it. This is good practice because when a test fails you won’t have to dig around to figure out exactly what went wrong.
  • Use the tests to help you define the interface, and then only test against that interface. The hard part about writing clean and clear software is defining clean and clear interfaces that work together well and hide details. Math does this very well, because definitions like \sigma(n) do not depend on how n is represented. In fact, math really doesn’t have “representations” of its objects—or more precisely, switching representations is basically free, so we don’t dwell on it. In software, we have to choose excruciatingly detailed representations for everything, and so we rely on the software to hide those details as much as possible. The easiest way to tell if you did it well is to try to use the interface and only the interface, and tests are an excuse to do that, which is not wasted effort by virtue of being run to check your work.

Improving Efficiency

Next, I want to confirm John’s claim that n=10080 is the best example between 5041 and a million. However, my existing code is too slow. Running the tests added in this commit seems to take forever.

We profile to confirm our suspected hotspot:

>>> import cProfile
>>> from riemann.counterexample_search import best_witness
>>> cProfile.run('best_witness(10000)')
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
54826    3.669    0.000    3.669    0.000 divisor.py:10(<genexpr>)

As expected, computing divisor sums is the bottleneck. No surprise there because it makes the search take quadratic time. Before changing the implementation, I want to add a few more tests. I copied data for the first 50 integers from OEIS and used pytest’s parameterize feature since the test bodies are all the same. This commit does it.

Now I can work on improving the runtime of the divisor sum computation step. Originally, I thought I’d have to compute the prime factorization to use this trick that exploits the multiplicativity of \sigma(n), but then I found this approach due to Euler in 1751 that provides a recursive formula for the sum and skips the prime factorization. Since we’re searching over all integers, this allows us to trade off the runtime of each \sigma(n) computation against the storage cost of past \sigma(n) computations. I tried it in this commit, using python’s built-in LRU-cache wrapper to memoize the computation. The nice thing about this is that our tests are already there, and the interface for divisor_sum doesn’t change. This is on purpose, so that the caller of divisor_sum (in this case tests, also client code in real life) need not update when we improve the implementation. I also ran into a couple of stumbling blocks implementing the algorithm (I swapped the order of the if statements here), and the tests made it clear I messed up.

However, there are two major problems with that implementation.

  1. The code is still too slow. best_witness(100000) takes about 50 seconds to run, almost all of which is in divisor_sum.
  2. Python hits its recursion depth limit, and so the client code needs to eagerly populate the divisor_sum cache, which is violates encapsulation. The caller should not know anything about the implementation, nor need to act in a specific way to accommodate hidden implementation details.

I also realized after implementing it that despite the extra storage space, the runtime is still O(n^{3/2}), because each divisor-sum call requires O(n^{1/2}) iterations of the loop. This is just as slow as a naive loop that checks divisibility of integers up to \sqrt{n}. Also, a naive loop allows me to plug in a cool project called numba that automatically speeds up simple Python code by compiling it in place. Incidentally, numba is known to not work with lru_cache, so I can’t tack it on my existing implementation.

So I added numba as a dependency and drastically simplified the implementation. Now the tests run in 8 seconds, and in a few minutes I can upgrade John’s claim that n=10080 is the best example between 5041 and a million, to the best example between 5041 and ten million.

Next up

This should get you started with a solid pytest setup for your own project, but there is a lot more to say about how to organize and run tests, what kinds of tests to write, and how that all changes as your project evolves.

For this project, we now know that the divisor-sum computation is the bottleneck. We also know that the interesting parts of this project are yet to come. We want to explore the patterns in what makes these numbers large. One way we could go about this is to split the project into two components: one that builds/manages a database of divisor sums, and another that analyzes the divisor sums in various ways. The next article will show how the database set up works. When we identify relevant patterns, we can modify the search strategy to optimize for that. As far as testing goes, this would prompt us to have an interface layer between the two systems, and to add fakes or mocks to test the components in isolation.

After that, there’s the process of automating test running, adding tests for code quality/style, computing code coverage, adding a type-hint checker test, writing tests that generate other tests, etc.

If you’re interested, let me know which topics to continue with. I do feel a bit silly putting so much pomp and circumstance around such a simple computation, but hopefully the simplicity of the core logic makes the design and testing aspects of the project clearer and easier to understand.