Searching for RH Counterexamples — Deploying with Docker

We’re ironically searching for counterexamples to the Riemann Hypothesis.

In this article we’ll deploy the application on a server, so that it can search for RH counterexamples even when I close my laptop.

Servers and containers

When deploying applications to servers, reproducibility is crucial. You don’t want your application to depend on the details of the computer it’s running on. This is a higher-level version of the same principle behind Python virtual environments, but it applies to collections of programs, possibly written in different languages and running on different computers. In our case, we have a postgres database, the pgmp extension, the populate_database program, and plans for a web server.

The principle of an application not depending on the system it’s running on is called hermeticity (the noun form of hermetic, meaning air-tight). Hermeticity is good for the following reasons. When a server crashes, you don’t have to remember what you did to set it up. Instead you run a build/install that works on any machine. Newcomers also don’t have to guess what unknown aspects of a running server are sensitive. Another benefit is that you can test it on your local machine identically to how it will run in production. It also allows you to easily migrate from one cloud provider to another, which allows you to defer expensive commitments until you have more information about your application’s needs. Finally, if you have multiple applications running on the same server, you don’t want to have their needs conflict with each other, which can happen easily if two applications have dependencies that transitively depend on different versions of the same software. This is called “dependency hell.” In all of these, you protect yourself from becoming dependent on arbitrary choices you made before you knew better.

One industry-strength approach to hermeticity is to use containers. A container is a virtual machine devoted to running a single program, with explicitly-defined exposure to the outside world. We will set up three containers: one to run the database, one for the search application, and (later) one for a web server. We’ll start by deploying them all on the same machine, but could also deploy them to different machines. Docker is a popular containerization system. Before going on, I should stress that Docker, while I like it, is not sacred by any means. In a decade Docker may disappear, but the principle of hermeticity and the need for reproducible deployments will persist.

Docker allows you to describe your container by first starting from an existing (trusted) container, such as one that has an operating system and postgres already installed, and extend it for your application. This includes installing dependencies, fetching the application code from git, copying files into the container from the host system, exposing the container’s network ports, and launching the application. You save the commands that accomplish that in a Dockerfile with some special syntax. To deploy it, you copy the Dockerfile to the server (say, via git) and run docker commands to launch the container. You only have to get the Dockerfile right once, you can test it locally, and then it will work on any server just the same. The only caveat I’ve seen here is that if you migrate to a server with a different processor architecture, the install script (in our case, pip install numba) may fail to find a pre-compiled binary for the target architecture, and it may fall back to compiling from source, which can add additional requirements or force you to change which OS your container is derived from.

This reduces our “set up a new server” script to just a few operations: (1) install docker (2) fetch the repository (3) launch the docker containers from their respective Dockerfiles. In my experience, writing a Dockerfile is no small task, but figuring out how to install stuff is awful in all cases, and doing it for Docker gives you an artifact tracing the steps, and a reasonable expectation of not having to do it again.

Thankfully, you dear readers can skip my head-banging and see the Dockerfiles after I figured it out.

The Postgres Dockerfile

This commit adds a Dockerfile for the database, and makes some small changes to the project to allow it to run. It has only 15 lines, but it took me a few hours to figure out. The process was similar to installing confusing software on your own machine: try to install, see some error like "missing postgres.h“, go hunt around on the internet to figure out what you have to install to get past the error, and repeat.

Let’s go through each line of the Dockerfile.

FROM postgres:12


The first line defines the container image that this container starts from, which is officially maintained by the Postgres team. Looking at their Dockerfile, it starts from debian:buster-slim, which is a Debian Linux instance that is “slimmed” down to be suitable for docker containers, meaning it has few packages pre-installed. Most importantly, “Debian” tells us what package manager to use (apt-get) in our Dockerfile.

It’s also worth noting at this point that, when docker builds the container, each command in a docker file results in a new image. An image is a serialized copy of all the data in a docker container, so that it can be started or extended easily. And if you change a line halfway through your Dockerfile, docker only has to rebuild images from that step onward. You can publish images on the web, and other docker users can use them as a base. This is like forking a project on Github, and is exactly what happens when Docker executes FROM postgres:12.

ENV POSTGRES_USER docker
ENV POSTGRES_DB divisor


These lines declare configuration for the database that the base postgres image will create when the container is started. The variable names are described in the “Environment Variables” section of the Postgres image’s documentation. The ENV command tells docker to instantiate environment variables (like the PATH variable in a terminal shell), that running programs can access. I’m insecurely showing the password and username here because the server the docker containers will run on won’t yet expose anything to the outside world. Later in this post you will see how to pass an environment variable from the docker command line when the container is run, and you would use something close to that to set configuration secrets securely.

RUN apt-get update \
&& apt-get install -y pgxnclient build-essential libgmp3-dev postgresql-server-dev-12 libmpc-dev


The RUN command allows you to run any shell command you’d like, in this case a command to update apt and install the dependencies needed to build the pgmp extension. This includes gcc and make via build-essential, and the gmp-specific libraries.

RUN apt-get install -y python3.7 python3-setuptools python3-pip python-pip python3.7-dev \
&& pip3 install wheel \
&& pip install six


Next we do something a bit strange. We install python3.7 and pip (because we will need to pip3 install our project’s requirements.txt), but also python2’s pip. Here’s what’s going on. The pgmp postgres extension needs to be built from source, and it has a dependency on python2.7 and the python2-six library. So the first RUN line here installs all the python-related tools we need.

RUN pgxn install pgmp


Then we install the pgmp extension.

COPY . /divisor
WORKDIR "/divisor"


These next two lines copy the current directory on the host machine to the container’s file system, and sets the working directory for all future commands to that directory. Note that whenever the contents of our project change, docker needs to rebuild the image from this step because any subsequent steps like pip install -r requirements.txt might have a different outcome.

RUN python3 -m pip install --upgrade pip
RUN pip3 install -r requirements.txt


Next we upgrade pip (which is oddly required for the numba dependency, though I can’t re-find the Github issue where I discovered this) and install the python dependencies for the project. The only reason this is required is because we included the database schema setup in the python script riemann/postgres_batabase.py. So this makes the container a bit more complicated than absolutely necessary. It can be improved later if need be.

ENV PGUSER=docker
ENV PGDATABASE=divisor


These next lines are environment variables used by the psycopg2 python library to infer how to connect to postgres if no database spec is passed in. It would be nice if this was shared with the postgres environment variables, but duplicating it is no problem.

COPY setup_schema.sh /docker-entrypoint-initdb.d/


The last line copies a script to a special directory specified by the base postgres Dockerfile. The base dockerfile specifies that any scripts in this directory will be run when the container is started up. In our case, we just call the (idempotent) command to create the database. In a normal container we might specify a command to run when the container is started (our search container, defined next, will do this), but the postgres base image handles this for us by starting the postgres database and exposing the right ports.

Finally we can build and run the container

docker build -t divisordb -f divisordb.Dockerfile .
#  ... lots of output ...

docker run -d -p 5432:5432 --name divisordb divisordb:latest


After the docker build command—which will take a while—you will be able to see the built images by running docker images, and the final image will have a special tag divisordb. The run command additionally tells docker to run the container as a daemon (a.k.a. in the background) with -d and to -p to publish port 5432 on the host machine and map it to 5432 on the container. This allows external programs and programs on other computers to talk to the container by hitting 0.0.0.0:5432. It also allows other containers to talk to this container, but as we’ll see shortly that requires a bit more work, because inside a container 0.0.0.0 means the container, not the host machine.

Finally, one can run the following code on the host machine to check that the database is accepting connections.

pg_isready --host 0.0.0.0 --username docker --port 5432 --dbname divisor


If you want to get into the database to run queries, you can run psql with the same flags as pg_isready, or manually enter the container with docker exec -it divisordb bash and run psql from there.

psql --host 0.0.0.0 --username docker --port 5432 --dbname divisor
divisor=# \d
List of relations
Schema |        Name        | Type  | Owner
--------+--------------------+-------+--------
public | riemanndivisorsums | table | docker
public | searchmetadata     | table | docker
(2 rows)


Look at that. You wanted to disprove the Riemann Hypothesis, and here you are running docker containers.

The Search Container

Next we’ll add a container for the main search application. Before we do this, it will help to make the main entry point to the program a little bit simpler. This commit modifies populate_database.py‘s main routine to use argparse and some sensible defaults. Now we can run the application with just python -m riemann.populate_database.

Then the Dockerfile for the search part is defined in this commit. I’ve copied it below. It’s much simpler than the database, but somehow took just as long for me to build as the database Dockerfile, because I originally chose a base image called “alpine” that is (unknown to me at the time) really bad for Python if your dependencies have compiled C code, like numba does.

FROM python:3.7-slim-buster

RUN apt-get update \
&& apt-get install -y build-essential libgmp3-dev libmpc-dev

COPY . /divisor
WORKDIR "/divisor"

RUN pip3 install -r requirements.txt

ENV PGUSER=docker
ENV PGDATABASE=divisor

ENTRYPOINT ["python3", "-m", "riemann.populate_database"]


The base image is again Debian, with Python3.7 pre-installed.

Then we can build it and (almost) run it

docker build -t divisorsearch -f divisorsearch.Dockerfile .
docker run -d --name divisorsearch --env PGHOST="$PGHOST" divisorsearch:latest  What’s missing here is the PGHOST environment variable, which psycopg2 uses to find the database. The problem is, inside the container “localhost” and 0.0.0.0 are interpreted by the operating system to mean the container itself, not the host machine. To get around this problem, docker maintains IP addresses for each docker container, and uses those to route network requests between containers. The docker inspect command exposes information about this. Here’s a sample of the output $ docker inspect divisordb
[
{
"Id": "f731a78bde50be3de1d77ae1cff6d23c7fe21d4dbe6a82b31332c3ef3f6bbbb4",
"Path": "docker-entrypoint.sh",
"Args": [
"postgres"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
...
},
...
"NetworkSettings": {
...
"Ports": {
"5432/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "5432"
}
]
},
...

The part that matters for us is the ip address, and the following extracts it to the environment variable PGHOST.