Regression and Linear Combinations

Recently I’ve been helping out with a linear algebra course organized by Tai-Danae Bradley and Jack Hidary, and one of the questions that came up a few times was, “why should programmers care about the concept of a linear combination?”

For those who don’t know, given vectors $v_1, \dots, v_n$, a linear combination of the vectors is a choice of some coefficients $a_i$ with which to weight the vectors in a sum $v = \sum_{i=1}^n a_i v_i$.

I must admit, math books do a poor job of painting the concept as more than theoretical—perhaps linear combinations are only needed for proofs, while the real meat is in matrix multiplication and cross products. But no, linear combinations truly lie at the heart of many practical applications.

In some cases, the entire goal of an algorithm is to find a “useful” linear combination of a set of vectors. The vectors are the building blocks (often a vector space or subspace basis), and the set of linear combinations are the legal ways to combine the blocks. Simpler blocks admit easier and more efficient algorithms, but their linear combinations are less expressive. Hence, a tradeoff.

A concrete example is regression. Most people think of regression in terms of linear regression. You’re looking for a linear function like $y = mx+b$ that approximates some data well. For multiple variables, you have, e.g., $\mathbf{x} = (x_1, x_2, x_3)$ as a vector of input variables, and $\mathbf{w} = (w_1, w_2, w_3)$ as a vector of weights, and the function is $y = \mathbf{w}^T \mathbf{x} + b$.

To avoid the shift by $b$ (which makes the function affine instead of purely linear; formulas of purely linear functions are easier to work with because the shift is like a pesky special case you have to constantly account for), authors often add a fake input variable $x_0$ which is always fixed to 1, and relabel $b$ as $w_0$ to get $y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i$ as the final form. The optimization problem to solve becomes the following, where your data set to approximate is $\{ \mathbf{x}_1, \dots, \mathbf{x}_k \}$.

$\displaystyle \min_w \sum_{i=1}^k (y_i - \mathbf{w}^T \mathbf{x}_i)^2$

In this case, the function being learned—the output of the regression—doesn’t look like a linear combination. Technically it is, just not in an interesting way.

It becomes more obviously related to linear combinations when you try to model non-linearity. The idea is to define a class of functions called basis functions $B = \{ f_1, \dots, f_m \mid f_i: \mathbb{R}^n \to \mathbb{R} \}$, and allow your approximation to be any linear combination of functions in $B$, i.e., any function in the span of B.

$\displaystyle \hat{f}(\mathbf{x}) = \sum_{i=1}^m w_i f_i(\mathbf{x})$

Again, instead of weighting each coordinate of the input vector with a $w_i$, we’re weighting each basis function’s contribution (when given the whole input vector) to the output. If the basis functions were to output a single coordinate ($f_i(\mathbf{x}) = x_i$), we would be back to linear regression.

Then the optimization problem is to choose the weights to minimize the error of the approximation.

$\displaystyle \min_w \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2$

As an example, let’s say that we wanted to do regression with a basis of quadratic polynomials. Our basis for three input variables might look like

$\displaystyle \{ 1, x_1, x_2, x_3, x_1x_2, x_1x_3, x_2x_3, x_1^2, x_2^2, x_3^2 \}$

Any quadratic polynomial in three variables can be written as a linear combination of these basis functions. Also note that if we treat this as the basis of a vector space, then a vector is a tuple of 10 numbers—the ten coefficients in the polynomial. It’s the same as $\mathbb{R}^{10}$, just with a different interpretation of what the vector’s entries mean. With that, we can see how we would compute dot products, projections, and other nice things, though they may not have quite the same geometric sensibility.

These are not the usual basis functions used for polynomial regression in practice (see the note at the end of this article), but we can already do some damage in writing regression algorithms.

A simple stochastic gradient descent

Although there is a closed form solution to many regression problems (including the quadratic regression problem, though with a slight twist), gradient descent is a simple enough solution to showcase how an optimization solver can find a useful linear combination. This code will be written in Python 3.9. It’s on Github.

First we start with some helpful type aliases

from typing import Callable, Tuple, List

Input = Tuple[float, float, float]
Coefficients = List[float]
Gradient = List[float]
Hypothesis = Callable[[Input], float]
Dataset = List[Tuple[Input, float]]


Then define a simple wrapper class for our basis functions

class QuadraticBasisPolynomials:
def __init__(self):
self.basis_functions = [
lambda x: 1,
lambda x: x[0],
lambda x: x[1],
lambda x: x[2],
lambda x: x[0] * x[1],
lambda x: x[0] * x[2],
lambda x: x[1] * x[2],
lambda x: x[0] * x[0],
lambda x: x[1] * x[1],
lambda x: x[2] * x[2],
]

def __getitem__(self, index):
return self.basis_functions[index]

def __len__(self):
return len(self.basis_functions)

def linear_combination(self, weights: Coefficients) -> Hypothesis:
def combined_function(x: Input) -> float:
return sum(
w * f(x)
for (w, f) in zip(weights, self.basis_functions)
)

return combined_function

basis = QuadraticBasisPolynomials()


The linear_combination function returns a function that computes the weighted sum of the basis functions. Now we can define the error on a dataset, as well as for a single point

def total_error(weights: Coefficients, data: Dataset) -> float:
hypothesis = basis.linear_combination(weights)
return sum(
(actual_output - hypothesis(example)) ** 2
for (example, actual_output) in data
)

def single_point_error(
weights: Coefficients, point: Tuple[Input, float]) -> float:
return point[1] - basis.linear_combination(weights)(point[0])


We can then define the gradient of the error function with respect to the weights and a single data point. Recall, the error function is defined as

$\displaystyle E(\mathbf{w}) = \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2$

where $\hat{f}$ is a linear combination of basis functions

$\hat{f}(\mathbf{x}_j) = \sum_{s=1}^n w_s f_s(\mathbf{x}_j)$

Since we’ll do stochastic gradient descent, the error formula is a bit simpler. We compute it not for the whole data set but only a single random point at a time. So the error is

$\displaystyle E(\mathbf{w}) = (y_j - \hat{f}(\mathbf{x}_j))^2$

Then we compute the gradient with respect to the individual entries of $\mathbf{w}$, using the chain rule and noting that the only term of the linear combination that has a nonzero contribution to the gradient for $\frac{\partial E}{\partial w_i}$ is the term containing $w_i$. This is one of the major benefits of using linear combinations: the gradient computation is easy.

$\displaystyle \frac{\partial E}{\partial w_i} = -2 (y_j - \hat{f}(\mathbf{x}_j)) \frac{\partial \hat{f}}{\partial w_i}(\mathbf{x}_j) = -2 (y_j - \hat{f}(\mathbf{x}_j)) f_i(\mathbf{x}_j)$

Another advantage to being linear is that this formula is agnostic to the content of the underlying basis functions. This will hold so long as the weights don’t show up in the formula for the basis functions. As an exercise: try changing the implementation to use radial basis functions around each data point. (see the note at the end for why this would be problematic in real life)

def gradient(weights: Coefficients, data_point: Tuple[Input, float]) -> Gradient:
error = single_point_error(weights, data_point)
dE_dw = [0] * len(weights)

for i, w in enumerate(weights):
dE_dw[i] = -2 * error * basis[i](data_point[0])

return dE_dw


Finally, the gradient descent core with a debugging helper.

import random

def print_debug_info(step, grad_norm, error, progress):
print(f"{step}, {progress:.4f}, {error:.4f}, {grad_norm:.4f}")

def gradient_descent(
data: Dataset,
learning_rate: float,
tolerance: float,
training_callback = None,
) -> Hypothesis:
weights = [random.random() * 2 - 1 for i in range(len(basis))]
last_error = total_error(weights, data)
step = 0
progress = tolerance * 2
grad_norm = 1.0

if training_callback:
training_callback(step, 0.0, last_error, 0.0)

while abs(progress) > tolerance or grad_norm > tolerance:
grad = gradient(weights, random.choice(data))
grad_norm = sum(x**2 for x in grad)

for i in range(len(weights)):
weights[i] -= learning_rate * grad[i]

error = total_error(weights, data)
progress = error - last_error
last_error = error
step += 1

if training_callback:
training_callback(step, grad_norm, error, progress)

return basis.linear_combination(weights)


Next create some sample data and run the optimization

def example_quadratic_data(num_points: int):
def fn(x, y, z):
return 2 - 4*x*y + z + z**2

data = []
for i in range(num_points):
x, y, z = random.random(), random.random(), random.random()
data.append(((x, y, z), fn(x, y, z)))

return data

if __name__ == "__main__":
data = example_quadratic_data(30)
gradient_descent(
data,
learning_rate=0.01,
tolerance=1e-06,
training_callback=print_debug_info
)


Depending on the randomness, it may take a few thousand steps, but it typically converges to an error of < 1. Here’s the plot of error against gradient descent steps.

Kernels and Regularization

I’ll finish with explanations of the parentheticals above.

The real polynomial kernel. We chose a simple set of polynomial functions. This is closely related to the concept of a “kernel”, but the “real” polynomial kernel uses slightly different basis functions. It scales some of the basis functions by $\sqrt{2}$. This is OK because a linear combination can compensate by using coefficients that are appropriately divided by $\sqrt{2}$. But why would one want to do this? The answer boils down to a computational efficiency technique called the “Kernel trick.” In short, it allows you to compute the dot product between two linear combinations of vectors in this vector space without explicitly representing the vectors in the space to begin with. If your regression algorithm uses only dot products in its code (as is true of the closed form solution for regression), you get the benefits of nonlinear feature modeling without the cost of computing the features directly. There’s a lot more mathematical theory to discuss here (cf. Reproducing Kernel Hilbert Space) but I’ll have to leave it there for now.

What’s wrong with the radial basis function exercise? This exercise asked you to create a family of basis functions, one for each data point. The problem here is that having so many basis functions makes the linear combination space too expressive. The optimization will overfit the data. It’s like a lookup table: there’s one entry dedicated to each data point. New data points not in the training would be rarely handled well, since they aren’t in the “lookup table” the optimization algorithm found. To get around this, in practice one would add an extra term to the error corresponding to the L1 or L2 norm of the weight vector. This allows one to ensure that the total size of the weights is small, and in the L1 case that usually corresponds to most weights being zero, and only a few weights (the most important) being nonzero. The process of penalizing the “magnitude” of the linear combination is called regularization.

Searching for RH Counterexamples — Productionizing

We’re ironically searching for counterexamples to the Riemann Hypothesis.

In the last article we rearchitected the application so that we could run as many search instances as we want in parallel, and speed up the application by throwing more compute resources at the problem.

This is good, but comes with a cost. The complexity of this new architecture requires us to manage many different containers and AWS instances. Let’s take a step back. In this article, we’ll focus on improving the “production worthiness” of the application. In particular, we will:

• Automate running tests on every pull request
• Add some extra error handling code
• Clean up stale blocks when worker jobs fail
• Automate checking Python type hints and test coverage
• Add static analysis checks
• Add alerting to tell us when jobs fail
• Automate the process of updating the application with new code

Automating test running

The main benefit of writing tests is that you can run them. It’s even better when the tests are run automatically on a pull request. It guards buggy changes from breaking the main code branch.

There are many systems that work with GitHub to automate running tests. For this project I used CircleCI, which has a nice free tier. “CI” stands for continuous integration, which is an idea that if you guard your main branch well enough, you can ensure that the main branch is always in a state that can be released or deployed to production servers. And if this is the case, then you might as well have the computer manage regular releases for you (provided all tests pass).

Since we don’t yet have a way to automate releases, we’ll start by running tests on every pull request. This pull request was my first failed attempt and configuring CircleCI based on their tutorial, and this commit fixed it (which was really 15 attempts squashed into one commit post hoc). Originally I thought I could use CircleCI’s built in python testing jobs, but since this project requires a compiled component—the gmp library for unbounded integer arithmetic, and the associated Postgres extension—we need to manually install some other packages. Thankfully, CircleCI uses containers, so the configuration looks a lot like a Dockerfile. Unfortunately, CircleCI’s containers use a different Ubuntu operating system for their base image, which means the packages I had to install have different names than I expected.

So after an hour or two of trial and error, now it works. When you open a new pull request you see the tests run.

And when the tests are finished and passing you see green (otherwise red).

I must admit, most of these tools (see Coveralls below) are primarily designed for Python projects without compiled components. The defaults all tend to fail and you fall back to running arbitrary shell scripts. That’s one good reason to get familiar with standard linux package managers.

Better error handling

A few times the block processing job ran into some errors, and the container died. I don’t have a means to alert when the jobs fail, so I manually checked on them every once in a while. One error was an overflow error as described in this issue. While I might want to dig in a bit more to solve that problem, the effect was that the job died and the block was left in the IN_PROGRESS state. I want errors to result in marking the block as failed, engaging the retry loop, to exercise the possibility that the issue is somehow transient or specific to one broken worker job.

It’s relatively simple, and this pull request does the trick. This commit is the main part.

I added a nice test that demonstrates (again) the benefit of having interfaces. To test this behavior, we need to inject a fake error into the search_strategy.process_block function. One way to do that is to use mocks, and “monkey patch” the process_block method to operate differently during that test. In my opinion, this is bad design, because those tests can quickly become stale, and monkey patching results in odd behavior when you don’t set up the patch just right. Instead, writing a test-specific implementation of the search strategy interface is a safer approach. This commit does that.

Once that was working, I turned to the overflow error, which I fixed in this pull request. Turns out, the integers involved in the computation are so large that their logarithms don’t fit in a single floating point number. So I had to switch to using the log function from the gmp library, which outputs a multi-precision floating point number.

Handle stale blocks

Due to the jobs failing in the previous section, and as an extra safety measure against further failures—including when a job is killed in order to relaunch it with new code—I decided to add an extra job devoted to looking for stale blocks (which have been in_progress for more than an hour) and marking them as failed. This issue describes the idea. Once deployed, the current production system will automatically patch up its past failures. If only it were so easy for we humans. This pull request adds the new job, a new Dockerfile, and an end-to-end test. I deployed the job on the same server as the generator job, since both jobs spend most of their time sleeping.

Type checker tests

So far in the project, I’ve been adding type hints to most of the functions. Type hints are optional in Python, but if you use them, you can add an automated test to ensure that your code matches what the type hints say. This helps you have more confidence that your code is correct. There’s a lot more to say on this subject (software engineers often argue about the value of type checking), but I’ll refrain from providing my opinion for now, assume it’s worthwhile, and show what it’s like to use it.

For this project I’ll use mypy. Since I haven’t been running a type checker on this project yet, I expect the first run will find a lot of errors. When I run mypy riemann/*.py it reports 29 errors. Notably, these two errors suggest problems with our dependencies.

riemann/superabundant.py:6: error: Cannot find implementation or library stub for module named 'gmpy2'
riemann/superabundant.py:7: error: Cannot find implementation or library stub for module named 'numba'


The problem here is that if a dependency doesn’t use type hints, then the type checker won’t know what to do with functions from the dependent module. The options are to create a “stub” for those types, or to tell the type checker to ignore them. Stubbing is supposed to be easy, but for project that have compiled dependencies it can be considerably harder. For now I went with ignoring missing stubs. This pull request does the extra work to run the type checker in the automated test suite, and fix some minor type errors (one of which spotted a real bug).

Test Coverage

Test coverage (or “code coverage”) tools report which lines of your program are exercised by some test. Like type hints, requiring test coverage has advantages and disadvantages. Briefly, I feel code coverage is a good baseline understanding of what’s tested, but coverage does not imply adequate testing, nor does lack of coverage imply poor testing. Rather, it’s a rough guide that can show you glaring holes in what you thought to write tests for. There’s more to say, but not today.

Pytest has plugins for coverage measuring, and when you install them you can just run pytest --cov --cov-report html:cov_html to generate html pages that show line-by-line and file-by-file coverage stats. Mine looked like this.

A few obstacles, again coming from the fact that the project has compiled parts. In particular, Numba does not provide coverage support yet. So to get coverage statistics, you have to disable the jit compiler during the test execution. This simply requires setting the environment variable NUMBA_DISABLE_JIT=1. Disabling this made some of the tests run extremely slowly, but some minor changes restored a reasonable runtime (2 minutes, mostly end-to-end tests).

The other problem was the end to end tests. You need to add a special hook to allow the Python coverage tool to see the lines of code executed by subprocesses, which is used heavily by the end to end tests. This pull request shows the steps to overcoming both of these problems.

Also in that pull request is the configuration of Coveralls. This allows code coverage to be computed during the continuous integration test run, and shown on the pull request before merging, with the option to “fail” the code coverage test if the changes in the pull request reduce code coverage too much.

The configuration together between CircleCI and Coveralls required setting the Coveralls project token as an environment variable during CircleCI workflows, and running coveralls after the pytest command finished. Now pull requests look like this.

Static Analysis

Static analysis refers to tools that look at your code and point out problems or suggest improvements. In Python there are scant few good static analysis tools, in part because Python is designed to be very dynamic. It’s not possible for an automated system to be 100% sure what a piece of code does, because even importing a module can do weird things like change the behavior of arithmetic operations. Static analysis plummets in value as false positives increase, but being conservative in making recommendations also limits the value.

Static analysis also refers to the ability to do autocomplete, or reformatting source code to match a particular style, as well as type checking. But I just want to focus on semantic code improvements, not “prettifying” code. A simple example is dead code, which is unused by the application and can be safely deleted. Another is SQL injection vulnerabilities.

For this project I will configure lgtm.com, which is a nice general-purpose static analysis framework that hooks into any GitHub repository without configuration and supports a few important languages. Here is an example run, which resulted in 11 alerts like this one

And it’s right! I didn’t need that entire statement because the variable it defined was unused. This one was low-stakes, but another alert pointed to a more serious and subtle bug, related to how the SuperabundantSearchStrategy class mutates its internal state. I can clean that up later, but for now let’s marvel at how the static analysis tool caught the bug with no work on my part. Fantastic!

This pull request addresses the 11 alerts generated by LGTM, most of which are unused imports. LGTM is even nice enough to post comments showing the improvements by PR.

Badges

A little bit of flair: I added badges to the project README, so I could show off the fact that tests are passing and I have decent code coverage and code that passes static analysis checks.

Alerting on job failure

Next, I wanted to set up some rudimentary alerting so that I can know when my docker containers fail. If I don’t have this, I have to manually check up on them, and if they fail and I forget, I end up paying money to run the servers for nothing.

There are many approaches to do monitoring and alerting, many software packages, and many opinions. This series is supposed to show you the ideas and problems that software engineers care about, but not go overboard so that we still have time to do math. So I’ll implement what I consider the simplest solution, describe what it lacks, and briefly mention some other approaches.

We need two components: the ability to tell if a container has failed, and the ability to send an email. The former is easy, because the docker runtime provides a pleasant CLI.

docker ps -a --format="{{.Image}}" --filter="status=exited"


docker ps -a shows the status of all running containers, the --filter flag makes the command show only the containers that have exited, and the --format flag makes the command print a restricted subset of the information, in this case just the image name. This makes it suitable for providing as input to a program, since there is no special text added to the output just for human-readability.

The second part, sending an email, is more complicated because popular email services like Gmail have stringent rules. In short, Gmail blocks email from any unknown sender by default (because spam), and so in order to send an email successfully from a program, you need to jump through a bunch of hoops to authenticate the system to send on behalf of an account with a popular provider. You cant just use sendmail. Since I use Gmail, we’ll use it for authentication.

To authenticate a program to send from a Gmail account, you have to generate an “app password” for your Google account, and then configure the mail sender program to use it. The program I chose is the CLI ssmtp, and you can configure it to use your Google account and password via a configuration file typically stored at /etc/ssmtp/ssmtp.conf. The configuration file looks like this (with sensitive data censored with x’s)

root=xxxx@gmail.com
mailhub=smtp.gmail.com:465
FromLineOverride=YES
AuthUser=xxxx@gmail.com
AuthPass=xxxxxxxxxxxxx
TLS_CA_FILE=/etc/ssl/certs/ca-certificates.crt
UseTLS=Yes
rewriteDomain=gmail.com
hostname=xx.xx.xx.xx.xx


Once you have this, you can send email using a command-line invocation like

echo "Email message body goes here" | ssmtp recipient@gmail.com


However, this poses another small problem. While we could manually configure each of these config files on each of our servers, we would like to automate it eventually (next section), and automating it would seem to require us to store the sensitive information in a script that goes into our version-controlled (public!) code repository.

To deal with this, we need a simple way to manage secrets. The typical way to do this is via unix environment variables. In short, environment variables are variables and values you can store in the execution context of any script or terminal session. The most common example is $PATH, which stores a list of directories to search for programs in when you run a program. E.g., if you run docker ps, it will look to see if there’s a program called docker in /user/bin/, and if not try /usr/local/bin/, and so on until exhausting all entries in $PATH. We used environment variables to pass the postgres host ip to our docker containers, but it can also be used for passing secrets to programs to avoid storing passwords in text files that might become public accidentally.

You can set an environment variable using export VARIABLE="any value here". All values are strings. In our case, we will define two environment variables GMAIL_APP_USER—for the email address of the Google account to authenticate as—and GMAIL_APP_PASS for the app password. Then this bash script (run with sudo -E to ensure environment variables are passed from the calling context) will set up the configuration in one go, and this python script will orchestrate the loop of checking docker ps -a and sending an email if it sees any failed containers, passing the password via an environment variable as well. An improvement I discovered later allows us to also avoid storing the password in the ssmtp configuration file, and instead pass it directly from the environment to the python script.

Note this uses the python-dotenv library to easily import environment variables stored in a .env file. We must be careful not to check the .env file into version control, so we update the .gitignore and add an .env.template as a hint to show what environment variables are needed in .env.

Finally, though the deploy script is not quite useful as a standalone script (the docker containers live on different machines), it’s still useful to remind us how to manually deploy. So after exporting the right environment variables we have a few simple new steps to follow to run the monitoring script.

sudo apt install -y python3-pip ssmtp
pip3 install -r alerts/requirements.txt
sudo -E alerts/configure_ssmtp.sh
nohup python3 -m alerts.monitor_docker &


The last line may be new. nohup tells the operating system to “disconnect” the program from the shell that launched it (it’s actually a bit more specific, but has this effect), and the & has the program run in the background. This effectively allows us to launch the monitoring job in an ssh session, and then close the ssh session without terminating the program.

That raises an interesting point: how can we be sure that the monitoring program doesn’t fail? Who watches the watcher? For us, nobody. There is always more engineering you can do if you want a higher level of safety and security, but for us the returns diminish quickly. Eventually we will check the program manually and notice if it broke.

Automating deployment

The final step of our “productionization” process will be to set up some mechanism to deploy our application when we make updates. Like monitoring and alerting, there are quite a few frameworks out there. The main one for us to consider is called a container orchestration system, and the biggest one for Docker is called Kubernetes. If you read through the Kubernetes homepage, you’ll see all kinds of benefits that align with what we want: automatic deployment and container management, including monitoring, restarting jobs, and changing compute resource limits like RAM limit and CPU. These are all things which have taken up our time while building this project! AWS has support for Kubernetes, as well as their own homegrown solutions.

The difficulty around these topics is that you can spend as much (or more) of your time learning the details of the framework than you would making a functional-but-suboptimal version yourself. Such frameworks can also extra layers of frustration because of the additional jargon and learning curve, the mess of configuration, and the tendency for corporate products to change and force you to change in turn. Docker was already a decent investment, it’s asking a lot to invest more!

If I had all the time in the world, I’d go with Kubernetes. But in the next article I want to get back to doing some math and visualization with the data we’ve collected. So let’s pretend we can’t use a framework, and write a basic script that will automate our deployment for us. That is, the script will ssh into the EC2 instances that run each piece of the application, run git pull on each, run docker stop on the containers and remove them, build new container images from the new code, launch those images, and do it all in the right dependency order. The approach won’t be super resilient or have all the engineering ribbons and bows, but in the worst case we can manually fix any problems.

This script does just that. It’s simple enough for starters, and didn’t take me more than an hour to get working. We’ll leave Kubernetes for future work.

That said, in the process of building and testing the deployment script I hit a disappointing problem: docker was deleting the database when I removed the container. Or rather, it started over and I couldn’t find where the data was stored on the host machine. This appears to be because I was letting the Postgres base Dockerfile define a “volume,” which wasn’t saved when I removed the built Docker image (docker prune). So I read up about docker volumes and created a named volume and attached it to the docker container at startup time. The volume won’t disappear when the container stops now, and I can automatically back up the volume (coming soon), so that further blunders don’t result in losing all the data again.

Next time we’ll get back to the mathematical aspects and look for patterns in the data we’ve found so far. I, for one, am optimistic we’ll find the secrets we need to disprove RH.

Searching for RH Counterexamples — Scaling Up

We’re ironically searching for counterexamples to the Riemann Hypothesis.

Last time we made the audacious choice to remove primary keys from the RiemannDivisorSums table for performance reasons. To help with that, we will do two things in this post

• Reduce the storage footprint of the whole application (it was 60 GiB when it crashed, and we got up to 84 prime factors).
• Refactor the application into a worker architecture.

The first one is straightforward, but uses a possibly new idea to the mathematician, and the second is a relatively major rearchitecture.

Hashing a search block

Instead of storing every single Riemann divisor sum, we can just store the ones that are interesting, that is, the ones that are larger than some threshold—in our case, 1.767, which had just under a thousand examples in the last post.

That raises a problem. We want to be able to say that we checked all examples up to some limit. We’ve been saying “up to all numbers having at most K prime factors” because that’s how the superabundant search strategy enumerates the candidates. If we don’t store all the witness values, how can we prove to a skeptical observer that we checked what we said we checked?

One approach is to instead store a “summary string” derived from the full set of witness values. This “summary string” should be deterministically reproducible, so that if someone has the metadata about the block of numbers, they can recompute the summary and compare it with our summary.

A simple example summary might be the string-concatenation of all the witness values one after another, like 1.4323,1.5678,0.8792. But that still takes up a lot of space (each block had 250,000 different witness values). It would be better if we could find a summary that had a fixed size, regardless of how big the input is. Taking a cue from computer science, a hash function does exactly that. Hash functions aren’t perfect, they can have collisions, but the collisions are unlikely enough that, if you tested a hundred random blocks and all your hashes agreed with all of my hashes, you would be confident I checked everything the same as you. That gives you confidence that the rest of my blocks (say, a million of them) were computed correctly.

The hash function we’ll use is called sha256, and we add it in this pull request. Now the database application shouldn’t need a ton of disk space to run. (We’ll confirm this later in the article)

For more background and related topics around hash functions, see Load Balancing and the Power of Hashing and Hashing to Estimate the Size of a Stream.

Worker architecture

Next we want to rearchitect the application to enable two things: better protection against the possibility of duplicating/skipping divisor sum computations, and allowing us to scale up the rate of computation by adding additional computers to the mix. This pull request achieves the goal, but note that it’s a large change, so let’s cover the broad ideas, then zoom in on a few details, and finally discuss the PR’s structure as a whole from the perspective of software engineering.

We start by adding a “state” field to a search block. The state can be one of NOT_STARTED, IN_PROGRESS, FINISHED, FAILED. Then one worker job (read, container) will be responsible for computing new blocks in the NOT_STARTED state, and a group of workers jobs will continually “claim” an eligible block, compute the divisor sums, and then mark it as finished and insert the interesting witness values it found.

This suggests a drastic change to the database interface, as shown in this commit, to align with the new “claim/finish” workflow. We replace the existing functions with insert_search_blocks, claim_next_search_block, and finish_search_block. The first two operate only on the SearchMetadata table, while the third couples together the changes to both tables, by marking a block as finished and inserting the relevant witness values. We’ll see how we ensure these two updates occur atomically in a moment. Also, because of the new coupling of the two tables, I merged the two database interfaces into one.

We need two safety measures to make this work. First, we need to ensure that we don’t accidentally generate multiple duplicate search blocks. We do this by adding a uniqueness constraint on the bounds of a search block at the database level. If a client calls insert_search_blocks with something that’s already in the database, the database itself will reject it, and the client will be forced to catch the failure and recover.

Second, we need to ensure that two workers don’t claim the same search block. This involves a bit of database wizardry. I had to do a bit of research to make it work, and it took a few attempts. The gist of the problem is that if you want to claim a search block, you need to do two queries. First, you need to look up the oldest eligible search block that you want to claim. Then, you need to mark it as claimed. But if you have multiple workers accessing the database simultaneously, you can get this order of operations

1. Worker 1 runs lookup query, gets search block 7
2. Worker 2 runs lookup query, also gets search block 7
3. Worker 1 runs update query, marks search block 7 as claimed
4. Worker 2 runs update query, also marks search block 7 as claimed.

This can even happen if there’s only “one” query, and the lookup step is a subquery. After reading a bit about PostgreSQL’s transaction isolation guarantees, I learned that you can “lock” the rows read by the subquery until the update step is complete by adding FOR UPDATE to the subquery. This commit has the full query, but it looks something like this

UPDATE SearchMetadata
SET state = 'IN_PROGRESS'
FROM (
SELECT starting_search_index
FROM SearchMetadata
WHERE state = 'NOT_STARTED'
ORDER BY creation_time ASC
LIMIT 1
FOR UPDATE         <---- key clause!
) as m
WHERE
starting_search_index = m.starting_search_index;


I also added some special tests (that use the finished worker jobs) that will specifically fail when this FOR UPDATE clause is removed, to confirm it fixes the behavior. This is particularly important because it’s hard to test database queries that protect against race conditions. Plus, I tried other approaches and got it wrong, which proved the value of having such tests.

Finally, the finish_search_block function needs to ensure that the update of the SearchMetadata and RiemannDivisorSums tables happen atomically. This is done in this commit by leveraging the concept of a database transaction. In a transaction, all of the queries between BEGIN and COMMIT happen at once or not at all, e.g., in the case that a later query fails. The “not at all” case is called a rollback.

Thankfully, the psycopg2 library we’re using uses transactions by default, and requires us to call connection.commit to make any query persist. So we are already using transactions and get the guarantee without extra work.

Next, we needed to update the SearchStrategy interface to separate the task of generating new search blocks and processing blocks, since these will be done by separate workers. This commit updates the interface, this commit the tests, and this commit the implementation.

Finally, once all the rearchitecting and re-testing is done, we ditched our old populate_database entry script, and replaced it with a worker script for generating new search blocks, and one for claiming and processing blocks. These have the same general structure: infinitely loop, use the database and search strategy interfaces as expected, and then exponentially back off when failures occur, eventually quitting if there are too many repeated failures. In the case of the generator worker, we also have configuration around how many new blocks to create, what minimum threshold to trigger doing work, and a delay between checks (since the workers will be much slower than the generator).

Ship it!

Now we’re ready to launch it, and this time since we have improved our memory issues, we can use an AWS instance with smaller memory requirements, but launch four of them to allow one for each container. After a few of the following steps, this worked out just fine:

• Configure an AWS security group (which limits which IP addresses EC2 instances can communicate with) so that Postgres communication was allowed between all the launched instances. This was made easy by the fact that you can configure a security group once so that it allows communication to and from any other instances with the same security group (in addition to my laptop’s IP).
• Keep track of the ip addresses of each of the instances I launched, and make sure that the instance with a 60 GiB disk is the instance I put the database on.
• Manually install docker and launch the right container on each one, and point their PGHOST variables to the address of the database instance.

I will improve this workflow in the next article, but for now I am letting it run for a few days and observing that it works. The generator job adds 100 new search blocks any time there are less than 100 eligible blocks, and this query shows they are getting worked on properly.

divisor=# select state, count(*) from searchmetadata group by state;
state    | count
-------------+-------
NOT_STARTED |   129
FINISHED    |   367
IN_PROGRESS |     4


As I was stopping and restarting jobs, two search blocks got stuck in the “in_progress” state, as you can see from the lagging-behind start time.

divisor=# select start_time, starting_search_index, ending_search_index from searchmetadata where state='IN_PROGRESS' order by start_time asc;
start_time         | starting_search_index | ending_search_index
----------------------------+-----------------------+---------------------
2021-02-09 04:46:13.616085 | 64,433492             | 64,683491    <-- stuck!
2021-02-09 04:48:08.554847 | 65,191862             | 65,441861    <-- stuck!
2021-02-09 14:22:08.331754 | 78,9803652            | 78,10053651
2021-02-09 14:22:44.36657  | 78,10053652           | 78,10303651
(4 rows)


This suggests I should create a new job for cleaning up “stale” IN_PROGRESS blocks and marking them as failed. I’ll do this in the next article, and it won’t hurt us to leave some blocks in the transient state, because once the cleanup job is running it will make those blocks eligible to be worked on again, and the worker’s claim the earliest block first.

I also noticed an issue where I forgot a commit statement, which didn’t show up in testing, and I can’t quite figure out how to set up a test that requires this commit statement to be present. Nevertheless, after I noticed the problem, I patched it, stopped the containers, ran git pull on each instance, rebuilt the containers, and started them again, and it continued along as if nothing was wrong.

Since the search application is running well now for long periods of time, I’ll let it run for a week or two (or until something breaks), and return to it in the next article to see what bigger witness values we find. In the mean time, I’d like to discuss how I organized all of the changes that went into that massive pull request.

How to organize a big pull request

The pull request to change the application architecture is quite substantial, and I had a lot of cleanup I wanted to do along the way.

The basic pattern of the PR, which was repeated a few times throughout, was to split each unit of change into three parts: update the interface, update the tests that use the interface, and update the implementation. As an example, one commit updates the SearchStrategy interface, one the tests, and one the implementation.

Doing it this way helps in a few ways. First, I think a lot about how the interface should work, and I prove that it feels right by writing tests first. That way, I don’t get bogged down in the implementation details when thinking about how it will be used. Likewise, when doing this for the database change, after updating the interface and tests, I first updated the InMemoryDatabase implementation, because it was simpler, and having this made me confident the interface could be implemented decently, and allowed me to tweak the interface early, before turning to the harder problem of figuring out PostgreSQL FOR UPDATE stuff.

Much like the proof of a mathematical theorem, the final commits hide the work that went into this, and commit messages, code comments, and PR description provide a deeper explanation. Typically what I do is make changes, and then use git’s staging area concept to pull out different sub-changes within a file into a single commit. This blog post about “using git well” covers this in more detail. But this, and git’s rebase and amend features, allowed me to edit older commits so that the final pull request lays out the commits in a logical order.

It also allows me to split off the cleanup parts as best as possible, such as this commit to rename SearchState to SearchIndex (after I added the new “state” field). I made the commit, had some places I missed and fixed later, extracted them into their own commit, and rebased and squashed them together to unify the change. Keeping each commit as close to an atomic thought as possible (“add this test” or “implement this method”) enables easy squashing and reordering.

Though nobody is reviewing my code now (except you, dear reader, after the fact), a primary benefit of all this cleanup is that the code is much easier to review. The pull request as a whole is a feature, each commit is a smaller, but comprehensible piece of that bigger feature, and the commits are laid out in such a way that if you browse the commits in order, you get a clear picture of how the whole feature was assembled.

If you find yourself doing mathematical work in the context of software, you should expect to spend a lot of time reviewing code and having your code reviewed. The extra effort you put into making that process easier pays off by producing better feedback, uncovering more bugs, and reducing total review time. Though it might be considered a soft skill, smooth code reviews are a critical part of a well-functioning software organization.

Searching for RH Counterexamples — Performance Profiling

We’re ironically searching for counterexamples to the Riemann Hypothesis.

In the last article we ran into some performance issues with our deployed docker application. In this article we’ll dig in to see what happened, fix the problem, run into another problem, fix it, and run the search until we rule out RH witness-value-based counterexamples among all numbers with fewer than 85 prime factors.

When debugging any issue, there are some straightforward steps to follow.

1. Gather information to determine where the problem is.
2. Narrow down the source of the problem.
3. Reproduce the problem in an isolated environment (on your local machine or a separate EC2 instance).
4. Fix the problem.

So far what I know is that the search ran for days on EC2 with docker, and didn’t make significantly more progress than when I ran it on my local machine for a few hours.

Gathering information

Gathering information first is an important step that’s easy to skip when you think you know the problem. In my experience, spending an extra 5-10 minutes just looking around can save hours of work you might spend fixing the wrong problem.

So we know the application is slow. But there are many different kinds of slow. We could be using up all the CPU, running low on free RAM, throttled by network speeds, and more. To narrow down which of these might be the problem, we can look at the server’s resource usage.

The CPU utilization EC2 dashboard below shows that after just a few hours of running the application, the average CPU utilization drops to 5-10% and stays at that level. Running docker stats shows that the search container has over 5 GiB of unused RAM. df -h shows that the server has more than 60% of its disk space free.

The fact that the CPU spikes like this suggests that the program is doing its work, just with gaps in between the real work, and some sort of waiting consuming the rest of the time.

Another angle we can look at is the timestamps recorded in the SearchMetadata table.

divisor=# select * from searchmetadata order by start_time desc limit 9;
start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state
----------------------------+----------------------------+-------------------------------+-----------------------+---------------------
2021-01-01 15:40:42.374536 | 2021-01-01 16:15:34.838774 | SuperabundantEnumerationIndex | 71,1696047            | 71,1946047
2021-01-01 15:06:13.947216 | 2021-01-01 15:40:42.313078 | SuperabundantEnumerationIndex | 71,1446047            | 71,1696047
2021-01-01 14:32:14.692185 | 2021-01-01 15:06:13.880209 | SuperabundantEnumerationIndex | 71,1196047            | 71,1446047
2021-01-01 13:57:39.725843 | 2021-01-01 14:32:14.635433 | SuperabundantEnumerationIndex | 71,946047             | 71,1196047
2021-01-01 13:25:53.376243 | 2021-01-01 13:57:39.615891 | SuperabundantEnumerationIndex | 71,696047             | 71,946047
2021-01-01 12:59:14.58666  | 2021-01-01 13:25:53.331857 | SuperabundantEnumerationIndex | 71,446047             | 71,696047
2021-01-01 12:43:08.503441 | 2021-01-01 12:59:14.541995 | SuperabundantEnumerationIndex | 71,196047             | 71,446047
2021-01-01 12:27:49.698012 | 2021-01-01 12:43:08.450301 | SuperabundantEnumerationIndex | 70,4034015            | 71,196047
2021-01-01 12:14:44.970486 | 2021-01-01 12:27:49.625687 | SuperabundantEnumerationIndex | 70,3784015            | 70,4034015


As you can see, computing a single block of divisor sums takes over a half hour in many cases! This is nice, because we can isolate the computation of a single block on our local machine and time it.

Narrowing Down

Generally, since we ran this application on a local machine just fine for longer than we ran it in docker on EC2, I would think the culprit is related to the new thing that was introduced: docker and/or EC2. My local machine is a Mac, and the EC2 machine was Ubuntu linux, so there could be a difference there. It’s also strange that the system only started to slow down after about two hours, instead of being slow the whole time. That suggests there is something wrong with scaling, i.e., what happens as the application starts to grow in its resource usage.

Just to rule out the possibility that it’s a problem with computing and storing large blocks, let’s re-test the block with starting index 71,196047 and ending index 71,446047, which took approximated 16 minutes on EC2. These three lines are between the start/end timestamps in the table. Only the second line does substantive work.

start_state = search_strategy.search_state()
db.upsert(search_strategy.next_batch(batch_size))
end_state = search_strategy.search_state()


First we’ll just run the next_batch method, to remove the database upsert from consideration. We’ll also do it in a vanilla Python script, to remove the possibility that docker is introducing inefficiency, which is the most likely culprit, since docker is the new part from the last article. This commit has the timing test and the result shows that on average the block takes two minutes on my local machine.

Running the same code in a docker container on EC2 has the same result, which I verified by copying the divisorsearch.Dockerfile to a new dockerfile and replacing the entrypoint command with

ENTRYPOINT ["python3", "-u", "-m", "timing.ec2_timing_test"]


Then running (on a fresh EC2 instance with docker installed and the repo cloned)

docker build -t timing -f timing.Dockerfile .
docker run -dit --name timing --memory="15G" --env PGHOST="\$PGHOST" timing:latest
docker logs -f timing


Once it finishes, I see it takes on average 40 seconds to compute the block. So any prior belief that the divisor computation might be the bottleneck are now gone.

Now let’s try the call to db.upsert, which builds the query and sends it over the network to the database container. Testing that in isolation—by computing the block first and then timing the upsert with this commit —shows it takes about 8 seconds to update an empty database with this block.

Both of these experiments were run on EC2 in docker, so we’ve narrowed down the problem enough to say that it only happens when the system has been running for that two hour window. My best guess is something to do with having a large database, or something to do with container-to-container networking transfer rates (the docker overhead). The next step in narrowing down the problem is to set up the scenario and run performance profiling tools on the system in situ.

I spin back up an r5.large. Then I run the deploy script and wait for two hours. It started slowing down after about 1h45m of runtime. At this point I stopped the divisorsearch container and ran the timing test container. And indeed, the upsert step has a strangely slow pattern (the numbers are in seconds):

Running sample 0
55.63926291465759
Running sample 1
61.28182792663574
Running sample 2
245.36470413208008  # 4 minutes
Running sample 3
1683.663686990738   # 28 minutes


At this point I killed the timing test. The initial few writes, while not great, are not as concerning as the massive degradation in performance over four writes. At this point I’m convinced that the database is the problem. So I read a lot about Postgres performance tuning (e.g. these docs and this blog post), which I had never done before.

It appears that the way Postgres inserts work is by writing to a “write cache,” and then later a background process copies the content of the write cache to disk. This makes sense because disk writes are slow. But if the write cache is full, then Postgres will wait until the write cache gets emptied before it can continue. This seems likely here: the first few writes are fast, and then when the cache fills up later writes take eons.

I also learned that the PRIMARY KEY part of our Postgres schema incurs a penalty to write performance, because in order to enforce the uniqueness Postgres needs to maintain a data structure and update it on every write. I can imagine that, because an mpz (unbounded integer) is our primary key, that data structure might be unprepared to handle such large keys. This might explain why these writes are taking so long in the first place, given that each write itself is relatively small (~2 MiB). In our case we can probably get by without a primary key on this table, and instead put any uniqueness constraints needed on the search metadata table.

So let’s change one of these and see if it fixes the problem. First we can drop the primary key constraint, which we can do in situ by running the query

alter table riemanndivisorsums drop constraint divisor_sum_pk;


This query takes a long time, which seems like a good sign. Perhaps it’s deleting a lot of data. Once that’s done, I removed the ON CONFLICT clause from the upsert command (it becomes just an insert command), and then rebuild and re-run the timing container. This shows that each insert takes only 3-4 seconds, which is much more reasonable.

Running sample 0
3.8325071334838867
Running sample 1
3.7897982597351074
Running sample 2
3.7978150844573975
Running sample 3
3.810023784637451
Running sample 4
3.8057897090911865


I am still curious about the possible fix by increasing the cache size, but my guess is that without the primary key change, increasing the cache size would just delay the application from slowing instead of fixing the problem permanently, so for now I will not touch it. Perhaps a reader with more experience with Postgres tuning would know better.

Updating the application

So with this new understanding, how should we update the application? Will anything go wrong if we delete the primary key on the RiemannDivisorSums table?

As suggested by our need to remove the ON CONFLICT clause, we might end up with duplicate rows. That is low risk, but scarier is if we stop the search mid-block and restart it, then if we’re really unlucky, we might get into a state where we skip a block and don’t notice, and that block contains the unique counterexample to the Riemann Hypothesis! We simply can’t let that happen. This is far too important. To be fair, this is a risk even before we remove the primary key. I’m just realizing it now.

We’ll rearchitect the system to mitigate that in a future post, and it will double as providing a means to scale horizontally. For now, this pull request removes the primary key constraint and does a bit of superficial cleanup. Now let’s re-run it!

Out of RAM again

This time, it ran for three hours before the divisorsearch container hit the 15 GiB RAM limit and crashed. Restarting the container and watching docker stats showed that it plain ran out of RAM. This is a much easier problem to solve because it involves only one container, and the virtual machine doesn’t slow to a crawl when it occurs, the container just stops running.

A quick python memory profile reveals the top 10 memory usages by line, the top being search_strategy.py:119 clocking in at about 2.2 GiB.

[ Top 10 ]
riemann/search_strategy.py:119: size=2245 MiB, count=9394253, average=251 B
venv/lib/python3.7/site-packages/llvmlite/ir/values.py:224: size=29.6 MiB, count=9224, average=3367 B
<string>:2: size=26.7 MiB, count=499846, average=56 B
riemann/superabundant.py:67: size=19.1 MiB, count=500003, average=40 B
riemann/superabundant.py:69: size=15.3 MiB, count=250000, average=64 B
riemann/superabundant.py:70: size=13.4 MiB, count=250000, average=56 B
venv/lib/python3.7/site-packages/llvmlite/ir/_utils.py:48: size=3491 KiB, count=6411, average=558 B
venv/lib/python3.7/site-packages/llvmlite/ir/values.py:215: size=2150 KiB, count=19267, average=114 B
riemann/search_strategy.py:109: size=2066 KiB, count=1, average=2066 KiB
venv/lib/python3.7/site-packages/llvmlite/ir/_utils.py:58: size=1135 KiB, count=2018, average=576 B


This line, not surprisingly, is the computation of the full list of partitions of $n$. The length of this list grows superpolynomially in $n$, which explains why it takes up all the RAM.

Since we only need to look at at most batch_size different partitions of $n$ at a given time, we can probably fix this by adding a layer of indirection so that new sections of the partition list are generated on the fly and old sections are forgotten once the search strategy is done with them.

This pull request does that, and running the memory test again shows the 2.5 GiB line above reduced to 0.5 GiB. The pull request description also has some thoughts about the compute/memory tradeoff being made by this choice, and thoughts about how to improve the compute side.

So let’s run it again!! This time it runs for about 19 hours before crashing, as shown by the EC2 CPU usage metrics.

According to the SearchMetadata table, we got up to part way through $n=87$

divisor=# select * from searchmetadata order by start_time desc limit 3;
start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state
----------------------------+----------------------------+-------------------------------+-----------------------+---------------------
2021-01-27 00:00:11.59719  | 2021-01-27 00:01:14.709715 | SuperabundantEnumerationIndex | 87,15203332           | 87,15453332
2021-01-26 23:59:01.285508 | 2021-01-27 00:00:11.596075 | SuperabundantEnumerationIndex | 87,14953332           | 87,15203332
2021-01-26 23:57:59.809282 | 2021-01-26 23:59:01.284257 | SuperabundantEnumerationIndex | 87,14703332           | 87,14953332


Logging into the server and running df -h we can see the reason for the crash is that the disk is full. We filled up a 60 GiB SSD full of Riemann divisor sums. It’s quite an achievement. I’d like to thank my producers, the support staff, my mom, and the Academy.

But seriously, this is the best case scenario. Our limiting factor now is just how much we want to pay for disk space (and/or, any tricks we come up with to reduce disk space).

Analyzing the results, it seems we currently have 949 examples of numbers with witness values bigger than 1.767. Here are the top few:

                                                                                       n                                                                                        |                                                                                   divisor_sum                                                                                   |   witness_value
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------
533187564151227457465199401229454876347036513892234205802944360099435118364718466037392872608220305945979716166395732328054742493039981726997486787797703088097204529280000    | 5634790045188254963919691913741193557789227457256740288527114259589826494137632464425981545755572664137227875215269764405574488353138596297944567958732800000000000000000000    | 1.7689901658397056
1392134632248635659178066321747923959130643639405311242305337754828812319490126543178571468950966856255821713228187290673772173611070448373361584302343872292681996160000      | 14673932409344413968540864358701024890076113169939427834706026717681839828483417876109326942071803812857364258373098344806183563419631761192563979059200000000000000000000      | 1.7688977455109272
3673178449204843427910465228886342900080853929829317262019360830682882109472629401526573796704398037614305311947723722094385682351109362462695473093255599716839040000         | 38615611603537931496160169365002697079147666236682704828173754520215367969693204937129807742294220560150958574666048275805746219525346739980431523840000000000000000000         | 1.7688303488154073
2784269264497271318356132643495847918261287278810622484610675509657624638980253086357142937901933712511643426456374581347544347222140896746723168604687744585363992320000      | 29355033324995298095346770663840105972086801871471400577978104254473441181064775868425839681379597759477726740614478613571725301516068423098949435392000000000000000000000      |  1.768798862584946
9847663402693950208875241900499578820592101688550448423644399009873678577674609655567221975078815114247467324256631962719532660458738237165403413118647720420480000            | 103250298405181635016471041082894911976330658386852151946988648449773711148912312666122480594369573690243204745096385764186487217982210534707036160000000000000000000           | 1.7687597437473672


The best so far achieves 1.76899. Recall out last best number achieved a witness value of 1.7679, a modest improvement if we hope to get to 1.82 (or even the supposedly infinitely many examples with witness value bigger than 1.81!).

What’s next?

We’re not stopping until we disprove the Riemann hypothesis. So strap in, this is going to be a long series.

Since the limiting factor in our search is currently storage space, I’d like to spend some time coming up with a means to avoid storing the witness value and divisor sum for every single number in our search strategy, but still maintain the ability to claim our result (no counterexamples below N) while providing the means for a skeptical person to verify our claims as much as they desire without also storing the entire search space on disk.

Once we free ourselves from disk space constraints, the second limiting factor is that our search uses only a single CPU at a time. We should rearchitect the system so that we can scale horizontally, meaning we can increase the search rate by using more computers. We can do this by turning our system into a “worker” model, in which there is a continually growing queue of search blocks to process, and each worker machine asks for the next block to compute, computes it, and then stores the result in the database. There are some tricks required to ensure the workers don’t step on each other’s toes, but it’s a pretty standard computational model.

There’s also the possibility of analyzing our existing database to try to identify what properties of the top $n$ values make their witness values so large. We could do this by re-computing the prime factorization of each of the numbers and squinting at it very hard. The benefit of doing this would ultimately be to design a new SearchStrategy that might find larger witness values faster than the superabundant enumeration. That approach also has the risk that, without a proof that the new search strategy is exhaustive, we might accidentally skip over the unique counterexample to the Riemann hypothesis!

And then, of course, there’s the business of a front-end and plotting. But I’m having a bit too much fun with the back end stuff to worry about that for now. Protest in the comments if you prefer I work on visualization instead.

Postscript: a false start

Here’s one thing I tried during debugging that turned out to be a dead end: the linux perf tool. It is way more detailed than I can understand, but it provides a means to let me know if the program is stuck in the OS kernel or whether it’s the applications that are slow. I ran the perf tool when the deployed application was in the “good performance” regime (early in its run). I saw something like this:

Samples: 59K of event 'cpu-clock:pppH', Event count (approx.): 119984000000
Overhead  Command          Shared Object                                Symbol
49.13%  swapper          [kernel.kallsyms]                            [k] native_safe_halt
8.04%  python3          libpython3.7m.so.1.0                         [.] _PyEval_EvalFrameDefault
1.35%  python3          libpython3.7m.so.1.0                         [.] _PyDict_LoadGlobal
0.86%  python3          libc-2.28.so                                 [.] realloc
0.86%  python3          libpython3.7m.so.1.0                         [.] 0x0000000000139736
0.77%  python3          libpython3.7m.so.1.0                         [.] PyTuple_New
0.74%  python3          libpython3.7m.so.1.0                         [.] PyType_IsSubtype



It says half the time is spent by something called “swapper” doing something called native_safe_halt, and this in kernel mode ([k]), i.e., run by the OS. The rest of the list is dominated by python3 and postgres doing stuff like _PyDict_LoadGlobal which I assume is useful Python work. When I look up native_safe_halt (finding an explanation is surprisingly hard), I learn that this indicates the system is doing nothing. So 49% of the time the CPU is idle. This fits with good behavior in our case, because each docker container gets one of the two CPUs on the host, and the postgres container is most often doing nothing while waiting for the search container to send it data. This also matches the CPU utilization on the EC2 dashboard. So everything appears in order.

I ran perf again after the application started slowing down, and I saw that native_safe_halt is at 95%. This tells me nothing new, sadly. I also tried running it during the timing test and saw about 40% usage of some symbols like __softirqentry_text_start and __lock_text_start. Google failed to help me understand what that was, so it was a dead end.