Searching for RH Counterexamples — Exploring Data

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search Strategies
  4. Unbounded integers
  5. Deploying with Docker
  6. Performance Profiling
  7. Scaling up
  8. Productionizing

In the last article we added a menagerie of “production readiness” features like continuous integration tooling (automating test running and static analysis), alerting, and a simple deployment automation. Then I let it loose on AWS, got extremely busy with buying a house, forgot about this program for a few weeks (no alerts means it worked flawlessly!), and then saw my AWS bill.

So I copied the database off AWS using pg_dump (piped to gzip), terminated the instances, and inspected the results. A copy of the database is here. You may need git-lfs to clone it. If I wanted to start it back up again, I could spin them back up, and use gunzip | psql to restore the database, and it would start back up from where it left off. A nice benefit of all the software engineering work done thus far.

This article will summarize some of the data, show plots, and try out some exploratory data analysis techniques.


We stopped the search mid-way through the set of numbers with 136 prime divisors.

The largest number processed was


Which in factored form is the product of these terms

  2^8   3^7   5^4   7^4  11^3  13^3  17^2  19^2  23^2  29^2
 31^2  37^2  41^2  43^1  47^1  53^1  59^1  61^1  67^1  71^1
 73^1  79^1  83^1  89^1  97^1 101^1 103^1 107^1 109^1 113^1
127^1 131^1 137^1 139^1 149^1 151^1 157^1 163^1 167^1 173^1
179^1 181^1 191^1 193^1 197^1 199^1 211^1 223^1 227^1 229^1
233^1 239^1 241^1 251^1 257^1 263^1 269^1 271^1 277^1 281^1
283^1 293^1 307^1 311^1 313^1 317^1 331^1 337^1 347^1 349^1
353^1 359^1 367^1 373^1 379^1 383^1 389^1 397^1 401^1 409^1
419^1 421^1 431^1 433^1 439^1 443^1 449^1 457^1 461^1 463^1
467^1 479^1 487^1 491^1 499^1 503^1 509^1 521^1 523^1 541^1
547^1 557^1 563^1 569^1 571^1 577^1

The best witness—the number with the largest witness value—was


which has witness value 1.7707954880001586, which is still significantly smaller than the needed 1.782 to disprove RH.

The factored form of the best witness is

 2^11   3^7   5^4   7^3  11^3  13^2  17^2  19^2  23^2  29^2 
 31^2  37^1  41^1  43^1  47^1  53^1  59^1  61^1  67^1  71^1 
 73^1  79^1  83^1  89^1  97^1 101^1 103^1 107^1 109^1 113^1 
127^1 131^1 137^1 139^1 149^1 151^1 157^1 163^1 167^1 173^1 
179^1 181^1 191^1 193^1 197^1 199^1 211^1 223^1 227^1 229^1 
233^1 239^1 241^1 251^1 257^1 263^1 269^1 271^1 277^1 281^1 
283^1 293^1 307^1 311^1 313^1 317^1 331^1 337^1 347^1 349^1 
353^1 359^1 367^1 373^1 379^1 383^1 389^1 397^1 401^1 409^1 
419^1 421^1 431^1 433^1 439^1 443^1 449^1 457^1 461^1 463^1 
467^1 479^1 487^1 491^1 499^1 503^1 509^1 521^1 523^1 541^1 
547^1 557^1 563^1 

The average search block took 4m15s to compute, while the max took 7m7s and the min took 36s.

The search ran for about 55 days (hiccups included), starting at 2021-03-05 05:47:53 and stopping at 2021-04-28 15:06:25. The total AWS bill—including development, and periods where the application was broken but the instances still running, and including instances I wasn’t using but forgot to turn off—was $380.25. When the application was running at its peak, the bill worked out to about $100/month, though I think I could get it much lower by deploying fewer instances, after we made the performance optimizations that reduced the need for resource-heavy instances. There is also the possibility of using something that integrates more tightly with AWS, such as serverless jobs for the cleanup, generate, and process worker jobs.


When in doubt, plot it out. I started by writing an export function to get the data into a simpler CSV, which for each n only stored \log(n) and the witness value.

I did this once for the final computation results. I’ll call this the “small” database because it only contains the largest witness value in each block. I did it again for an earlier version of the database before we introduced optimizations (I’ll call this the “large” database), which had all witness values for all superabundant numbers processed up to 80 prime factors.. The small database was only a few dozen megabytes in size, and the large database was ~40 GiB, so I had to use postgres cursors to avoid loading the large database into memory. Moreover, then generated CSV was about 8 GiB in size, and so it required a few extra steps to sort it, and get it into a format that could be plotted in a reasonable amount of time.

First, using GNU sort to sort the file by the first column, \log(n)

sort -t , -n -k 1 divisor_sums.csv -o divisor_sums_sorted.csv

Then, I needed to do some simple operations on massive CSV files, including computing a cumulative max, and filtering down to a subset of rows that are sufficient for plotting. After trying to use pandas and vaex, I realized that the old awk command line tool would be great at this job. So I wrote a simple awk script to process the data, and compute data used for the cumulative max witness value plots below.

Then finally we can use vaex to create two plots. The first is a heatmap of witness value counts. The second is a plot of the cumulative max witness value. For the large database:

Witness value heatmap for the large database
The cumulative maximum witness value for the large database.

And for the small database

A heatmap for the witness values for the small database
The cumulative maximum witness value for the small database.

Note, the two ridges disagree slightly (the large database shows a longer flat line than the small database for the same range), because of the way that the superabundant enumeration doesn’t go in increasing order of n. So larger witness values in the range 400-500 are found later.

Estimating the max witness value growth rate

The next obvious question is whether we can fit the curves above to provide an estimate of how far we might have to look to find the first witness value that exceeds the desired 1.782 threshold. Of course, this will obviously depend on the appropriateness of the underlying model.

A simple first guess would be split between two options: the real data is asymptotic like a + b / x approaching some number less than 1.782 (and hence this approach cannot disprove RH), or the real data grows slowly (perhaps doubly-logarithmic) like a + b \log \log x, but eventually surpasses 1.782 (and RH is false). For both cases, we can use scipy’s curve fitting routine as in this pull request.

For the large database (roughly using log n < 400 since that’s when the curve flatlines due to the enumeration order), we get a reciprocal fit of

\displaystyle f(x) \approx 1.77579122 - 2.72527824 / x

and a logarithmic fit of

\displaystyle f(x) \approx 1.65074314 + 0.06642373 \log(\log(x))

The fit of the large database to a + b/x. Note the asymptote of 1.7757 suggests this will not disprove RH.
The fit of the large database to a + b log log x. If this is accurate, we would find the counterexample around log(n) = 1359.

The estimated asymptote is around 1.7757 in the first case, and the second case estimates we’d find an RH counterexample at around log(n) = 1359.

For the small database of only sufficiently large witness values, this time going up to about log(n) \approx 575, the asymptotic approximation is

\displaystyle 1.77481154 -2.31226382 / x

And the logarithmic approximation is

\displaystyle 1.70825262 + 0.03390312 \log(\log(x))

The reciprocal approximation of the small database with asymptote 1.77481154
The logarithmic approximation of the small database with RH counterexample estimate at log(n) = 6663

Now the asymptote is slightly lower, at 1.7748, and the logarithmic model approximates the counterexample can be found at approximately \log(n) = 6663.

Both of the logarithmic approximations suggest that if we want to find an RH counterexample, we would need to look at numbers with thousands of prime factors. The first estimate puts a counterexample at about 2^{1960}, the second at 2^{9612}, so let’s say between 1k and 10k prime factors.

Luckily, we can actually jump forward in the superabundant enumeration to exactly the set of candidates with m prime factors. So it might make sense to jump ahead to, say, 5k prime factors and search in that region. However, the size of a level set of the superabundant enumeration still grows exponentially in m. Perhaps we should (heuristically) narrow down the search space by looking for patterns in the distribution of prime factors for the best witness values we’ve found thus far. Perhaps the values of n with the best witness values tend to have a certain concentration of prime factors.

Exploring prime factorizations

At first, my thought was to take the largest witness values, look at their prime factorizations, and try to see a pattern when compared to smaller witness values. However, other than the obvious fact that the larger witness values correspond to larger numbers (more and larger prime factors), I didn’t see an obvious pattern from squinting at plots.

To go in a completely different direction, I wanted to try out the UMAP software package, a very nice and mathematically sophisticated for high dimensional data visualization. It’s properly termed a dimensionality reduction technique, meaning it takes as input a high-dimensional set of data, and produces as output a low-dimensional embedding of that data that tries to maintain the same shape as the input, where “shape” is in the sense of a certain Riemannian metric inferred from the high dimensional data. If there is structure among the prime factorizations, then UMAP should plot a pretty picture, and perhaps that will suggest some clearer approach.

To apply this to the RH witness value dataset, we can take each pair (n, \sigma(n)/(n \log \log n)), and associate that with a new (high dimensional) data point corresponding to the witness value paired with the number’s prime factorization

\displaystyle (\sigma(n)/(n \log \log n), k_1, k_2, \dots, k_d),

where n = 2^{k_1} 3^{k_2} 5^{k_3} \dots p_d^{k_d}, with zero-exponents included so that all points have the same dimension. This pull request adds the ability to factorize and export the witness values to a CSV file as specified, and this pull request adds the CSV data (using git-lfs), along with the script to run UMAP, the resulting plots shown below, and the saved embeddings as .npy files (numpy arrays).

When we do nothing special to the data and run it through UMAP we see this plot.

UMAP plotted on the raw prime factorization and witness value dataset.

It looks cool, but if you stare at it for long enough (and if you zoom in when you generate the plot yourself in matplotlib) you can convince yourself that it’s not finding much useful structure. The red dots dominate (lower witness values) and the blue dots are kind of spread haphazardly throughout the red regions. The “ridges” along the chart are probably due to how the superabundant enumeration skips lots of numbers, and that’s why it thins out on one end: the thinning out corresponds to fewer numbers processed that are that large since the enumeration is not uniform.

It also seemed like there is too much data. The plot above has some 80k points on it. After filtering down to just those points whose witness values are bigger than 1.769, we get a more manageable plot.

Witness values and prime factors processed with UMAP, where the witness value is at least 1.769.

This is a bit more reasonable. You can see a stripe of blue dots going through the middle of the plot.

Before figuring out how that blue ridge might relate to the prime factor patterns, let’s take this a few steps further. Typically in machine learning contexts, it helps to normalize your data, i.e., to transform each input dimension into a standard Z-score with respect to the set of values seen in that dimension, subtracting the mean and dividing by the standard deviation. Since the witness values are so close to each other, they’re a good candidate for such normalization. Here’s what UMAP plots when we normalize the witness value column only.

UMAP applied to the (normalized) witness values and prime factorizations. Applied to all witness values.

Now this is a bit more interesting! Here the colormap on the right is in units of standard deviation of witness values. You can see a definite bluest region, and it appears that the data is organized into long brushstrokes, where the witness values increase as you move from one end of the stroke to the other. At worst, this suggests that the dataset has structure that a learning algorithm could discover.

Going even one step further, what if we normalize all the columns? Well, it’s not as interesting.

UMAP when normalizing all columns, not just the witness value.

If you zoom in, you can see that the same sort of “brushstroke” idea is occurring here too, with blue on one end and red on the other. It’s just harder to see.

The previous image, zoomed in around a cluster of data

We would like to study the prettiest picture and see if we can determine what pattern of prime numbers the blue region has, if any. The embedding files are stored on github, and I put up (one version of) the UMAP visualization as an interactive plot via this pull request.

I’ve been sitting on this draft for a while, and while this article didn’t make a ton of headway, the pictures will have to do while I’m still dealing with my new home purchase.

Some ideas for next steps:

  • Squint harder at the distributions of primes for the largest witness values in comparison to the rest.
  • See if a machine learning algorithm can regress witness values based on their prime factorizations (and any other useful features I can derive). Study the resulting hypothesis to determine which features are the most important. Use that to refine the search strategy.
  • Try searching randomly in the superabundant enumeration around 1k and 10k prime factors, and see if the best witness values found there match the log-log regression.
  • Since witness values above a given threshold seem to be quite common, and because the UMAP visualization shows some possible “locality” structure for larger witness values, it suggests if there is a counterexample to RH then there are probably many. So a local search method (e.g., local neighborhood search/discrete gradient ascent with random restarts) might allow us to get a better sense for whether we are on the right track.

Until next time!

Searching for RH Counterexamples — Scaling Up

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search Strategies
  4. Unbounded integers
  5. Deploying with Docker
  6. Performance Profiling

Last time we made the audacious choice to remove primary keys from the RiemannDivisorSums table for performance reasons. To help with that, we will do two things in this post

  • Reduce the storage footprint of the whole application (it was 60 GiB when it crashed, and we got up to 84 prime factors).
  • Refactor the application into a worker architecture.

The first one is straightforward, but uses a possibly new idea to the mathematician, and the second is a relatively major rearchitecture.

Hashing a search block

Instead of storing every single Riemann divisor sum, we can just store the ones that are interesting, that is, the ones that are larger than some threshold—in our case, 1.767, which had just under a thousand examples in the last post.

That raises a problem. We want to be able to say that we checked all examples up to some limit. We’ve been saying “up to all numbers having at most K prime factors” because that’s how the superabundant search strategy enumerates the candidates. If we don’t store all the witness values, how can we prove to a skeptical observer that we checked what we said we checked?

One approach is to instead store a “summary string” derived from the full set of witness values. This “summary string” should be deterministically reproducible, so that if someone has the metadata about the block of numbers, they can recompute the summary and compare it with our summary.

A simple example summary might be the string-concatenation of all the witness values one after another, like 1.4323,1.5678,0.8792. But that still takes up a lot of space (each block had 250,000 different witness values). It would be better if we could find a summary that had a fixed size, regardless of how big the input is. Taking a cue from computer science, a hash function does exactly that. Hash functions aren’t perfect, they can have collisions, but the collisions are unlikely enough that, if you tested a hundred random blocks and all your hashes agreed with all of my hashes, you would be confident I checked everything the same as you. That gives you confidence that the rest of my blocks (say, a million of them) were computed correctly.

The hash function we’ll use is called sha256, and we add it in this pull request. Now the database application shouldn’t need a ton of disk space to run. (We’ll confirm this later in the article)

For more background and related topics around hash functions, see Load Balancing and the Power of Hashing and Hashing to Estimate the Size of a Stream.

Worker architecture

Next we want to rearchitect the application to enable two things: better protection against the possibility of duplicating/skipping divisor sum computations, and allowing us to scale up the rate of computation by adding additional computers to the mix. This pull request achieves the goal, but note that it’s a large change, so let’s cover the broad ideas, then zoom in on a few details, and finally discuss the PR’s structure as a whole from the perspective of software engineering.

We start by adding a “state” field to a search block. The state can be one of NOT_STARTED, IN_PROGRESS, FINISHED, FAILED. Then one worker job (read, container) will be responsible for computing new blocks in the NOT_STARTED state, and a group of workers jobs will continually “claim” an eligible block, compute the divisor sums, and then mark it as finished and insert the interesting witness values it found.

This suggests a drastic change to the database interface, as shown in this commit, to align with the new “claim/finish” workflow. We replace the existing functions with insert_search_blocks, claim_next_search_block, and finish_search_block. The first two operate only on the SearchMetadata table, while the third couples together the changes to both tables, by marking a block as finished and inserting the relevant witness values. We’ll see how we ensure these two updates occur atomically in a moment. Also, because of the new coupling of the two tables, I merged the two database interfaces into one.

We need two safety measures to make this work. First, we need to ensure that we don’t accidentally generate multiple duplicate search blocks. We do this by adding a uniqueness constraint on the bounds of a search block at the database level. If a client calls insert_search_blocks with something that’s already in the database, the database itself will reject it, and the client will be forced to catch the failure and recover.

Second, we need to ensure that two workers don’t claim the same search block. This involves a bit of database wizardry. I had to do a bit of research to make it work, and it took a few attempts. The gist of the problem is that if you want to claim a search block, you need to do two queries. First, you need to look up the oldest eligible search block that you want to claim. Then, you need to mark it as claimed. But if you have multiple workers accessing the database simultaneously, you can get this order of operations

  1. Worker 1 runs lookup query, gets search block 7
  2. Worker 2 runs lookup query, also gets search block 7
  3. Worker 1 runs update query, marks search block 7 as claimed
  4. Worker 2 runs update query, also marks search block 7 as claimed.

This can even happen if there’s only “one” query, and the lookup step is a subquery. After reading a bit about PostgreSQL’s transaction isolation guarantees, I learned that you can “lock” the rows read by the subquery until the update step is complete by adding FOR UPDATE to the subquery. This commit has the full query, but it looks something like this

UPDATE SearchMetadata
  SELECT starting_search_index
  FROM SearchMetadata
  ORDER BY creation_time ASC
  FOR UPDATE         <---- key clause!
) as m
  starting_search_index = m.starting_search_index;

I also added some special tests (that use the finished worker jobs) that will specifically fail when this FOR UPDATE clause is removed, to confirm it fixes the behavior. This is particularly important because it’s hard to test database queries that protect against race conditions. Plus, I tried other approaches and got it wrong, which proved the value of having such tests.

Finally, the finish_search_block function needs to ensure that the update of the SearchMetadata and RiemannDivisorSums tables happen atomically. This is done in this commit by leveraging the concept of a database transaction. In a transaction, all of the queries between BEGIN and COMMIT happen at once or not at all, e.g., in the case that a later query fails. The “not at all” case is called a rollback.

Thankfully, the psycopg2 library we’re using uses transactions by default, and requires us to call connection.commit to make any query persist. So we are already using transactions and get the guarantee without extra work.

Next, we needed to update the SearchStrategy interface to separate the task of generating new search blocks and processing blocks, since these will be done by separate workers. This commit updates the interface, this commit the tests, and this commit the implementation.

Finally, once all the rearchitecting and re-testing is done, we ditched our old populate_database entry script, and replaced it with a worker script for generating new search blocks, and one for claiming and processing blocks. These have the same general structure: infinitely loop, use the database and search strategy interfaces as expected, and then exponentially back off when failures occur, eventually quitting if there are too many repeated failures. In the case of the generator worker, we also have configuration around how many new blocks to create, what minimum threshold to trigger doing work, and a delay between checks (since the workers will be much slower than the generator).

Ship it!

Now we’re ready to launch it, and this time since we have improved our memory issues, we can use an AWS instance with smaller memory requirements, but launch four of them to allow one for each container. After a few of the following steps, this worked out just fine:

  • Configure an AWS security group (which limits which IP addresses EC2 instances can communicate with) so that Postgres communication was allowed between all the launched instances. This was made easy by the fact that you can configure a security group once so that it allows communication to and from any other instances with the same security group (in addition to my laptop’s IP).
  • Keep track of the ip addresses of each of the instances I launched, and make sure that the instance with a 60 GiB disk is the instance I put the database on.
  • Manually install docker and launch the right container on each one, and point their PGHOST variables to the address of the database instance.

I will improve this workflow in the next article, but for now I am letting it run for a few days and observing that it works. The generator job adds 100 new search blocks any time there are less than 100 eligible blocks, and this query shows they are getting worked on properly.

divisor=# select state, count(*) from searchmetadata group by state;
    state    | count 
 NOT_STARTED |   129
 FINISHED    |   367
 IN_PROGRESS |     4

As I was stopping and restarting jobs, two search blocks got stuck in the “in_progress” state, as you can see from the lagging-behind start time.

divisor=# select start_time, starting_search_index, ending_search_index from searchmetadata where state='IN_PROGRESS' order by start_time asc;
         start_time         | starting_search_index | ending_search_index 
 2021-02-09 04:46:13.616085 | 64,433492             | 64,683491    <-- stuck!
 2021-02-09 04:48:08.554847 | 65,191862             | 65,441861    <-- stuck!
 2021-02-09 14:22:08.331754 | 78,9803652            | 78,10053651
 2021-02-09 14:22:44.36657  | 78,10053652           | 78,10303651
(4 rows)

This suggests I should create a new job for cleaning up “stale” IN_PROGRESS blocks and marking them as failed. I’ll do this in the next article, and it won’t hurt us to leave some blocks in the transient state, because once the cleanup job is running it will make those blocks eligible to be worked on again, and the worker’s claim the earliest block first.

I also noticed an issue where I forgot a commit statement, which didn’t show up in testing, and I can’t quite figure out how to set up a test that requires this commit statement to be present. Nevertheless, after I noticed the problem, I patched it, stopped the containers, ran git pull on each instance, rebuilt the containers, and started them again, and it continued along as if nothing was wrong.

Since the search application is running well now for long periods of time, I’ll let it run for a week or two (or until something breaks), and return to it in the next article to see what bigger witness values we find. In the mean time, I’d like to discuss how I organized all of the changes that went into that massive pull request.

How to organize a big pull request

The pull request to change the application architecture is quite substantial, and I had a lot of cleanup I wanted to do along the way.

The basic pattern of the PR, which was repeated a few times throughout, was to split each unit of change into three parts: update the interface, update the tests that use the interface, and update the implementation. As an example, one commit updates the SearchStrategy interface, one the tests, and one the implementation.

Doing it this way helps in a few ways. First, I think a lot about how the interface should work, and I prove that it feels right by writing tests first. That way, I don’t get bogged down in the implementation details when thinking about how it will be used. Likewise, when doing this for the database change, after updating the interface and tests, I first updated the InMemoryDatabase implementation, because it was simpler, and having this made me confident the interface could be implemented decently, and allowed me to tweak the interface early, before turning to the harder problem of figuring out PostgreSQL FOR UPDATE stuff.

Much like the proof of a mathematical theorem, the final commits hide the work that went into this, and commit messages, code comments, and PR description provide a deeper explanation. Typically what I do is make changes, and then use git’s staging area concept to pull out different sub-changes within a file into a single commit. This blog post about “using git well” covers this in more detail. But this, and git’s rebase and amend features, allowed me to edit older commits so that the final pull request lays out the commits in a logical order.

It also allows me to split off the cleanup parts as best as possible, such as this commit to rename SearchState to SearchIndex (after I added the new “state” field). I made the commit, had some places I missed and fixed later, extracted them into their own commit, and rebased and squashed them together to unify the change. Keeping each commit as close to an atomic thought as possible (“add this test” or “implement this method”) enables easy squashing and reordering.

Though nobody is reviewing my code now (except you, dear reader, after the fact), a primary benefit of all this cleanup is that the code is much easier to review. The pull request as a whole is a feature, each commit is a smaller, but comprehensible piece of that bigger feature, and the commits are laid out in such a way that if you browse the commits in order, you get a clear picture of how the whole feature was assembled.

If you find yourself doing mathematical work in the context of software, you should expect to spend a lot of time reviewing code and having your code reviewed. The extra effort you put into making that process easier pays off by producing better feedback, uncovering more bugs, and reducing total review time. Though it might be considered a soft skill, smooth code reviews are a critical part of a well-functioning software organization.

Searching for RH Counterexamples — Performance Profiling

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search Strategies
  4. Unbounded integers
  5. Deploying with Docker

In the last article we ran into some performance issues with our deployed docker application. In this article we’ll dig in to see what happened, fix the problem, run into another problem, fix it, and run the search until we rule out RH witness-value-based counterexamples among all numbers with fewer than 85 prime factors.

When debugging any issue, there are some straightforward steps to follow.

  1. Gather information to determine where the problem is.
  2. Narrow down the source of the problem.
  3. Reproduce the problem in an isolated environment (on your local machine or a separate EC2 instance).
  4. Fix the problem.

So far what I know is that the search ran for days on EC2 with docker, and didn’t make significantly more progress than when I ran it on my local machine for a few hours.

Gathering information

Gathering information first is an important step that’s easy to skip when you think you know the problem. In my experience, spending an extra 5-10 minutes just looking around can save hours of work you might spend fixing the wrong problem.

So we know the application is slow. But there are many different kinds of slow. We could be using up all the CPU, running low on free RAM, throttled by network speeds, and more. To narrow down which of these might be the problem, we can look at the server’s resource usage.

The CPU utilization EC2 dashboard below shows that after just a few hours of running the application, the average CPU utilization drops to 5-10% and stays at that level. Running docker stats shows that the search container has over 5 GiB of unused RAM. df -h shows that the server has more than 60% of its disk space free.

The fact that the CPU spikes like this suggests that the program is doing its work, just with gaps in between the real work, and some sort of waiting consuming the rest of the time.

Another angle we can look at is the timestamps recorded in the SearchMetadata table.

divisor=# select * from searchmetadata order by start_time desc limit 9;
         start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state 
 2021-01-01 15:40:42.374536 | 2021-01-01 16:15:34.838774 | SuperabundantEnumerationIndex | 71,1696047            | 71,1946047
 2021-01-01 15:06:13.947216 | 2021-01-01 15:40:42.313078 | SuperabundantEnumerationIndex | 71,1446047            | 71,1696047
 2021-01-01 14:32:14.692185 | 2021-01-01 15:06:13.880209 | SuperabundantEnumerationIndex | 71,1196047            | 71,1446047
 2021-01-01 13:57:39.725843 | 2021-01-01 14:32:14.635433 | SuperabundantEnumerationIndex | 71,946047             | 71,1196047
 2021-01-01 13:25:53.376243 | 2021-01-01 13:57:39.615891 | SuperabundantEnumerationIndex | 71,696047             | 71,946047
 2021-01-01 12:59:14.58666  | 2021-01-01 13:25:53.331857 | SuperabundantEnumerationIndex | 71,446047             | 71,696047
 2021-01-01 12:43:08.503441 | 2021-01-01 12:59:14.541995 | SuperabundantEnumerationIndex | 71,196047             | 71,446047
 2021-01-01 12:27:49.698012 | 2021-01-01 12:43:08.450301 | SuperabundantEnumerationIndex | 70,4034015            | 71,196047
 2021-01-01 12:14:44.970486 | 2021-01-01 12:27:49.625687 | SuperabundantEnumerationIndex | 70,3784015            | 70,4034015

As you can see, computing a single block of divisor sums takes over a half hour in many cases! This is nice, because we can isolate the computation of a single block on our local machine and time it.

Narrowing Down

Generally, since we ran this application on a local machine just fine for longer than we ran it in docker on EC2, I would think the culprit is related to the new thing that was introduced: docker and/or EC2. My local machine is a Mac, and the EC2 machine was Ubuntu linux, so there could be a difference there. It’s also strange that the system only started to slow down after about two hours, instead of being slow the whole time. That suggests there is something wrong with scaling, i.e., what happens as the application starts to grow in its resource usage.

Just to rule out the possibility that it’s a problem with computing and storing large blocks, let’s re-test the block with starting index 71,196047 and ending index 71,446047, which took approximated 16 minutes on EC2. These three lines are between the start/end timestamps in the table. Only the second line does substantive work.

start_state = search_strategy.search_state()
end_state = search_strategy.search_state()

First we’ll just run the next_batch method, to remove the database upsert from consideration. We’ll also do it in a vanilla Python script, to remove the possibility that docker is introducing inefficiency, which is the most likely culprit, since docker is the new part from the last article. This commit has the timing test and the result shows that on average the block takes two minutes on my local machine.

Running the same code in a docker container on EC2 has the same result, which I verified by copying the divisorsearch.Dockerfile to a new dockerfile and replacing the entrypoint command with

ENTRYPOINT ["python3", "-u", "-m", "timing.ec2_timing_test"]

Then running (on a fresh EC2 instance with docker installed and the repo cloned)

docker build -t timing -f timing.Dockerfile .
docker run -dit --name timing --memory="15G" --env PGHOST="$PGHOST" timing:latest
docker logs -f timing

Once it finishes, I see it takes on average 40 seconds to compute the block. So any prior belief that the divisor computation might be the bottleneck are now gone.

Now let’s try the call to db.upsert, which builds the query and sends it over the network to the database container. Testing that in isolation—by computing the block first and then timing the upsert with this commit —shows it takes about 8 seconds to update an empty database with this block.

Both of these experiments were run on EC2 in docker, so we’ve narrowed down the problem enough to say that it only happens when the system has been running for that two hour window. My best guess is something to do with having a large database, or something to do with container-to-container networking transfer rates (the docker overhead). The next step in narrowing down the problem is to set up the scenario and run performance profiling tools on the system in situ.

I spin back up an r5.large. Then I run the deploy script and wait for two hours. It started slowing down after about 1h45m of runtime. At this point I stopped the divisorsearch container and ran the timing test container. And indeed, the upsert step has a strangely slow pattern (the numbers are in seconds):

Running sample 0
Running sample 1
Running sample 2
245.36470413208008  # 4 minutes
Running sample 3
1683.663686990738   # 28 minutes

At this point I killed the timing test. The initial few writes, while not great, are not as concerning as the massive degradation in performance over four writes. At this point I’m convinced that the database is the problem. So I read a lot about Postgres performance tuning (e.g. these docs and this blog post), which I had never done before.

It appears that the way Postgres inserts work is by writing to a “write cache,” and then later a background process copies the content of the write cache to disk. This makes sense because disk writes are slow. But if the write cache is full, then Postgres will wait until the write cache gets emptied before it can continue. This seems likely here: the first few writes are fast, and then when the cache fills up later writes take eons.

I also learned that the PRIMARY KEY part of our Postgres schema incurs a penalty to write performance, because in order to enforce the uniqueness Postgres needs to maintain a data structure and update it on every write. I can imagine that, because an mpz (unbounded integer) is our primary key, that data structure might be unprepared to handle such large keys. This might explain why these writes are taking so long in the first place, given that each write itself is relatively small (~2 MiB). In our case we can probably get by without a primary key on this table, and instead put any uniqueness constraints needed on the search metadata table.

So let’s change one of these and see if it fixes the problem. First we can drop the primary key constraint, which we can do in situ by running the query

alter table riemanndivisorsums drop constraint divisor_sum_pk;

This query takes a long time, which seems like a good sign. Perhaps it’s deleting a lot of data. Once that’s done, I removed the ON CONFLICT clause from the upsert command (it becomes just an insert command), and then rebuild and re-run the timing container. This shows that each insert takes only 3-4 seconds, which is much more reasonable.

Running sample 0
Running sample 1
Running sample 2
Running sample 3
Running sample 4

I am still curious about the possible fix by increasing the cache size, but my guess is that without the primary key change, increasing the cache size would just delay the application from slowing instead of fixing the problem permanently, so for now I will not touch it. Perhaps a reader with more experience with Postgres tuning would know better.

Updating the application

So with this new understanding, how should we update the application? Will anything go wrong if we delete the primary key on the RiemannDivisorSums table?

As suggested by our need to remove the ON CONFLICT clause, we might end up with duplicate rows. That is low risk, but scarier is if we stop the search mid-block and restart it, then if we’re really unlucky, we might get into a state where we skip a block and don’t notice, and that block contains the unique counterexample to the Riemann Hypothesis! We simply can’t let that happen. This is far too important. To be fair, this is a risk even before we remove the primary key. I’m just realizing it now.

We’ll rearchitect the system to mitigate that in a future post, and it will double as providing a means to scale horizontally. For now, this pull request removes the primary key constraint and does a bit of superficial cleanup. Now let’s re-run it!

Out of RAM again

This time, it ran for three hours before the divisorsearch container hit the 15 GiB RAM limit and crashed. Restarting the container and watching docker stats showed that it plain ran out of RAM. This is a much easier problem to solve because it involves only one container, and the virtual machine doesn’t slow to a crawl when it occurs, the container just stops running.

A quick python memory profile reveals the top 10 memory usages by line, the top being clocking in at about 2.2 GiB.

[ Top 10 ]
riemann/ size=2245 MiB, count=9394253, average=251 B
venv/lib/python3.7/site-packages/llvmlite/ir/ size=29.6 MiB, count=9224, average=3367 B
<string>:2: size=26.7 MiB, count=499846, average=56 B
riemann/ size=19.1 MiB, count=500003, average=40 B
riemann/ size=15.3 MiB, count=250000, average=64 B
riemann/ size=13.4 MiB, count=250000, average=56 B
venv/lib/python3.7/site-packages/llvmlite/ir/ size=3491 KiB, count=6411, average=558 B
venv/lib/python3.7/site-packages/llvmlite/ir/ size=2150 KiB, count=19267, average=114 B
riemann/ size=2066 KiB, count=1, average=2066 KiB
venv/lib/python3.7/site-packages/llvmlite/ir/ size=1135 KiB, count=2018, average=576 B

This line, not surprisingly, is the computation of the full list of partitions of n. The length of this list grows superpolynomially in n, which explains why it takes up all the RAM.

Since we only need to look at at most batch_size different partitions of n at a given time, we can probably fix this by adding a layer of indirection so that new sections of the partition list are generated on the fly and old sections are forgotten once the search strategy is done with them.

This pull request does that, and running the memory test again shows the 2.5 GiB line above reduced to 0.5 GiB. The pull request description also has some thoughts about the compute/memory tradeoff being made by this choice, and thoughts about how to improve the compute side.

So let’s run it again!! This time it runs for about 19 hours before crashing, as shown by the EC2 CPU usage metrics.

The application ran for about 19 hours before crashing.

According to the SearchMetadata table, we got up to part way through n=87

divisor=# select * from searchmetadata order by start_time desc limit 3;
         start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state 
 2021-01-27 00:00:11.59719  | 2021-01-27 00:01:14.709715 | SuperabundantEnumerationIndex | 87,15203332           | 87,15453332
 2021-01-26 23:59:01.285508 | 2021-01-27 00:00:11.596075 | SuperabundantEnumerationIndex | 87,14953332           | 87,15203332
 2021-01-26 23:57:59.809282 | 2021-01-26 23:59:01.284257 | SuperabundantEnumerationIndex | 87,14703332           | 87,14953332

Logging into the server and running df -h we can see the reason for the crash is that the disk is full. We filled up a 60 GiB SSD full of Riemann divisor sums. It’s quite an achievement. I’d like to thank my producers, the support staff, my mom, and the Academy.

But seriously, this is the best case scenario. Our limiting factor now is just how much we want to pay for disk space (and/or, any tricks we come up with to reduce disk space).

Analyzing the results, it seems we currently have 949 examples of numbers with witness values bigger than 1.767. Here are the top few:

                                                                                       n                                                                                        |                                                                                   divisor_sum                                                                                   |   witness_value    
 533187564151227457465199401229454876347036513892234205802944360099435118364718466037392872608220305945979716166395732328054742493039981726997486787797703088097204529280000    | 5634790045188254963919691913741193557789227457256740288527114259589826494137632464425981545755572664137227875215269764405574488353138596297944567958732800000000000000000000    | 1.7689901658397056
 1392134632248635659178066321747923959130643639405311242305337754828812319490126543178571468950966856255821713228187290673772173611070448373361584302343872292681996160000      | 14673932409344413968540864358701024890076113169939427834706026717681839828483417876109326942071803812857364258373098344806183563419631761192563979059200000000000000000000      | 1.7688977455109272
 3673178449204843427910465228886342900080853929829317262019360830682882109472629401526573796704398037614305311947723722094385682351109362462695473093255599716839040000         | 38615611603537931496160169365002697079147666236682704828173754520215367969693204937129807742294220560150958574666048275805746219525346739980431523840000000000000000000         | 1.7688303488154073
 2784269264497271318356132643495847918261287278810622484610675509657624638980253086357142937901933712511643426456374581347544347222140896746723168604687744585363992320000      | 29355033324995298095346770663840105972086801871471400577978104254473441181064775868425839681379597759477726740614478613571725301516068423098949435392000000000000000000000      |  1.768798862584946
 9847663402693950208875241900499578820592101688550448423644399009873678577674609655567221975078815114247467324256631962719532660458738237165403413118647720420480000            | 103250298405181635016471041082894911976330658386852151946988648449773711148912312666122480594369573690243204745096385764186487217982210534707036160000000000000000000           | 1.7687597437473672

The best so far achieves 1.76899. Recall out last best number achieved a witness value of 1.7679, a modest improvement if we hope to get to 1.82 (or even the supposedly infinitely many examples with witness value bigger than 1.81!).

What’s next?

We’re not stopping until we disprove the Riemann hypothesis. So strap in, this is going to be a long series.

Since the limiting factor in our search is currently storage space, I’d like to spend some time coming up with a means to avoid storing the witness value and divisor sum for every single number in our search strategy, but still maintain the ability to claim our result (no counterexamples below N) while providing the means for a skeptical person to verify our claims as much as they desire without also storing the entire search space on disk.

Once we free ourselves from disk space constraints, the second limiting factor is that our search uses only a single CPU at a time. We should rearchitect the system so that we can scale horizontally, meaning we can increase the search rate by using more computers. We can do this by turning our system into a “worker” model, in which there is a continually growing queue of search blocks to process, and each worker machine asks for the next block to compute, computes it, and then stores the result in the database. There are some tricks required to ensure the workers don’t step on each other’s toes, but it’s a pretty standard computational model.

There’s also the possibility of analyzing our existing database to try to identify what properties of the top n values make their witness values so large. We could do this by re-computing the prime factorization of each of the numbers and squinting at it very hard. The benefit of doing this would ultimately be to design a new SearchStrategy that might find larger witness values faster than the superabundant enumeration. That approach also has the risk that, without a proof that the new search strategy is exhaustive, we might accidentally skip over the unique counterexample to the Riemann hypothesis!

And then, of course, there’s the business of a front-end and plotting. But I’m having a bit too much fun with the back end stuff to worry about that for now. Protest in the comments if you prefer I work on visualization instead.

Postscript: a false start

Here’s one thing I tried during debugging that turned out to be a dead end: the linux perf tool. It is way more detailed than I can understand, but it provides a means to let me know if the program is stuck in the OS kernel or whether it’s the applications that are slow. I ran the perf tool when the deployed application was in the “good performance” regime (early in its run). I saw something like this:

Samples: 59K of event 'cpu-clock:pppH', Event count (approx.): 119984000000
Overhead  Command          Shared Object                                Symbol
  49.13%  swapper          [kernel.kallsyms]                            [k] native_safe_halt
   8.04%  python3                         [.] _PyEval_EvalFrameDefault
   1.35%  python3                         [.] _PyDict_LoadGlobal
   0.86%  python3                                 [.] realloc
   0.86%  python3                         [.] 0x0000000000139736
   0.77%  python3                         [.] PyTuple_New
   0.74%  python3                         [.] PyType_IsSubtype

It says half the time is spent by something called “swapper” doing something called native_safe_halt, and this in kernel mode ([k]), i.e., run by the OS. The rest of the list is dominated by python3 and postgres doing stuff like _PyDict_LoadGlobal which I assume is useful Python work. When I look up native_safe_halt (finding an explanation is surprisingly hard), I learn that this indicates the system is doing nothing. So 49% of the time the CPU is idle. This fits with good behavior in our case, because each docker container gets one of the two CPUs on the host, and the postgres container is most often doing nothing while waiting for the search container to send it data. This also matches the CPU utilization on the EC2 dashboard. So everything appears in order.

I ran perf again after the application started slowing down, and I saw that native_safe_halt is at 95%. This tells me nothing new, sadly. I also tried running it during the timing test and saw about 40% usage of some symbols like __softirqentry_text_start and __lock_text_start. Google failed to help me understand what that was, so it was a dead end.

Searching for RH Counterexamples — Unbounded Integers

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search strategies

In the last article, we improved our naive search from “try all positive integers” to enumerate a subset of integers (superabundant numbers), which RH counterexamples are guaranteed to be among. These numbers grow large, fast, and we quickly reached the limit of what 64 bit integers can store.

Unbounded integer arithmetic is possible on computers, but it requires a special software implementation. In brief, you represent numbers in base-N for some large N (say, 2^{32}), and then use a 32-bit integer for each digit. Arithmetic on such quantities emulates a ripple-carry adder, which naturally requires linear time in the number of digits of each operand. Artem Golubin has a nice explanation of how Python does it internally.

So Python can handle unbounded integer arithmetic, but neither numba nor our database engine do. Those both crash when exceeding 64-bit integers This is a problem because we won’t be able to store the results of our search without being able to put it in a database. This leaves us with a classic software engineering problem. What’s the path forward?

Exploring Alternatives

The impulse answer is to do as little as possible to make the damn thing work. In a situation where the software you’re writing is a prototype, and you expect it to be rewritten from scratch in the future, this is an acceptable attitude. That said, experienced engineers would caution you that, all too often, such “prototypes” are copy-pasted to become janky mission-critical systems for years.

In pretending this is the “real thing,” let’s do what real engineers would do and scope out some alternatives before diving in. The two aspects are our database and the use of numba for performance.

Let’s start with the database. A quick and dirty option: store all numbers as text strings in the database. There’s no limit on the size of the number in that case. The benefit: we don’t need to use a different database engine, and most of our code stays the same. The cost: we can’t use numeric operations in database queries, which would make further analysis and fetching awkward. In particular, we can’t even apply sorting operations, since text strings are sorted lexicographically (e.g., 100, 25) while numbers are sorted by magnitude (25, 100). Note, we applied this “numbers as text” idea to the problem of serializing the search state, and it was hacky there, too.

A second option is to find a database engine with direct support for unbounded-integer arithmetic. The benefit: fast database queries and the confidence that it will support future use cases well. The cost: if our existing sqlite-based interface doesn’t work with the new database engine, we’d have to write another implementation of our database interface.

For numba, we have at least three options. First, fall back to native python arithmetic, which is slow. Second, implement arbitrary-precision arithmetic in Python in a way that numba can compile it. Third, find (or implement) a C-implementation of arbitrary precision integer arithmetic, provide Python bindings, and optionally see if it can work with (or replace) numba. As I write this I haven’t yet tried any of these options. My intuition tells me the best way to go would be to find “proper” support for arbitrary precision integers.

For the database, I recall that the Postgres database engine supports various extensions, for example this extension that adds support for geographic objects. Postgres’s extension framework demonstrates an important software engineering principle that many of the best projects follow: “closed for modification, open for extension.” That is, Postgres is designed so that others can contribute new features to Postgres without requiring the Postgres team to do anything special—specifically, they don’t have to change Postgres to accommodate it. The name for this sometimes goes by extensions, or plug-ins, hooks, or (at a lower level) callbacks. Github Actions is a good example of this.

Geographic objects are almost certainly more complicated than arbitrary precision integers, so chances are good a Postgres extension exists for the latter. Incorporating it would involve migrating to Postgres, finding and installing that extension, and then converting the C library representation above to whatever representation Postgres accepts in a query.

A good route will also ensure that we need not change our tests too much, since all we’re doing here is modifying implementations. We’ll see how well that holds up.

gmp and pgmp

After some digging, I found GMP (GNU Multiple Precision), a C library written by Torbjörn Granlund. It has a Python bindings library called gmpy that allows Python to use an “mpz” (“Multiple Precision \mathbb{Z}“) type as a drop-in replacement for Python integers. And I found a PostgreSQL extension called pgmp. The standard Python library for Postgres is psycopg2, which was written by the same person who wrote pgmp, Daniele Varrazzo.

To start, I ran a timing test of gmpy, which proves to be as fast as numba. This pull request has the details.

It took a small bit of kicking to get pgmp to install, but then I made a test database that uses the new column type mpz and stores the value 2^{513}.

postgres=# create database pgmp_test;
postgres=# \connect pgmp_test;
You are now connected to database "pgmp_test" as user "jeremy".
pgmp_test=# CREATE EXTENSION pgmp;
pgmp_test=# create table test_table (id int4, value mpz);
pgmp_test=# insert into test_table
pgmp_test-# values (1, 2::mpz ^ 513);
pgmp_test=# select * from test_table;
 id |                                                                            value                                                                            
  1 | 26815615859885194199148049996411692254958731641184786755447122887443528060147093953603748596333806855380063716372972101707507765623893139892867298012168192
(1 row)

Now I’m pretty confident this approach will work.

This pull request includes the necessary commits to add a postgres implementation of our database interface, add tests (which is a minor nuisance).

Then this pull request converts the main divisor computation functions to use gmpy, and this final commit converts the main program to use the postgres database.

This exposed one bug, that I wasn’t converting the new mpz types properly in the postgres sql query. This commit fixes it, and this commit adds a regression test to catch that specific error going forward.

Results and next steps

With all that work, I ran the counterexample search for a few hours.

When I stopped it, it had checked all possibly-superabundant numbers whose prime factorizations have at most 75 prime factors, including multiplicity. Since all possible counterexamples to the RH must be superabundant, and all superabundant numbers have the aforementioned special prime factorization, we can say it more simply. I ruled out all positive integers whose prime factorization has at most 75 factors.

The top 10 are:

divisor=# select n, witness_value 
from RiemannDivisorSums 
where witness_value > 1.7 and n > 5040 
order by witness_value desc 
limit 10;
                                                                         n                                                                          |   witness_value    
 7837096340441581730115353880089927210115664131849557062713735873563599357074419807246597145310377220030504976899588686851652680862494751024960000  | 1.7679071291526642
 49445402778811241199465955079431717413978953513246416799455746836363402883750282695562127099750014006501608687063651021146073696293342277760000    |  1.767864530684858
 24722701389405620599732977539715858706989476756623208399727873418181701441875141347781063549875007003250804343531825510573036848146671138880000    |  1.767645098171234
 157972532839652527793820942745788234549453525601426251755449670403716942120607931934703281468849885004797471843653837128262216282087355520000      | 1.7676163327497005
 2149800120817880052150693699105726844086041457097670295628510732015800125380447073720092482597826695934852551611463087875916247664927925120000     |  1.767592584103948
 340743319149633988265884951308257704787637570949980741857118951024504319872800861184634658491755531305674129430416899428332725254891076131520000   |  1.767582883432923
 23511289021324745190346061640269781630346992395548671188141207620690798071223259421739791435931131660091514930698766060554958042587484253074880000 | 1.7674462177172812
 507950266365442211555694349664913937458049921547994378634886400011951582381375986928306371282475514484879330686989829994412271003496320000         | 1.7674395010995763
 78986266419826263896910471372894117274726762800713125877724835201858471060303965967351640734424942502398735921826918564131108141043677760000       | 1.7674104158678667
 6868370993028370773644388815034271067367544591366358771976072626248562700895997040639273107341299348034672688854514657750531142699450240000        | 1.7674059308384011

This is new. We’ve found quite a few numbers that have a better witness value than n = 10080 which achieves ~1.7558. The best is


which achieves ~1.7679. Recall the 1.781 threshold needed to be a RH counterexample. We’re about 50% of the way toward disproving RH. How much more work could it take?

But seriously, what’s next with this project? For one, even though we have some monstrous numbers and their divisor sums and witness values, it’s hard to see the patterns in them through a SQL queries. It would be nicer to make some pretty plots.

I could also take a step back and see what could be improved from a software engineering standpoint. For one, not all parts of the application are tested, and tests aren’t automatically run when I make changes. This enabled the bug above where I didn’t properly convert mpz types before passing them to SQL upsert statements. For two, while I have been using type annotations in some places, they aren’t checked, and the switch to mpz has almost certainly made many of the type hints incorrect. I could fix that and set up a test that type checks.

Finally, in the interest of completeness, I could set up a front end that displays some summary of the data, and then deploy the whole application so that it has a continuously-running background search, along with a website that shows how far along the search is. Based on how long the SQL statement to find the top 10 witness values took, this would also likely require some caching, which fits snugly in the class of typical software engineering problems.

Let me know what you’re interested in.