We’re ironically searching for counterexamples to the Riemann Hypothesis.
- Setting up Pytest
- Adding a Database
- Search Strategies
- Unbounded integers
- Deploying with Docker
- Performance Profiling
Last time we made the audacious choice to remove primary keys from the RiemannDivisorSums table for performance reasons. To help with that, we will do two things in this post:
- Reduce the storage footprint of the whole application (it was 60 GiB when it crashed, and we got up to 84 prime factors).
- Refactor the application into a worker architecture.
The first one is straightforward, but relies on an idea that may be new to mathematicians; the second is a relatively major rearchitecture.
Hashing a search block
Instead of storing every single Riemann divisor sum, we can just store the ones that are interesting, that is, the ones that are larger than some threshold—in our case, 1.767, which had just under a thousand examples in the last post.
That raises a problem. We want to be able to say that we checked all examples up to some limit. We’ve been saying “up to all numbers having at most K prime factors” because that’s how the superabundant search strategy enumerates the candidates. If we don’t store all the witness values, how can we prove to a skeptical observer that we checked what we said we checked?
One approach is to instead store a “summary string” derived from the full set of witness values. This “summary string” should be deterministically reproducible, so that if someone has the metadata about the block of numbers, they can recompute the summary and compare it with our summary.
A simple example summary might be the string concatenation of all the witness values one after another, like 1.4323,1.5678,0.8792. But that still takes up a lot of space (each block had 250,000 different witness values). It would be better if we could find a summary that has a fixed size, regardless of how big the input is. Taking a cue from computer science, a hash function does exactly that. Hash functions aren’t perfect; they can have collisions, but collisions are unlikely enough that, if you tested a hundred random blocks and all your hashes agreed with all of my hashes, you would be confident I checked everything the same way you did. That gives you confidence that the rest of my blocks (say, a million of them) were computed correctly.
The hash function we’ll use is called sha256, and we add it in this pull request. Now the database application shouldn’t need a ton of disk space to run. (We’ll confirm this later in the article.)
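To make the idea concrete, here is a rough sketch of how such a block summary could be computed in Python. The function name and the exact serialization format are illustrative choices of mine, not necessarily what the pull request does; the important property is that the same witness values in the same order always produce the same fixed-size digest.

```python
import hashlib


def hash_block_summary(witness_values):
    """Compute a deterministic, fixed-size summary of a block's witness values.

    The serialization (comma-separated, fixed decimal precision) is a
    hypothetical choice; any scheme works as long as everyone who wants
    to verify the block uses the same one.
    """
    serialized = ",".join("%.8f" % v for v in witness_values)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Anyone recomputing the same block gets the same 64-character hex digest,
# no matter how many witness values went into it.
print(hash_block_summary([1.4323, 1.5678, 0.8792]))
```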
For more background and related topics around hash functions, see Load Balancing and the Power of Hashing and Hashing to Estimate the Size of a Stream.
Worker architecture
Next we want to rearchitect the application to enable two things: better protection against the possibility of duplicating/skipping divisor sum computations, and allowing us to scale up the rate of computation by adding additional computers to the mix. This pull request achieves the goal, but note that it’s a large change, so let’s cover the broad ideas, then zoom in on a few details, and finally discuss the PR’s structure as a whole from the perspective of software engineering.
We start by adding a “state” field to a search block. The state can be one of NOT_STARTED, IN_PROGRESS, FINISHED, or FAILED. Then one worker job (read: container) will be responsible for computing new blocks in the NOT_STARTED state, and a group of worker jobs will continually “claim” an eligible block, compute the divisor sums, and then mark it as finished and insert the interesting witness values it found.
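As a rough sketch of the lifecycle (the actual field in the PR may well be stored as a plain string column rather than a Python enum), the states could be modeled like so:

```python
from enum import Enum


class SearchBlockState(Enum):
    """Lifecycle of a search block, as described above.

    The class and member names here are illustrative, not copied from the PR.
    """
    NOT_STARTED = "NOT_STARTED"   # generated, waiting to be claimed
    IN_PROGRESS = "IN_PROGRESS"   # claimed by a worker, being computed
    FINISHED = "FINISHED"         # divisor sums computed, witnesses stored
    FAILED = "FAILED"             # something went wrong; eligible for cleanup
```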
This suggests a drastic change to the database interface, as shown in this commit, to align with the new “claim/finish” workflow. We replace the existing functions with insert_search_blocks, claim_next_search_block, and finish_search_block. The first two operate only on the SearchMetadata table, while the third couples together the changes to both tables by marking a block as finished and inserting the relevant witness values. We’ll see how we ensure these two updates occur atomically in a moment. Also, because of the new coupling of the two tables, I merged the two database interfaces into one.
We need two safety measures to make this work. First, we need to ensure that we don’t accidentally generate duplicate search blocks. We do this by adding a uniqueness constraint on the bounds of a search block at the database level. If a client calls insert_search_blocks with something that’s already in the database, the database itself will reject it, and the client will be forced to catch the failure and recover.
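For illustration, the client-side recovery might look roughly like this. The column names mirror the ones that show up later in this post, but the helper function and the block attributes are hypothetical; the point is only that the database raises an integrity error on a duplicate and the client rolls back and moves on.

```python
import psycopg2


def insert_search_blocks_safely(connection, blocks):
    """Sketch: insert new blocks, tolerating a database-level uniqueness rejection."""
    try:
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO SearchMetadata "
                "(starting_search_index, ending_search_index, state) "
                "VALUES (%s, %s, 'NOT_STARTED')",
                [(b.start, b.end) for b in blocks],  # hypothetical block attributes
            )
        connection.commit()
    except psycopg2.IntegrityError:
        # One of these blocks already exists; undo the attempt and recover.
        connection.rollback()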
Second, we need to ensure that two workers don’t claim the same search block. This involves a bit of database wizardry; I had to do some research to make it work, and it took a few attempts. The gist of the problem is that claiming a search block requires two queries: first, you look up the oldest eligible search block, and then you mark it as claimed. But if you have multiple workers accessing the database simultaneously, you can get this order of operations:
- Worker 1 runs lookup query, gets search block 7
- Worker 2 runs lookup query, also gets search block 7
- Worker 1 runs update query, marks search block 7 as claimed
- Worker 2 runs update query, also marks search block 7 as claimed.
This can even happen if there’s only “one” query, and the lookup step is a subquery. After reading a bit about PostgreSQL’s transaction isolation guarantees, I learned that you can “lock” the rows read by the subquery until the update step is complete by adding FOR UPDATE to the subquery. This commit has the full query, but it looks something like this:
UPDATE SearchMetadata
SET state = 'IN_PROGRESS'
FROM (
SELECT starting_search_index
FROM SearchMetadata
WHERE state = 'NOT_STARTED'
ORDER BY creation_time ASC
LIMIT 1
  FOR UPDATE  -- <-- key clause!
) as m
WHERE
starting_search_index = m.starting_search_index;
I also added some special tests (that use the finished worker jobs) that will specifically fail when this FOR UPDATE clause is removed, to confirm it fixes the behavior. This is particularly important because it’s hard to test database queries that protect against race conditions. Plus, I tried other approaches and got them wrong, which proved the value of having such tests.
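To give a flavor of the kind of test I mean (this is a sketch, not the literal test in the repository), one can fire many claims at the database concurrently and assert that no block is handed out twice. In practice each thread should use its own database connection so the race can actually occur; the `database` argument and the block attribute below are placeholders.

```python
import threading


def test_concurrent_claims_do_not_overlap(database):
    """Sketch of a race-condition test: many simultaneous claims, no duplicates."""
    claimed, lock = [], threading.Lock()

    def claim():
        block = database.claim_next_search_block()
        if block is not None:
            with lock:
                claimed.append(block.starting_search_index)

    threads = [threading.Thread(target=claim) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Without FOR UPDATE, two threads can claim the same block and this fails.
    assert len(claimed) == len(set(claimed))
```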
Finally, the finish_search_block function needs to ensure that the updates to the SearchMetadata and RiemannDivisorSums tables happen atomically. This is done in this commit by leveraging the concept of a database transaction. In a transaction, all of the queries between BEGIN and COMMIT happen at once or not at all, e.g., when a later query fails. The “not at all” case is called a rollback.
Thankfully, the psycopg2 library we’re using opens a transaction by default and requires us to call connection.commit to make any query persist. So we are already using transactions and get the guarantee without extra work.
Next, we needed to update the SearchStrategy interface to separate the task of generating new search blocks from the task of processing them, since these will be done by separate workers. This commit updates the interface, this commit the tests, and this commit the implementation.
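In spirit, the split looks something like this sketch; the method names are my own paraphrase of the separation, not necessarily the identifiers used in the commit.

```python
from abc import ABC, abstractmethod
from typing import List


class SearchStrategy(ABC):
    """Sketch of the separated responsibilities of the search strategy."""

    @abstractmethod
    def generate_search_blocks(self, count: int) -> List["SearchBlock"]:
        """Produce the next `count` blocks of candidates to check, without
        computing any divisor sums (used by the generator worker)."""

    @abstractmethod
    def process_block(self, block: "SearchBlock") -> List["WitnessValue"]:
        """Compute divisor sums for every candidate in the block and return
        the interesting witness values (used by the claiming workers)."""
```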
Finally, once all the rearchitecting and re-testing was done, we ditched our old populate_database entry script and replaced it with two worker scripts: one for generating new search blocks, and one for claiming and processing blocks. These have the same general structure: loop forever, use the database and search strategy interfaces as expected, and exponentially back off when failures occur, eventually quitting if there are too many repeated failures. In the case of the generator worker, we also have configuration for how many new blocks to create, the minimum number of eligible blocks below which new blocks are generated, and a delay between checks (since the workers will be much slower than the generator).
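Both workers share the same skeleton; here is a rough sketch of the claiming worker’s loop under those assumptions. The constants and helper names are illustrative, not the values used in the worker scripts.

```python
import time

MAX_FAILURES = 5          # illustrative: quit after this many consecutive failures
BASE_DELAY_SECONDS = 1    # illustrative: starting point for the exponential backoff


def process_blocks_forever(database, search_strategy):
    """Sketch of the claiming worker: loop forever, backing off on failure."""
    failures = 0
    while failures < MAX_FAILURES:
        try:
            block = database.claim_next_search_block()
            if block is None:
                time.sleep(BASE_DELAY_SECONDS)  # nothing eligible; check again soon
                continue
            witnesses = search_strategy.process_block(block)
            database.finish_search_block(block, witnesses)
            failures = 0  # a successful block resets the backoff
        except Exception:
            failures += 1
            time.sleep(BASE_DELAY_SECONDS * 2 ** failures)  # exponential backoff
```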
Ship it!
Now we’re ready to launch. This time, since we’ve improved our memory issues, we can use AWS instances with smaller memory requirements, launching four of them, one for each container. After the following few steps, this worked out just fine:
- Configure an AWS security group (which limits which IP addresses EC2 instances can communicate with) so that Postgres communication was allowed between all the launched instances. This was made easy by the fact that you can configure a security group once so that it allows communication to and from any other instances with the same security group (in addition to my laptop’s IP).
- Keep track of the IP addresses of each of the instances I launched, and make sure the instance with a 60 GiB disk is the one I put the database on.
- Manually install Docker and launch the right container on each one, and point their PGHOST variables to the address of the database instance.
I will improve this workflow in the next article, but for now I am letting it run for a few days and observing that it works. The generator job adds 100 new search blocks any time there are fewer than 100 eligible blocks, and this query shows they are getting worked on properly.
divisor=# select state, count(*) from searchmetadata group by state;
    state    | count
-------------+-------
 NOT_STARTED |   129
 FINISHED    |   367
 IN_PROGRESS |     4
As I was stopping and restarting jobs, two search blocks got stuck in the IN_PROGRESS state, as you can see from the lagging-behind start times.
divisor=# select start_time, starting_search_index, ending_search_index from searchmetadata where state='IN_PROGRESS' order by start_time asc;
start_time | starting_search_index | ending_search_index
----------------------------+-----------------------+---------------------
2021-02-09 04:46:13.616085 | 64,433492 | 64,683491 <-- stuck!
2021-02-09 04:48:08.554847 | 65,191862 | 65,441861 <-- stuck!
2021-02-09 14:22:08.331754 | 78,9803652 | 78,10053651
2021-02-09 14:22:44.36657 | 78,10053652 | 78,10303651
(4 rows)
This suggests I should create a new job for cleaning up “stale” IN_PROGRESS blocks and marking them as failed. I’ll do this in the next article. It won’t hurt to leave some blocks in this transient state, because once the cleanup job is running it will make those blocks eligible to be worked on again, and the workers claim the earliest block first.
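When I get to it, that cleanup job might boil down to a single periodic query along these lines. This is only a sketch of the future job: the one-hour staleness threshold is a placeholder, and reclaiming would also require the claim query to treat cleaned-up blocks as eligible again.

```python
def fail_stale_blocks(connection):
    """Sketch of a future cleanup job: mark long-claimed, unfinished blocks FAILED."""
    with connection.cursor() as cursor:
        cursor.execute(
            "UPDATE SearchMetadata SET state = 'FAILED' "
            "WHERE state = 'IN_PROGRESS' "
            "AND start_time < now() - interval '1 hour'"  # placeholder threshold
        )
    connection.commit()
```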
I also noticed an issue where I forgot a commit statement, which didn’t show up in testing, and I can’t quite figure out how to set up a test that requires the commit statement to be present. Nevertheless, once I noticed the problem, I patched it, stopped the containers, ran git pull on each instance, rebuilt the containers, and started them again, and everything continued along as if nothing was wrong.
Since the search application is now running well for long periods of time, I’ll let it run for a week or two (or until something breaks) and return to it in the next article to see what bigger witness values we find. In the meantime, I’d like to discuss how I organized all of the changes that went into that massive pull request.
How to organize a big pull request
The pull request to change the application architecture is quite substantial, and I had a lot of cleanup I wanted to do along the way.
The basic pattern of the PR, which was repeated a few times throughout, was to split each unit of change into three parts: update the interface, update the tests that use the interface, and update the implementation. As an example, one commit updates the SearchStrategy interface, one the tests, and one the implementation.
Doing it this way helps in a few ways. First, I think a lot about how the interface should work, and I prove that it feels right by writing tests first. That way, I don’t get bogged down in the implementation details when thinking about how it will be used. Likewise, when doing this for the database change, after updating the interface and tests, I first updated the InMemoryDatabase implementation because it was simpler. Having that in place made me confident the interface could be implemented decently, and it let me tweak the interface early, before turning to the harder problem of figuring out PostgreSQL’s FOR UPDATE behavior.
Much like the proof of a mathematical theorem, the final commits hide the work that went into them; the commit messages, code comments, and PR description provide a deeper explanation. Typically what I do is make changes, and then use git’s staging area to pull different sub-changes within a file into separate commits. This blog post about “using git well” covers this in more detail. That, along with git’s rebase and amend features, allowed me to edit older commits so that the final pull request lays out the commits in a logical order.
It also allows me to split off the cleanup parts as cleanly as possible, such as this commit to rename SearchState to SearchIndex (after I added the new “state” field). I made the commit, later fixed some places I had missed, extracted those fixes into their own commit, and rebased and squashed them together to unify the change. Keeping each commit as close to an atomic thought as possible (“add this test” or “implement this method”) enables easy squashing and reordering.
Though nobody is reviewing my code now (except you, dear reader, after the fact), a primary benefit of all this cleanup is that the code is much easier to review. The pull request as a whole is a feature, each commit is a smaller, but comprehensible piece of that bigger feature, and the commits are laid out in such a way that if you browse the commits in order, you get a clear picture of how the whole feature was assembled.
If you find yourself doing mathematical work in the context of software, you should expect to spend a lot of time reviewing code and having your code reviewed. The extra effort you put into making that process easier pays off by producing better feedback, uncovering more bugs, and reducing total review time. Though it might be considered a soft skill, smooth code reviews are a critical part of a well-functioning software organization.