Searching for RH Counterexamples — Scaling Up

We’re ironically searching for counterexamples to the Riemann Hypothesis.

Last time we made the audacious choice to remove primary keys from the RiemannDivisorSums table for performance reasons. To help with that, we will do two things in this post

• Reduce the storage footprint of the whole application (it was 60 GiB when it crashed, and we got up to 84 prime factors).
• Refactor the application into a worker architecture.

The first one is straightforward, but uses a possibly new idea to the mathematician, and the second is a relatively major rearchitecture.

Hashing a search block

Instead of storing every single Riemann divisor sum, we can just store the ones that are interesting, that is, the ones that are larger than some threshold—in our case, 1.767, which had just under a thousand examples in the last post.

That raises a problem. We want to be able to say that we checked all examples up to some limit. We’ve been saying “up to all numbers having at most K prime factors” because that’s how the superabundant search strategy enumerates the candidates. If we don’t store all the witness values, how can we prove to a skeptical observer that we checked what we said we checked?

One approach is to instead store a “summary string” derived from the full set of witness values. This “summary string” should be deterministically reproducible, so that if someone has the metadata about the block of numbers, they can recompute the summary and compare it with our summary.

A simple example summary might be the string-concatenation of all the witness values one after another, like 1.4323,1.5678,0.8792. But that still takes up a lot of space (each block had 250,000 different witness values). It would be better if we could find a summary that had a fixed size, regardless of how big the input is. Taking a cue from computer science, a hash function does exactly that. Hash functions aren’t perfect, they can have collisions, but the collisions are unlikely enough that, if you tested a hundred random blocks and all your hashes agreed with all of my hashes, you would be confident I checked everything the same as you. That gives you confidence that the rest of my blocks (say, a million of them) were computed correctly.

The hash function we’ll use is called sha256, and we add it in this pull request. Now the database application shouldn’t need a ton of disk space to run. (We’ll confirm this later in the article)

For more background and related topics around hash functions, see Load Balancing and the Power of Hashing and Hashing to Estimate the Size of a Stream.

Worker architecture

Next we want to rearchitect the application to enable two things: better protection against the possibility of duplicating/skipping divisor sum computations, and allowing us to scale up the rate of computation by adding additional computers to the mix. This pull request achieves the goal, but note that it’s a large change, so let’s cover the broad ideas, then zoom in on a few details, and finally discuss the PR’s structure as a whole from the perspective of software engineering.

We start by adding a “state” field to a search block. The state can be one of NOT_STARTED, IN_PROGRESS, FINISHED, FAILED. Then one worker job (read, container) will be responsible for computing new blocks in the NOT_STARTED state, and a group of workers jobs will continually “claim” an eligible block, compute the divisor sums, and then mark it as finished and insert the interesting witness values it found.

This suggests a drastic change to the database interface, as shown in this commit, to align with the new “claim/finish” workflow. We replace the existing functions with insert_search_blocks, claim_next_search_block, and finish_search_block. The first two operate only on the SearchMetadata table, while the third couples together the changes to both tables, by marking a block as finished and inserting the relevant witness values. We’ll see how we ensure these two updates occur atomically in a moment. Also, because of the new coupling of the two tables, I merged the two database interfaces into one.

We need two safety measures to make this work. First, we need to ensure that we don’t accidentally generate multiple duplicate search blocks. We do this by adding a uniqueness constraint on the bounds of a search block at the database level. If a client calls insert_search_blocks with something that’s already in the database, the database itself will reject it, and the client will be forced to catch the failure and recover.

Second, we need to ensure that two workers don’t claim the same search block. This involves a bit of database wizardry. I had to do a bit of research to make it work, and it took a few attempts. The gist of the problem is that if you want to claim a search block, you need to do two queries. First, you need to look up the oldest eligible search block that you want to claim. Then, you need to mark it as claimed. But if you have multiple workers accessing the database simultaneously, you can get this order of operations

1. Worker 1 runs lookup query, gets search block 7
2. Worker 2 runs lookup query, also gets search block 7
3. Worker 1 runs update query, marks search block 7 as claimed
4. Worker 2 runs update query, also marks search block 7 as claimed.

This can even happen if there’s only “one” query, and the lookup step is a subquery. After reading a bit about PostgreSQL’s transaction isolation guarantees, I learned that you can “lock” the rows read by the subquery until the update step is complete by adding FOR UPDATE to the subquery. This commit has the full query, but it looks something like this

UPDATE SearchMetadata
SET state = 'IN_PROGRESS'
FROM (
SELECT starting_search_index
WHERE state = 'NOT_STARTED'
ORDER BY creation_time ASC
LIMIT 1
FOR UPDATE         <---- key clause!
) as m
WHERE
starting_search_index = m.starting_search_index;


I also added some special tests (that use the finished worker jobs) that will specifically fail when this FOR UPDATE clause is removed, to confirm it fixes the behavior. This is particularly important because it’s hard to test database queries that protect against race conditions. Plus, I tried other approaches and got it wrong, which proved the value of having such tests.

Finally, the finish_search_block function needs to ensure that the update of the SearchMetadata and RiemannDivisorSums tables happen atomically. This is done in this commit by leveraging the concept of a database transaction. In a transaction, all of the queries between BEGIN and COMMIT happen at once or not at all, e.g., in the case that a later query fails. The “not at all” case is called a rollback.

Thankfully, the psycopg2 library we’re using uses transactions by default, and requires us to call connection.commit to make any query persist. So we are already using transactions and get the guarantee without extra work.

Next, we needed to update the SearchStrategy interface to separate the task of generating new search blocks and processing blocks, since these will be done by separate workers. This commit updates the interface, this commit the tests, and this commit the implementation.

Finally, once all the rearchitecting and re-testing is done, we ditched our old populate_database entry script, and replaced it with a worker script for generating new search blocks, and one for claiming and processing blocks. These have the same general structure: infinitely loop, use the database and search strategy interfaces as expected, and then exponentially back off when failures occur, eventually quitting if there are too many repeated failures. In the case of the generator worker, we also have configuration around how many new blocks to create, what minimum threshold to trigger doing work, and a delay between checks (since the workers will be much slower than the generator).

Ship it!

Now we’re ready to launch it, and this time since we have improved our memory issues, we can use an AWS instance with smaller memory requirements, but launch four of them to allow one for each container. After a few of the following steps, this worked out just fine:

• Configure an AWS security group (which limits which IP addresses EC2 instances can communicate with) so that Postgres communication was allowed between all the launched instances. This was made easy by the fact that you can configure a security group once so that it allows communication to and from any other instances with the same security group (in addition to my laptop’s IP).
• Keep track of the ip addresses of each of the instances I launched, and make sure that the instance with a 60 GiB disk is the instance I put the database on.
• Manually install docker and launch the right container on each one, and point their PGHOST variables to the address of the database instance.

I will improve this workflow in the next article, but for now I am letting it run for a few days and observing that it works. The generator job adds 100 new search blocks any time there are less than 100 eligible blocks, and this query shows they are getting worked on properly.

divisor=# select state, count(*) from searchmetadata group by state;
state    | count
-------------+-------
NOT_STARTED |   129
FINISHED    |   367
IN_PROGRESS |     4


As I was stopping and restarting jobs, two search blocks got stuck in the “in_progress” state, as you can see from the lagging-behind start time.

divisor=# select start_time, starting_search_index, ending_search_index from searchmetadata where state='IN_PROGRESS' order by start_time asc;
start_time         | starting_search_index | ending_search_index
----------------------------+-----------------------+---------------------
2021-02-09 04:46:13.616085 | 64,433492             | 64,683491    <-- stuck!
2021-02-09 04:48:08.554847 | 65,191862             | 65,441861    <-- stuck!
2021-02-09 14:22:08.331754 | 78,9803652            | 78,10053651
2021-02-09 14:22:44.36657  | 78,10053652           | 78,10303651
(4 rows)


This suggests I should create a new job for cleaning up “stale” IN_PROGRESS blocks and marking them as failed. I’ll do this in the next article, and it won’t hurt us to leave some blocks in the transient state, because once the cleanup job is running it will make those blocks eligible to be worked on again, and the worker’s claim the earliest block first.

I also noticed an issue where I forgot a commit statement, which didn’t show up in testing, and I can’t quite figure out how to set up a test that requires this commit statement to be present. Nevertheless, after I noticed the problem, I patched it, stopped the containers, ran git pull on each instance, rebuilt the containers, and started them again, and it continued along as if nothing was wrong.

Since the search application is running well now for long periods of time, I’ll let it run for a week or two (or until something breaks), and return to it in the next article to see what bigger witness values we find. In the mean time, I’d like to discuss how I organized all of the changes that went into that massive pull request.

How to organize a big pull request

The pull request to change the application architecture is quite substantial, and I had a lot of cleanup I wanted to do along the way.

The basic pattern of the PR, which was repeated a few times throughout, was to split each unit of change into three parts: update the interface, update the tests that use the interface, and update the implementation. As an example, one commit updates the SearchStrategy interface, one the tests, and one the implementation.

Doing it this way helps in a few ways. First, I think a lot about how the interface should work, and I prove that it feels right by writing tests first. That way, I don’t get bogged down in the implementation details when thinking about how it will be used. Likewise, when doing this for the database change, after updating the interface and tests, I first updated the InMemoryDatabase implementation, because it was simpler, and having this made me confident the interface could be implemented decently, and allowed me to tweak the interface early, before turning to the harder problem of figuring out PostgreSQL FOR UPDATE stuff.

Much like the proof of a mathematical theorem, the final commits hide the work that went into this, and commit messages, code comments, and PR description provide a deeper explanation. Typically what I do is make changes, and then use git’s staging area concept to pull out different sub-changes within a file into a single commit. This blog post about “using git well” covers this in more detail. But this, and git’s rebase and amend features, allowed me to edit older commits so that the final pull request lays out the commits in a logical order.

It also allows me to split off the cleanup parts as best as possible, such as this commit to rename SearchState to SearchIndex (after I added the new “state” field). I made the commit, had some places I missed and fixed later, extracted them into their own commit, and rebased and squashed them together to unify the change. Keeping each commit as close to an atomic thought as possible (“add this test” or “implement this method”) enables easy squashing and reordering.

Though nobody is reviewing my code now (except you, dear reader, after the fact), a primary benefit of all this cleanup is that the code is much easier to review. The pull request as a whole is a feature, each commit is a smaller, but comprehensible piece of that bigger feature, and the commits are laid out in such a way that if you browse the commits in order, you get a clear picture of how the whole feature was assembled.

If you find yourself doing mathematical work in the context of software, you should expect to spend a lot of time reviewing code and having your code reviewed. The extra effort you put into making that process easier pays off by producing better feedback, uncovering more bugs, and reducing total review time. Though it might be considered a soft skill, smooth code reviews are a critical part of a well-functioning software organization.

Searching for RH Counterexamples — Performance Profiling

We’re ironically searching for counterexamples to the Riemann Hypothesis.

In the last article we ran into some performance issues with our deployed docker application. In this article we’ll dig in to see what happened, fix the problem, run into another problem, fix it, and run the search until we rule out RH witness-value-based counterexamples among all numbers with fewer than 85 prime factors.

When debugging any issue, there are some straightforward steps to follow.

1. Gather information to determine where the problem is.
2. Narrow down the source of the problem.
3. Reproduce the problem in an isolated environment (on your local machine or a separate EC2 instance).
4. Fix the problem.

So far what I know is that the search ran for days on EC2 with docker, and didn’t make significantly more progress than when I ran it on my local machine for a few hours.

Gathering information

Gathering information first is an important step that’s easy to skip when you think you know the problem. In my experience, spending an extra 5-10 minutes just looking around can save hours of work you might spend fixing the wrong problem.

So we know the application is slow. But there are many different kinds of slow. We could be using up all the CPU, running low on free RAM, throttled by network speeds, and more. To narrow down which of these might be the problem, we can look at the server’s resource usage.

The CPU utilization EC2 dashboard below shows that after just a few hours of running the application, the average CPU utilization drops to 5-10% and stays at that level. Running docker stats shows that the search container has over 5 GiB of unused RAM. df -h shows that the server has more than 60% of its disk space free.

The fact that the CPU spikes like this suggests that the program is doing its work, just with gaps in between the real work, and some sort of waiting consuming the rest of the time.

Another angle we can look at is the timestamps recorded in the SearchMetadata table.

divisor=# select * from searchmetadata order by start_time desc limit 9;
start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state
----------------------------+----------------------------+-------------------------------+-----------------------+---------------------
2021-01-01 15:40:42.374536 | 2021-01-01 16:15:34.838774 | SuperabundantEnumerationIndex | 71,1696047            | 71,1946047
2021-01-01 15:06:13.947216 | 2021-01-01 15:40:42.313078 | SuperabundantEnumerationIndex | 71,1446047            | 71,1696047
2021-01-01 14:32:14.692185 | 2021-01-01 15:06:13.880209 | SuperabundantEnumerationIndex | 71,1196047            | 71,1446047
2021-01-01 13:57:39.725843 | 2021-01-01 14:32:14.635433 | SuperabundantEnumerationIndex | 71,946047             | 71,1196047
2021-01-01 13:25:53.376243 | 2021-01-01 13:57:39.615891 | SuperabundantEnumerationIndex | 71,696047             | 71,946047
2021-01-01 12:59:14.58666  | 2021-01-01 13:25:53.331857 | SuperabundantEnumerationIndex | 71,446047             | 71,696047
2021-01-01 12:43:08.503441 | 2021-01-01 12:59:14.541995 | SuperabundantEnumerationIndex | 71,196047             | 71,446047
2021-01-01 12:27:49.698012 | 2021-01-01 12:43:08.450301 | SuperabundantEnumerationIndex | 70,4034015            | 71,196047
2021-01-01 12:14:44.970486 | 2021-01-01 12:27:49.625687 | SuperabundantEnumerationIndex | 70,3784015            | 70,4034015


As you can see, computing a single block of divisor sums takes over a half hour in many cases! This is nice, because we can isolate the computation of a single block on our local machine and time it.

Narrowing Down

Generally, since we ran this application on a local machine just fine for longer than we ran it in docker on EC2, I would think the culprit is related to the new thing that was introduced: docker and/or EC2. My local machine is a Mac, and the EC2 machine was Ubuntu linux, so there could be a difference there. It’s also strange that the system only started to slow down after about two hours, instead of being slow the whole time. That suggests there is something wrong with scaling, i.e., what happens as the application starts to grow in its resource usage.

Just to rule out the possibility that it’s a problem with computing and storing large blocks, let’s re-test the block with starting index 71,196047 and ending index 71,446047, which took approximated 16 minutes on EC2. These three lines are between the start/end timestamps in the table. Only the second line does substantive work.

start_state = search_strategy.search_state()
db.upsert(search_strategy.next_batch(batch_size))
end_state = search_strategy.search_state()


First we’ll just run the next_batch method, to remove the database upsert from consideration. We’ll also do it in a vanilla Python script, to remove the possibility that docker is introducing inefficiency, which is the most likely culprit, since docker is the new part from the last article. This commit has the timing test and the result shows that on average the block takes two minutes on my local machine.

Running the same code in a docker container on EC2 has the same result, which I verified by copying the divisorsearch.Dockerfile to a new dockerfile and replacing the entrypoint command with

ENTRYPOINT ["python3", "-u", "-m", "timing.ec2_timing_test"]


Then running (on a fresh EC2 instance with docker installed and the repo cloned)

docker build -t timing -f timing.Dockerfile .
docker run -dit --name timing --memory="15G" --env PGHOST="$PGHOST" timing:latest docker logs -f timing  Once it finishes, I see it takes on average 40 seconds to compute the block. So any prior belief that the divisor computation might be the bottleneck are now gone. Now let’s try the call to db.upsert, which builds the query and sends it over the network to the database container. Testing that in isolation—by computing the block first and then timing the upsert with this commit —shows it takes about 8 seconds to update an empty database with this block. Both of these experiments were run on EC2 in docker, so we’ve narrowed down the problem enough to say that it only happens when the system has been running for that two hour window. My best guess is something to do with having a large database, or something to do with container-to-container networking transfer rates (the docker overhead). The next step in narrowing down the problem is to set up the scenario and run performance profiling tools on the system in situ. I spin back up an r5.large. Then I run the deploy script and wait for two hours. It started slowing down after about 1h45m of runtime. At this point I stopped the divisorsearch container and ran the timing test container. And indeed, the upsert step has a strangely slow pattern (the numbers are in seconds): Running sample 0 55.63926291465759 Running sample 1 61.28182792663574 Running sample 2 245.36470413208008 # 4 minutes Running sample 3 1683.663686990738 # 28 minutes  At this point I killed the timing test. The initial few writes, while not great, are not as concerning as the massive degradation in performance over four writes. At this point I’m convinced that the database is the problem. So I read a lot about Postgres performance tuning (e.g. these docs and this blog post), which I had never done before. It appears that the way Postgres inserts work is by writing to a “write cache,” and then later a background process copies the content of the write cache to disk. This makes sense because disk writes are slow. But if the write cache is full, then Postgres will wait until the write cache gets emptied before it can continue. This seems likely here: the first few writes are fast, and then when the cache fills up later writes take eons. I also learned that the PRIMARY KEY part of our Postgres schema incurs a penalty to write performance, because in order to enforce the uniqueness Postgres needs to maintain a data structure and update it on every write. I can imagine that, because an mpz (unbounded integer) is our primary key, that data structure might be unprepared to handle such large keys. This might explain why these writes are taking so long in the first place, given that each write itself is relatively small (~2 MiB). In our case we can probably get by without a primary key on this table, and instead put any uniqueness constraints needed on the search metadata table. So let’s change one of these and see if it fixes the problem. First we can drop the primary key constraint, which we can do in situ by running the query alter table riemanndivisorsums drop constraint divisor_sum_pk;  This query takes a long time, which seems like a good sign. Perhaps it’s deleting a lot of data. Once that’s done, I removed the ON CONFLICT clause from the upsert command (it becomes just an insert command), and then rebuild and re-run the timing container. This shows that each insert takes only 3-4 seconds, which is much more reasonable. Running sample 0 3.8325071334838867 Running sample 1 3.7897982597351074 Running sample 2 3.7978150844573975 Running sample 3 3.810023784637451 Running sample 4 3.8057897090911865  I am still curious about the possible fix by increasing the cache size, but my guess is that without the primary key change, increasing the cache size would just delay the application from slowing instead of fixing the problem permanently, so for now I will not touch it. Perhaps a reader with more experience with Postgres tuning would know better. Updating the application So with this new understanding, how should we update the application? Will anything go wrong if we delete the primary key on the RiemannDivisorSums table? As suggested by our need to remove the ON CONFLICT clause, we might end up with duplicate rows. That is low risk, but scarier is if we stop the search mid-block and restart it, then if we’re really unlucky, we might get into a state where we skip a block and don’t notice, and that block contains the unique counterexample to the Riemann Hypothesis! We simply can’t let that happen. This is far too important. To be fair, this is a risk even before we remove the primary key. I’m just realizing it now. We’ll rearchitect the system to mitigate that in a future post, and it will double as providing a means to scale horizontally. For now, this pull request removes the primary key constraint and does a bit of superficial cleanup. Now let’s re-run it! Out of RAM again This time, it ran for three hours before the divisorsearch container hit the 15 GiB RAM limit and crashed. Restarting the container and watching docker stats showed that it plain ran out of RAM. This is a much easier problem to solve because it involves only one container, and the virtual machine doesn’t slow to a crawl when it occurs, the container just stops running. A quick python memory profile reveals the top 10 memory usages by line, the top being search_strategy.py:119 clocking in at about 2.2 GiB. [ Top 10 ] riemann/search_strategy.py:119: size=2245 MiB, count=9394253, average=251 B venv/lib/python3.7/site-packages/llvmlite/ir/values.py:224: size=29.6 MiB, count=9224, average=3367 B <string>:2: size=26.7 MiB, count=499846, average=56 B riemann/superabundant.py:67: size=19.1 MiB, count=500003, average=40 B riemann/superabundant.py:69: size=15.3 MiB, count=250000, average=64 B riemann/superabundant.py:70: size=13.4 MiB, count=250000, average=56 B venv/lib/python3.7/site-packages/llvmlite/ir/_utils.py:48: size=3491 KiB, count=6411, average=558 B venv/lib/python3.7/site-packages/llvmlite/ir/values.py:215: size=2150 KiB, count=19267, average=114 B riemann/search_strategy.py:109: size=2066 KiB, count=1, average=2066 KiB venv/lib/python3.7/site-packages/llvmlite/ir/_utils.py:58: size=1135 KiB, count=2018, average=576 B  This line, not surprisingly, is the computation of the full list of partitions of $n$. The length of this list grows superpolynomially in $n$, which explains why it takes up all the RAM. Since we only need to look at at most batch_size different partitions of $n$ at a given time, we can probably fix this by adding a layer of indirection so that new sections of the partition list are generated on the fly and old sections are forgotten once the search strategy is done with them. This pull request does that, and running the memory test again shows the 2.5 GiB line above reduced to 0.5 GiB. The pull request description also has some thoughts about the compute/memory tradeoff being made by this choice, and thoughts about how to improve the compute side. So let’s run it again!! This time it runs for about 19 hours before crashing, as shown by the EC2 CPU usage metrics. According to the SearchMetadata table, we got up to part way through $n=87$ divisor=# select * from searchmetadata order by start_time desc limit 3; start_time | end_time | search_state_type | starting_search_state | ending_search_state ----------------------------+----------------------------+-------------------------------+-----------------------+--------------------- 2021-01-27 00:00:11.59719 | 2021-01-27 00:01:14.709715 | SuperabundantEnumerationIndex | 87,15203332 | 87,15453332 2021-01-26 23:59:01.285508 | 2021-01-27 00:00:11.596075 | SuperabundantEnumerationIndex | 87,14953332 | 87,15203332 2021-01-26 23:57:59.809282 | 2021-01-26 23:59:01.284257 | SuperabundantEnumerationIndex | 87,14703332 | 87,14953332  Logging into the server and running df -h we can see the reason for the crash is that the disk is full. We filled up a 60 GiB SSD full of Riemann divisor sums. It’s quite an achievement. I’d like to thank my producers, the support staff, my mom, and the Academy. But seriously, this is the best case scenario. Our limiting factor now is just how much we want to pay for disk space (and/or, any tricks we come up with to reduce disk space). Analyzing the results, it seems we currently have 949 examples of numbers with witness values bigger than 1.767. Here are the top few:  n | divisor_sum | witness_value --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------- 533187564151227457465199401229454876347036513892234205802944360099435118364718466037392872608220305945979716166395732328054742493039981726997486787797703088097204529280000 | 5634790045188254963919691913741193557789227457256740288527114259589826494137632464425981545755572664137227875215269764405574488353138596297944567958732800000000000000000000 | 1.7689901658397056 1392134632248635659178066321747923959130643639405311242305337754828812319490126543178571468950966856255821713228187290673772173611070448373361584302343872292681996160000 | 14673932409344413968540864358701024890076113169939427834706026717681839828483417876109326942071803812857364258373098344806183563419631761192563979059200000000000000000000 | 1.7688977455109272 3673178449204843427910465228886342900080853929829317262019360830682882109472629401526573796704398037614305311947723722094385682351109362462695473093255599716839040000 | 38615611603537931496160169365002697079147666236682704828173754520215367969693204937129807742294220560150958574666048275805746219525346739980431523840000000000000000000 | 1.7688303488154073 2784269264497271318356132643495847918261287278810622484610675509657624638980253086357142937901933712511643426456374581347544347222140896746723168604687744585363992320000 | 29355033324995298095346770663840105972086801871471400577978104254473441181064775868425839681379597759477726740614478613571725301516068423098949435392000000000000000000000 | 1.768798862584946 9847663402693950208875241900499578820592101688550448423644399009873678577674609655567221975078815114247467324256631962719532660458738237165403413118647720420480000 | 103250298405181635016471041082894911976330658386852151946988648449773711148912312666122480594369573690243204745096385764186487217982210534707036160000000000000000000 | 1.7687597437473672  The best so far achieves 1.76899. Recall out last best number achieved a witness value of 1.7679, a modest improvement if we hope to get to 1.82 (or even the supposedly infinitely many examples with witness value bigger than 1.81!). What’s next? We’re not stopping until we disprove the Riemann hypothesis. So strap in, this is going to be a long series. Since the limiting factor in our search is currently storage space, I’d like to spend some time coming up with a means to avoid storing the witness value and divisor sum for every single number in our search strategy, but still maintain the ability to claim our result (no counterexamples below N) while providing the means for a skeptical person to verify our claims as much as they desire without also storing the entire search space on disk. Once we free ourselves from disk space constraints, the second limiting factor is that our search uses only a single CPU at a time. We should rearchitect the system so that we can scale horizontally, meaning we can increase the search rate by using more computers. We can do this by turning our system into a “worker” model, in which there is a continually growing queue of search blocks to process, and each worker machine asks for the next block to compute, computes it, and then stores the result in the database. There are some tricks required to ensure the workers don’t step on each other’s toes, but it’s a pretty standard computational model. There’s also the possibility of analyzing our existing database to try to identify what properties of the top $n$ values make their witness values so large. We could do this by re-computing the prime factorization of each of the numbers and squinting at it very hard. The benefit of doing this would ultimately be to design a new SearchStrategy that might find larger witness values faster than the superabundant enumeration. That approach also has the risk that, without a proof that the new search strategy is exhaustive, we might accidentally skip over the unique counterexample to the Riemann hypothesis! And then, of course, there’s the business of a front-end and plotting. But I’m having a bit too much fun with the back end stuff to worry about that for now. Protest in the comments if you prefer I work on visualization instead. Postscript: a false start Here’s one thing I tried during debugging that turned out to be a dead end: the linux perf tool. It is way more detailed than I can understand, but it provides a means to let me know if the program is stuck in the OS kernel or whether it’s the applications that are slow. I ran the perf tool when the deployed application was in the “good performance” regime (early in its run). I saw something like this: Samples: 59K of event 'cpu-clock:pppH', Event count (approx.): 119984000000 Overhead Command Shared Object Symbol 49.13% swapper [kernel.kallsyms] [k] native_safe_halt 8.04% python3 libpython3.7m.so.1.0 [.] _PyEval_EvalFrameDefault 1.35% python3 libpython3.7m.so.1.0 [.] _PyDict_LoadGlobal 0.86% python3 libc-2.28.so [.] realloc 0.86% python3 libpython3.7m.so.1.0 [.] 0x0000000000139736 0.77% python3 libpython3.7m.so.1.0 [.] PyTuple_New 0.74% python3 libpython3.7m.so.1.0 [.] PyType_IsSubtype  It says half the time is spent by something called “swapper” doing something called native_safe_halt, and this in kernel mode ([k]), i.e., run by the OS. The rest of the list is dominated by python3 and postgres doing stuff like _PyDict_LoadGlobal which I assume is useful Python work. When I look up native_safe_halt (finding an explanation is surprisingly hard), I learn that this indicates the system is doing nothing. So 49% of the time the CPU is idle. This fits with good behavior in our case, because each docker container gets one of the two CPUs on the host, and the postgres container is most often doing nothing while waiting for the search container to send it data. This also matches the CPU utilization on the EC2 dashboard. So everything appears in order. I ran perf again after the application started slowing down, and I saw that native_safe_halt is at 95%. This tells me nothing new, sadly. I also tried running it during the timing test and saw about 40% usage of some symbols like __softirqentry_text_start and __lock_text_start. Google failed to help me understand what that was, so it was a dead end. Searching for RH Counterexamples — Deploying with Docker We’re ironically searching for counterexamples to the Riemann Hypothesis. In this article we’ll deploy the application on a server, so that it can search for RH counterexamples even when I close my laptop. Servers and containers When deploying applications to servers, reproducibility is crucial. You don’t want your application to depend on the details of the computer it’s running on. This is a higher-level version of the same principle behind Python virtual environments, but it applies to collections of programs, possibly written in different languages and running on different computers. In our case, we have a postgres database, the pgmp extension, the populate_database program, and plans for a web server. The principle of an application not depending on the system it’s running on is called hermeticity (the noun form of hermetic, meaning air-tight). Hermeticity is good for the following reasons. When a server crashes, you don’t have to remember what you did to set it up. Instead you run a build/install that works on any machine. Newcomers also don’t have to guess what unknown aspects of a running server are sensitive. Another benefit is that you can test it on your local machine identically to how it will run in production. It also allows you to easily migrate from one cloud provider to another, which allows you to defer expensive commitments until you have more information about your application’s needs. Finally, if you have multiple applications running on the same server, you don’t want to have their needs conflict with each other, which can happen easily if two applications have dependencies that transitively depend on different versions of the same software. This is called “dependency hell.” In all of these, you protect yourself from becoming dependent on arbitrary choices you made before you knew better. One industry-strength approach to hermeticity is to use containers. A container is a virtual machine devoted to running a single program, with explicitly-defined exposure to the outside world. We will set up three containers: one to run the database, one for the search application, and (later) one for a web server. We’ll start by deploying them all on the same machine, but could also deploy them to different machines. Docker is a popular containerization system. Before going on, I should stress that Docker, while I like it, is not sacred by any means. In a decade Docker may disappear, but the principle of hermeticity and the need for reproducible deployments will persist. Docker allows you to describe your container by first starting from an existing (trusted) container, such as one that has an operating system and postgres already installed, and extend it for your application. This includes installing dependencies, fetching the application code from git, copying files into the container from the host system, exposing the container’s network ports, and launching the application. You save the commands that accomplish that in a Dockerfile with some special syntax. To deploy it, you copy the Dockerfile to the server (say, via git) and run docker commands to launch the container. You only have to get the Dockerfile right once, you can test it locally, and then it will work on any server just the same. The only caveat I’ve seen here is that if you migrate to a server with a different processor architecture, the install script (in our case, pip install numba) may fail to find a pre-compiled binary for the target architecture, and it may fall back to compiling from source, which can add additional requirements or force you to change which OS your container is derived from. This reduces our “set up a new server” script to just a few operations: (1) install docker (2) fetch the repository (3) launch the docker containers from their respective Dockerfiles. In my experience, writing a Dockerfile is no small task, but figuring out how to install stuff is awful in all cases, and doing it for Docker gives you an artifact tracing the steps, and a reasonable expectation of not having to do it again. Thankfully, you dear readers can skip my head-banging and see the Dockerfiles after I figured it out. The Postgres Dockerfile This commit adds a Dockerfile for the database, and makes some small changes to the project to allow it to run. It has only 15 lines, but it took me a few hours to figure out. The process was similar to installing confusing software on your own machine: try to install, see some error like "missing postgres.h“, go hunt around on the internet to figure out what you have to install to get past the error, and repeat. Let’s go through each line of the Dockerfile. FROM postgres:12  The first line defines the container image that this container starts from, which is officially maintained by the Postgres team. Looking at their Dockerfile, it starts from debian:buster-slim, which is a Debian Linux instance that is “slimmed” down to be suitable for docker containers, meaning it has few packages pre-installed. Most importantly, “Debian” tells us what package manager to use (apt-get) in our Dockerfile. It’s also worth noting at this point that, when docker builds the container, each command in a docker file results in a new image. An image is a serialized copy of all the data in a docker container, so that it can be started or extended easily. And if you change a line halfway through your Dockerfile, docker only has to rebuild images from that step onward. You can publish images on the web, and other docker users can use them as a base. This is like forking a project on Github, and is exactly what happens when Docker executes FROM postgres:12. ENV POSTGRES_USER docker ENV POSTGRES_PASSWORD docker ENV POSTGRES_DB divisor  These lines declare configuration for the database that the base postgres image will create when the container is started. The variable names are described in the “Environment Variables” section of the Postgres image’s documentation. The ENV command tells docker to instantiate environment variables (like the PATH variable in a terminal shell), that running programs can access. I’m insecurely showing the password and username here because the server the docker containers will run on won’t yet expose anything to the outside world. Later in this post you will see how to pass an environment variable from the docker command line when the container is run, and you would use something close to that to set configuration secrets securely. RUN apt-get update \ && apt-get install -y pgxnclient build-essential libgmp3-dev postgresql-server-dev-12 libmpc-dev  The RUN command allows you to run any shell command you’d like, in this case a command to update apt and install the dependencies needed to build the pgmp extension. This includes gcc and make via build-essential, and the gmp-specific libraries. RUN apt-get install -y python3.7 python3-setuptools python3-pip python-pip python3.7-dev \ && pip3 install wheel \ && pip install six  Next we do something a bit strange. We install python3.7 and pip (because we will need to pip3 install our project’s requirements.txt), but also python2’s pip. Here’s what’s going on. The pgmp postgres extension needs to be built from source, and it has a dependency on python2.7 and the python2-six library. So the first RUN line here installs all the python-related tools we need. RUN pgxn install pgmp  Then we install the pgmp extension. COPY . /divisor WORKDIR "/divisor"  These next two lines copy the current directory on the host machine to the container’s file system, and sets the working directory for all future commands to that directory. Note that whenever the contents of our project change, docker needs to rebuild the image from this step because any subsequent steps like pip install -r requirements.txt might have a different outcome. RUN python3 -m pip install --upgrade pip RUN pip3 install -r requirements.txt  Next we upgrade pip (which is oddly required for the numba dependency, though I can’t re-find the Github issue where I discovered this) and install the python dependencies for the project. The only reason this is required is because we included the database schema setup in the python script riemann/postgres_batabase.py. So this makes the container a bit more complicated than absolutely necessary. It can be improved later if need be. ENV PGUSER=docker ENV PGPASSWORD=docker ENV PGDATABASE=divisor  These next lines are environment variables used by the psycopg2 python library to infer how to connect to postgres if no database spec is passed in. It would be nice if this was shared with the postgres environment variables, but duplicating it is no problem. COPY setup_schema.sh /docker-entrypoint-initdb.d/  The last line copies a script to a special directory specified by the base postgres Dockerfile. The base dockerfile specifies that any scripts in this directory will be run when the container is started up. In our case, we just call the (idempotent) command to create the database. In a normal container we might specify a command to run when the container is started (our search container, defined next, will do this), but the postgres base image handles this for us by starting the postgres database and exposing the right ports. Finally we can build and run the container docker build -t divisordb -f divisordb.Dockerfile . # ... lots of output ... docker run -d -p 5432:5432 --name divisordb divisordb:latest  After the docker build command—which will take a while—you will be able to see the built images by running docker images, and the final image will have a special tag divisordb. The run command additionally tells docker to run the container as a daemon (a.k.a. in the background) with -d and to -p to publish port 5432 on the host machine and map it to 5432 on the container. This allows external programs and programs on other computers to talk to the container by hitting 0.0.0.0:5432. It also allows other containers to talk to this container, but as we’ll see shortly that requires a bit more work, because inside a container 0.0.0.0 means the container, not the host machine. Finally, one can run the following code on the host machine to check that the database is accepting connections. pg_isready --host 0.0.0.0 --username docker --port 5432 --dbname divisor  If you want to get into the database to run queries, you can run psql with the same flags as pg_isready, or manually enter the container with docker exec -it divisordb bash and run psql from there. psql --host 0.0.0.0 --username docker --port 5432 --dbname divisor Password for user docker: docker divisor=# \d List of relations Schema | Name | Type | Owner --------+--------------------+-------+-------- public | riemanndivisorsums | table | docker public | searchmetadata | table | docker (2 rows)  Look at that. You wanted to disprove the Riemann Hypothesis, and here you are running docker containers. The Search Container Next we’ll add a container for the main search application. Before we do this, it will help to make the main entry point to the program a little bit simpler. This commit modifies populate_database.py‘s main routine to use argparse and some sensible defaults. Now we can run the application with just python -m riemann.populate_database. Then the Dockerfile for the search part is defined in this commit. I’ve copied it below. It’s much simpler than the database, but somehow took just as long for me to build as the database Dockerfile, because I originally chose a base image called “alpine” that is (unknown to me at the time) really bad for Python if your dependencies have compiled C code, like numba does. FROM python:3.7-slim-buster RUN apt-get update \ && apt-get install -y build-essential libgmp3-dev libmpc-dev COPY . /divisor WORKDIR "/divisor" RUN pip3 install -r requirements.txt ENV PGUSER=docker ENV PGPASSWORD=docker ENV PGDATABASE=divisor ENTRYPOINT ["python3", "-m", "riemann.populate_database"]  The base image is again Debian, with Python3.7 pre-installed. Then we can build it and (almost) run it docker build -t divisorsearch -f divisorsearch.Dockerfile . docker run -d --name divisorsearch --env PGHOST="$PGHOST" divisorsearch:latest


What’s missing here is the PGHOST environment variable, which psycopg2 uses to find the database. The problem is, inside the container “localhost” and 0.0.0.0 are interpreted by the operating system to mean the container itself, not the host machine. To get around this problem, docker maintains IP addresses for each docker container, and uses those to route network requests between containers. The docker inspect command exposes information about this. Here’s a sample of the output

$docker inspect divisordb [ { "Id": "f731a78bde50be3de1d77ae1cff6d23c7fe21d4dbe6a82b31332c3ef3f6bbbb4", "Path": "docker-entrypoint.sh", "Args": [ "postgres" ], "State": { "Status": "running", "Running": true, "Paused": false, ... }, ... "NetworkSettings": { ... "Ports": { "5432/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "5432" } ] }, ... "IPAddress": "172.17.0.2", ... } } ]  The part that matters for us is the ip address, and the following extracts it to the environment variable PGHOST. export PGHOST=$(docker inspect -f "{{ .NetworkSettings.IPAddress }}" divisordb)


Once the two containers are running—see docker ps for the running containers, docker ps -a to see any containers that were killed due to an error, and docker logs to see the container’s logged output—you can check the database to see it’s being populated.

divisor=# select * from SearchMetadata order by start_time desc limit 10;
start_time         |          end_time          |       search_state_type       | starting_search_state | ending_search_state
----------------------------+----------------------------+-------------------------------+-----------------------+---------------------
2020-12-27 03:10:01.256996 | 2020-12-27 03:10:03.594773 | SuperabundantEnumerationIndex | 29,1541               | 31,1372
2020-12-27 03:09:59.160157 | 2020-12-27 03:10:01.253247 | SuperabundantEnumerationIndex | 26,705                | 29,1541
2020-12-27 03:09:52.035991 | 2020-12-27 03:09:59.156464 | SuperabundantEnumerationIndex | 1,0                   | 26,705


Ship it!

I have an AWS account, so let’s use Amazon for this. Rather than try the newfangled beanstalks or lightsails or whatever AWS-specific frameworks they’re trying to sell, for now I’ll provision a single Ubuntu EC2 server and run everything on it. I picked a t2.micro for testing (which is free). There’s a bit of setup to configure and launch the server—such as picking the server image, downloading an ssh key, and finding the IP address. I’ll skip those details since they are not (yet) relevant to the engineering process.

Once I have my server, I can ssh in, install docker, git clone the project, and run the deploy script.

# install docker, see get.docker.com
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu

# log out and log back in

git clone https://github.com/j2kun/riemann-divisor-sum && cd riemann-divisor-sum
bash deploy.sh


And it works!

Sadly, within an hour the divisorsearch container crashes because the instance runs out of RAM and CPU. Upgrading to a t2.medium (4 GiB RAM), it goes for about 2 hours before exhausting RAM. We could profile it and find the memory hotspots, but instead let’s apply a theorem due to billionaire mathematician Jim Simons: throw money at the problem. Upgrading to a r5.large (16 GiB RAM), and it runs comfortably all day.

Four days later, logging back into the VM and and I notice things are sluggish, even though the docker instance isn’t exhausting the total available RAM or CPU. docker stats also shows low CPU usage on divisorsearch. The database shows that it has only got up to 75 divisors, which is just as far as it got when I ran it (not in Docker) on my laptop for a few hours in the last article.

Something is amiss, and we’ll explore what happened next time.

Notes

A few notes on improvements that didn’t make it into this article.

In our deployment, we rebuild the docker containers each time, even when nothing changes. What one could do instead is store the built images in what’s called a container registry, and pull them instead of re-building them on every deploy. This would only save us a few minutes of waiting, but is generally good practice.

We could also use docker compose and a corresponding configuration file to coordinate launching a collection of containers that have dependencies on each other. For our case, the divisorsearch container depended on the divisordb container, and our startup script added a sleep 5 to ensure the latter was running before starting the former. docker compose would automatically handle that, as well as the configuration for naming, resource limits, etc. With only two containers it’s not that much more convenient, given that docker compose is an extra layer of indirection to learn that hides the lower-level commands.

In this article we deployed a single database container and a single “search” container. Most of the time the database container is sitting idle while the search container does its magic. If we wanted to scale up, an obvious way would be to have multiple workers. But it would require some decent feature work. A sketch: reorganize the SearchMetadata table so that it contains a state attribute, like “not started”, “started”, or “finished,” then add functionality so that a worker (atomically) asks for the oldest “not started” block and updates the row’s state to “started.” When a worker finishes a block, it updates the database and marks the block as finished. If no “not started” blocks are found, the worker proceeds to create some number of new “not started” blocks. There are details to be ironed out around race conditions between multiple workers, but Postgres is designed to make such things straightforward.

Finally, we could reduce the database size by keeping track of a summary of a search block instead of storing all the data in the block. For example, we could record the n and witness_value corresponding to the largest witness_value in a block, instead of saving every n and every witness_value. In order for this to be usable—i.e., for us to be able to say “we checked all possible $n < M$ and found no counterexamples”—we’d want to provide a means to verify the approach, say, by randomly verifying the claimed maxima of a random subset of blocks. However, doing this also precludes the ability to analyze patterns that look at all the data. So it’s a tradeoff.

Optimization Models for Subset Cover

In a recent newsletter article I complained about how researchers mislead about the applicability of their work. I gave SAT solvers as an example. People provided interesting examples in response, but what was new to me was the concept of SMT (Satisfiability Modulo Theories), an extension to SAT. SMT seems to have more practical uses than vanilla SAT (see the newsletter for details).

I wanted to take some time to explore SMT solvers, and I landed on Z3, an open-source SMT solver from Microsoft. In particular, I wanted to compare it to ILP (Integer Linear Programing) solvers, which I know relatively well. I picked a problem that I thought would work better for SAT-ish solvers than for ILPs: subset covering (explained in the next section). If ILP still wins against Z3, then that would be not so great for the claim that SMT is a production strength solver.

All the code used for this post is on Github.

Subset covering

A subset covering is a kind of combinatorial design, which can be explained in terms of magic rings.

An adventurer stumbles upon a chest full of magic rings. Each ring has a magical property, but some pairs of rings, when worn together on the same hand, produce a combined special magical effect distinct to that pair.

The adventurer would like to try all pairs of rings to catalogue the magical interactions. With only five fingers, how can we minimize the time spent trying on rings?

Mathematically, the rings can be described as a set $X$ of size $n$. We want to choose a family $F$ of subsets of $X$, with each subset having size 5 (five fingers), such that each subset of $X$ of size 2 (pairs of rings) is contained in some subset of $F$. And we want $F$ to be as small as possible.

Subset covering is not a “production worthy” problem. Rather, I could imagine it’s useful in some production settings, but I haven’t heard of one where it is actually used. I can imagine, for instance, that a cluster of machines has some bug occurring seemingly at random for some point-to-point RPCs, and in tracking down the problem, you want to deploy a test change to subsets of servers to observe the bug occurring. Something like an experiment design problem.

If you generalize the “5” in “5 fingers” to an arbitrary positive integer $k$, and the “2” in “2 rings” to $l < k$, then we have the general subset covering problem. Define $M(n, k, l)$ to be the minimal number of subsets of size $k$ needed to cover all subsets of size $l$. This problem was studied by Erdős, with a conjecture subsequently proved by Vojtěch Rödl, that asymptotically $M(n,k,l)$ grows like $\binom{n}{l} / \binom{k}{l}$. Additional work by Joel Spencer showed that a greedy algorithm is essentially optimal.

However, all of the constructive algorithms in these proofs involve enumerating all $\binom{n}{k}$ subsets of $X$. This wouldn’t scale very well. You can alternatively try a “random” method, incurring a typically $\log(r)$ factor of additional sets required to cover a $1 - 1/r$ fraction of the needed subsets. This is practical, but imperfect.

To the best of my knowledge, there is no exact algorithm, that both achieves the minimum and is efficient in avoiding constructing all $\binom{n}{k}$ subsets. So let’s try using an SMT solver. I’ll be using the Python library for Z3.

Baseline: brute force Z3

For a baseline, let’s start with a simple Z3 model that enumerates all the possible subsets that could be chosen. This leads to an exceedingly simple model to compare the complex models against.

Define boolean variables $\textup{Choice}_S$ which is true if and only if the subset $S$ is chosen (I call this a “choice set”). Define boolean variables $\textup{Hit}_T$ which is true if the subset $T$ (I call this a “hit set”) is contained in a chosen choice set. Then the subset cover problem can be defined by two sets of implications.

First, if $\textup{Choice}_S$ is true, then so must all $\textup{Hit}_T$ for $T \subset S$. E.g., for $S = \{ 1, 2, 3 \}$ and $l=2$, we get

\displaystyle \begin{aligned} \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(1,2)} \\ \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(1,3)} \\ \textup{Choice}_{(1,2,3)} &\implies \textup{Hit}_{(2,3)} \end{aligned}

In Python this looks like the following (note this program has some previously created lookups and data structures containing the variables)

for choice_set in choice_sets:
for hit_set_key in combinations(choice_set.elements, hit_set_size):
hit_set = hit_set_lookup[hit_set_key]
implications.append(
z3.Implies(choice_set.variable, hit_set.variable))


Second, if $\textup{Hit}_T$ is true, it must be that some $\textup{Choice}_S$ is true for some $S$ containing $T$ as a subset. For example,

\displaystyle \begin{aligned} \textup{Hit}_{(1,2)} &\implies \\ & \textup{Choice}_{(1,2,3,4)} \textup{ OR} \\ & \textup{Choice}_{(1,2,3,5)} \textup{ OR} \\ & \textup{Choice}_{(1,2,4,5)} \textup{ OR } \cdots \\ \end{aligned}

In code,

for hit_set in hit_sets.values():
relevant_choice_set_vars = [
choice_set.variable
for choice_set in hit_set_to_choice_set_lookup[hit_set]
]
implications.append(
z3.Implies(
hit_set.variable,
z3.Or(*relevant_choice_set_vars)))



Next, in this experiment we’re allowing the caller to specify the number of choice sets to try, and the solver should either return SAT or UNSAT. From that, we can use a binary search to find the optimal number of sets to pick. Thus, we have to limit the number of $\textup{Choice}_S$ that are allowed to be true and false. Z3 supports boolean cardinality constraints, apparently with a special solver to handle problems that have them. Otherwise, the process of encoding cardinality constraints as SAT formulas is not trivial (and the subject of active research). But the code is simple enough:

args = [cs.variable for cs in choice_sets] + [parameters.num_choice_sets]
choice_sets_at_most = z3.AtMost(*args)
choice_sets_at_least = z3.AtLeast(*args)


Finally, we must assert that every $\textup{Hit}_T$ is true.

solver = z3.Solver()
for hit_set in hit_sets.values():

for impl in implications:



Running it for $n=7, k=3, l=2$, and seven choice sets (which is optimal), we get

>>> SubsetCoverZ3BruteForce().solve(
SubsetCoverParameters(
num_elements=7,
choice_set_size=3,
hit_set_size=2,
num_choice_sets=7))
[(0, 1, 3), (0, 2, 4), (0, 5, 6), (1, 2, 6), (1, 4, 5), (2, 3, 5), (3, 4, 6)]
SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.018305063247680664)


Interestingly, Z3 refuses to solve marginally larger instances. For instance, I tried the following and Z3 times out around $n=12, k=6$ (about 8k choice sets):

from math import comb

for n in range(8, 16):
k = int(n / 2)
l = 3
max_num_sets = int(2 * comb(n, l) / comb(k, l))
params = SubsetCoverParameters(
num_elements=n,
choice_set_size=k,
hit_set_size=l,
num_choice_sets=max_num_sets)

print_table(
params,
SubsetCoverZ3BruteForce().solve(params),


After taking a long time to generate the larger models, Z3 exceeds my 15 minute time limit, suggesting exponential growth:

status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED   0.0271              8             4                3             28
SolveStatus.SOLVED   0.0346              9             4                3             42
SolveStatus.SOLVED   0.0735              10            5                3             24
SolveStatus.SOLVED   0.1725              11            5                3             33
SolveStatus.SOLVED   386.7376            12            6                3             22
SolveStatus.UNKNOWN  900.1419            13            6                3             28
SolveStatus.UNKNOWN  900.0160            14            7                3             20
SolveStatus.UNKNOWN  900.0794            15            7                3             26


An ILP model

Next we’ll see an ILP model for the sample problem. Note there are two reasons I expect the ILP model to fall short. First, the best solver I have access to is SCIP, which, despite being quite good is, in my experience, about an order of magnitude slower than commercial alternatives like Gurobi. Second, I think this sort of problem seems to not be very well suited to ILPs. It would take quite a bit longer to explain why (maybe another post, if you’re interested), but in short well-formed ILPs have easily found feasible solutions (this one does not), and the LP-relaxation of the problem should be as tight as possible. I don’t think my formulation is very tight, but it’s possible there is a better formulation.

Anyway, the primary difference in my ILP model from brute force is that the number of choice sets is fixed in advance, and the members of the choice sets are model variables. This allows us to avoid enumerating all choice sets in the model.

In particular, $\textup{Member}_{S,i} \in \{ 0, 1 \}$ is a binary variable that is 1 if and only if element $i$ is assigned to be in set $S$. And $\textup{IsHit}_{T, S} \in \{0, 1\}$ is 1 if and only if the hit set $T$ is a subset of $S$. Here “$S$” is an index over the subsets, rather than the set itself, because we don’t know what elements are in $S$ while building the model.

For the constraints, each choice set $S$ must have size $k$:

$\displaystyle \sum_{i \in X} \textup{Member}_{S, i} = k$

Each hit set $T$ must be hit by at least one choice set:

$\displaystyle \sum_{S} \textup{IsHit}_{T, S} \geq 1$

Now the tricky constraint. If a hit set $T$ is hit by a specific choice set $S$ (i.e., $\textup{IsHit}_{T, S} = 1$) then all the elements in $T$ must also be members of $S$.

$\displaystyle \sum_{i \in T} \textup{Member}_{S, i} \geq l \cdot \textup{IsHit}_{T, S}$

This one works by the fact that the left-hand side (LHS) is bounded from below by 0 and bounded from above by $l = |T|$. Then $\textup{IsHit}_{T, S}$ acts as a switch. If it is 0, then the constraint is vacuous since the LHS is always non-negative. If $\textup{IsHit}_{T, S} = 1$, then the right-hand side (RHS) is $l = |T|$ and this forces all variables on the LHS to be 1 to achieve it.

Because we fixed the number of choice sets as a parameter, the objective is 1, and all we’re doing is looking for a feasible solution. The full code is here.

On the same simple example as the brute force

>>> SubsetCoverILP().solve(
SubsetCoverParameters(
num_elements=7,
choice_set_size=3,
hit_set_size=2,
num_choice_sets=7))
[(0, 1, 3), (0, 2, 6), (0, 4, 5), (1, 2, 4), (1, 5, 6), (2, 3, 5), (3, 4, 6)]
SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.1065816879272461)


It finds the same solution in about 10x the runtime as the brute force Z3 model, though still well under one second.

On the “scaling” example, it fares much worse. With a timeout of 15 minutes, it solves $n=8,$ decently fast, $n=9,12$ slowly, and times out on the rest.

status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED   1.9969              8             4                3             28
SolveStatus.SOLVED   306.4089            9             4                3             42
SolveStatus.UNKNOWN  899.8842            10            5                3             24
SolveStatus.UNKNOWN  899.4849            11            5                3             33
SolveStatus.SOLVED   406.9502            12            6                3             22
SolveStatus.UNKNOWN  902.7807            13            6                3             28
SolveStatus.UNKNOWN  900.0826            14            7                3             20
SolveStatus.UNKNOWN  900.0731            15            7                3             26


A Z3 Boolean Cardinality Model

The next model uses Z3. It keeps the concept of Member and Hit variables, but they are boolean instead of integer. It also replaces the linear constraints with implications. The constraint that forces a Hit set’s variable to be true when some Choice set contains all its elements is (for each $S, T$)

$\displaystyle \left ( \bigwedge_{i \in T} \textup{Member}_{S, i} \right ) \implies \textup{IsHit}_T$

Conversely, A Hit set’s variable being true implies its members are in some choice set.

$\displaystyle \textup{IsHit}_T \implies \bigvee_{S} \bigwedge_{i \in T} \textup{Member}_{S, i}$

Finally, we again use boolean cardinality constraints AtMost and AtLeast so that each choice set has the right size.

The results are much better than the ILP: it solves all of the instances in under 3 seconds

status              solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED  0.0874              8             4                3             28
SolveStatus.SOLVED  0.1861              9             4                3             42
SolveStatus.SOLVED  0.1393              10            5                3             24
SolveStatus.SOLVED  0.2845              11            5                3             33
SolveStatus.SOLVED  0.2032              12            6                3             22
SolveStatus.SOLVED  1.3661              13            6                3             28
SolveStatus.SOLVED  0.8639              14            7                3             20
SolveStatus.SOLVED  2.4877              15            7                3             26


A Z3 integer model

Z3 supports implications on integer equation equalities, so we can try a model that leverages this by essentially converting the boolean model to one where the variables are 0-1 integers, and the constraints are implications on equality of integer formulas (all of the form “variable = 1”).

I expect this to perform worse than the boolean model, even though the formulation is almost identical. The details of the model are here, and it’s so similar to the boolean model above that it needs no extra explanation.

The runtime is much worse, but surprisingly it still does better than the ILP model.

status              solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SolveStatus.SOLVED  2.1129              8             4                3             28
SolveStatus.SOLVED  14.8728             9             4                3             42
SolveStatus.SOLVED  7.6247              10            5                3             24
SolveStatus.SOLVED  25.0607             11            5                3             33
SolveStatus.SOLVED  30.5626             12            6                3             22
SolveStatus.SOLVED  63.2780             13            6                3             28
SolveStatus.SOLVED  57.0777             14            7                3             20
SolveStatus.SOLVED  394.5060            15            7                3             26


Harder instances

So far all the instances we’ve been giving the solvers are “easy” in a sense. In particular, we’ve guaranteed there’s a feasible solution, and it’s easy to find. We’re giving roughly twice as many sets as are needed. There are two ways to make this problem harder. One is to test on unsatisfiable instances, which can be harder because the solver has to prove it can’t work. Another is to test on satisfiable instances that are hard to find, such as those satisfiable instances where the true optimal number of choice sets is given as the input parameter. The hardest unsatisfiable instances are also the ones where the number of choice sets allowed is one less than optimal.

Let’s test those situations. Since $M(7, 3, 2) = 7$, we can try with 7 choice sets and 6 choice sets.

For 7 choice sets (the optimal value), all the solvers do relatively well

method                    status  solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SubsetCoverILP            SOLVED  0.0843              7             3                2             7
SubsetCoverZ3Integer      SOLVED  0.0938              7             3                2             7
SubsetCoverZ3BruteForce   SOLVED  0.0197              7             3                2             7
SubsetCoverZ3Cardinality  SOLVED  0.0208              7             3                2             7


For 6, the ILP struggles to prove it’s infeasible, and the others do comparatively much better (at least 17x better).

method                    status      solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
SubsetCoverILP            INFEASIBLE  120.8593            7             3                2             6
SubsetCoverZ3Integer      INFEASIBLE  3.0792              7             3                2             6
SubsetCoverZ3BruteForce   INFEASIBLE  0.3384              7             3                2             6
SubsetCoverZ3Cardinality  INFEASIBLE  7.5781              7             3                2             6


This seems like hard evidence that Z3 is better than ILPs for this problem (and it is), but note that the same test on $n=8$ fails for all models. They can all quickly prove $8 < M(8, 3, 2) \leq 11$, but time out after twenty minutes when trying to determine if $M(8, 3, 2) \in \{ 9, 10 \}$. Note that $k=3, l=2$ is the least complex choice for the other parameters, so it seems like there’s not much hope to find $M(n, k, l)$ for any seriously large parameters, like, say, $k=6$.

Thoughts

These experiments suggest what SMT solvers can offer above and beyond ILP solvers. Disjunctions and implications are notoriously hard to model in an ILP. You often need to add additional special variables, or do tricks like the one I did that only work in some situations and which can mess with the efficiency of the solver. With SMT, implications are trivial to model, and natively supported by the solver.

Aside from reading everything I could find on Z3, there seems to be little advice on modeling to help the solver run faster. There is a ton of literature for this in ILP solvers, but if any readers see obvious problems with my SMT models, please chime in! I’d love to hear from you. Even without that, I am pretty impressed by how fast the solves finish for this subset cover problem (which this experiment has shown me is apparently a very hard problem).

However, there’s an elephant in the room. These models are all satisfiability/feasibility checks on a given solution. What is not tested here is optimization, in the sense of having the number of choice sets used be minimized directly by the solver. In a few experiments on even simpler models, z3 optimization is quite slow. And while I know how I’d model the ILP version of the optimization problem, given that it’s quite slow to find a feasible instance when the optimal number of sets is given as a parameter, it seems unlikely that it will be fast when asked to optimize. I will have to try that another time to be sure.

Also, I’d like to test the ILP models on Gurobi, but I don’t have a personal license. There’s also the possibility that I can come up with a much better ILP formulation, say, with a tighter LP relaxation. But these will have to wait for another time.

In the end, this experiment has given me some more food for thought, and concrete first-hand experience, on the use of SMT solvers.