I wanted to take some time to explore SMT solvers, and I landed on Z3, an open-source SMT solver from Microsoft. In particular, I wanted to compare it to ILP (Integer Linear Programming) solvers, which I know relatively well. I picked a problem that I thought would work better for SAT-ish solvers than for ILPs: *subset covering* (explained in the next section). If ILP still wins against Z3, then that would be not so great for the claim that SMT is a production-strength solver.

All the code used for this post is on GitHub.

A subset covering is a kind of combinatorial design, which can be explained in terms of magic rings.

An adventurer stumbles upon a chest full of magic rings. Each ring has a magical property, but some pairs of rings, when worn together on the same hand, produce a combined special magical effect distinct to that pair.

The adventurer would like to try all pairs of rings to catalogue the magical interactions. With only five fingers, how can we minimize the time spent trying on rings?

Mathematically, the rings can be described as a set $X$ of size $n$. We want to choose a family $F$ of subsets of $X$, with each subset having size 5 (five fingers), such that each subset of $X$ of size 2 (pairs of rings) is contained in some subset of $F$. And we want $F$ to be as small as possible.
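To make the covering condition concrete, here is a minimal sketch of a checker. The function name `uncovered_subsets` and the representation of rings as the integers `0..num_elements-1` are my own assumptions for illustration, not code from the repository.

```python
from itertools import combinations


def uncovered_subsets(family, num_elements, hit_size):
    """List the size-`hit_size` subsets of range(num_elements) that are
    not contained in any member of `family`."""
    covered = set()
    for chosen in family:
        # Sorting makes the generated tuples comparable to the ones below.
        covered.update(combinations(sorted(chosen), hit_size))
    return [
        subset
        for subset in combinations(range(num_elements), hit_size)
        if subset not in covered
    ]


# Ten rings, five fingers: two disjoint handfuls leave many pairs untried.
two_hands = [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9)]
missing = uncovered_subsets(two_hands, num_elements=10, hit_size=2)
```

A family is a valid cover exactly when this function returns an empty list.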

Subset covering is not a “production worthy” problem. Rather, I could *imagine* it’s useful in some production settings, but I haven’t heard of one where it is actually used. I can imagine, for instance, that a cluster of machines has some bug occurring seemingly at random for some point-to-point RPCs, and in tracking down the problem, you want to deploy a test change to subsets of servers to observe the bug occurring. Something like an experiment design problem.

If you generalize the “5” in “5 fingers” to an arbitrary positive integer $k$, and the “2” in “2 rings” to $l < k$, then we have the general subset covering problem. Define $M(n, k, l)$ to be the minimal number of subsets of size $k$ needed to cover all subsets of size $l$. This problem was studied by Erdős, with a conjecture subsequently proved by Vojtěch Rödl, that asymptotically $M(n, k, l)$ grows like $\binom{n}{l} / \binom{k}{l}$. Additional work by Joel Spencer showed that a greedy algorithm is essentially optimal.
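The asymptotic growth rate has a simple counting interpretation: the number of small subsets to cover, divided by how many of them one chosen set can cover. A minimal sketch (the function name is hypothetical) that turns this into a lower bound on the number of sets needed:

```python
from math import comb


def counting_lower_bound(n, k, l):
    """Each size-k set contains comb(k, l) subsets of size l, and there are
    comb(n, l) of those to cover, so at least this many sets are needed."""
    total_to_cover = comb(n, l)
    covered_per_set = comb(k, l)
    # Integer ceiling division, avoiding floating point.
    return -(-total_to_cover // covered_per_set)


rings_bound = counting_lower_bound(10, 5, 2)
```

This is only a lower bound. The true minimum can be strictly larger, because chosen sets usually overlap in the small subsets they cover.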

However, all of the constructive algorithms in these proofs involve enumerating all subsets of . This wouldn’t scale very well. You can alternatively try a “random” method, incurring a typically factor of additional sets required to cover a fraction of the needed subsets. This is practical, but imperfect.
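To see why enumerating all subsets scales badly, here is a minimal sketch of a plain greedy cover. It is my own illustration in the spirit of the greedy approach mentioned above, not the algorithm analyzed in the cited proofs, and `greedy_cover` is a hypothetical name.

```python
from itertools import combinations


def greedy_cover(n, k, l):
    """Repeatedly pick the size-k set that covers the most still-uncovered
    size-l subsets. Enumerates all comb(n, k) candidates, so this is only
    usable for small parameters."""
    uncovered = set(combinations(range(n), l))
    candidates = list(combinations(range(n), k))
    family = []
    while uncovered:
        best = max(
            candidates,
            key=lambda c: sum(1 for t in combinations(c, l) if t in uncovered),
        )
        family.append(best)
        uncovered.difference_update(combinations(best, l))
    return family


family = greedy_cover(7, 3, 2)
```

The candidate list alone has $\binom{n}{k}$ entries, and every greedy step rescans all of them.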

To the best of my knowledge, there is no exact algorithm that both achieves the minimum and avoids constructing all subsets. So let’s try using an SMT solver. I’ll be using the Python library for Z3.

For a baseline, let’s start with a simple Z3 model that enumerates all the possible subsets that could be chosen. This leads to an exceedingly simple model to compare the complex models against.

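Define boolean variables $\textup{Choice}_S$, each of which is true if and only if the subset $S$ is chosen (I call this a “choice set”). Define boolean variables $\textup{Hit}_T$, each of which is true if the subset $T$ (I call this a “hit set”) is contained in a chosen choice set. Then the subset cover problem can be defined by two sets of implications.

First, if $\textup{Choice}_S$ is true, then so must all $\textup{Hit}_T$ for $T \subset S$. E.g., for $S = \{1, 2, 3\}$ and $l = 2$, we get

$$\textup{Choice}_{\{1,2,3\}} \implies \textup{Hit}_{\{1,2\}}, \qquad \textup{Choice}_{\{1,2,3\}} \implies \textup{Hit}_{\{1,3\}}, \qquad \textup{Choice}_{\{1,2,3\}} \implies \textup{Hit}_{\{2,3\}}$$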

In Python this looks like the following (note this program has some previously created lookups and data structures containing the variables)

    for choice_set in choice_sets:
        for hit_set_key in combinations(choice_set.elements, hit_set_size):
            hit_set = hit_set_lookup[hit_set_key]
            implications.append(
                z3.Implies(choice_set.variable, hit_set.variable))

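Second, if $\textup{Hit}_T$ is true, it must be that some $\textup{Choice}_S$ is true for some $S$ containing $T$ as a subset. For example,

$$\textup{Hit}_{\{1,2\}} \implies \textup{Choice}_{\{1,2,3\}} \lor \textup{Choice}_{\{1,2,4\}} \lor \cdots$$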

In code,

    for hit_set in hit_sets.values():
        relevant_choice_set_vars = [
            choice_set.variable
            for choice_set in hit_set_to_choice_set_lookup[hit_set]
        ]
        implications.append(
            z3.Implies(
                hit_set.variable,
                z3.Or(*relevant_choice_set_vars)))

Next, in this experiment we’re allowing the caller to specify the number of choice sets to try, and the solver should either return SAT or UNSAT. From that, we can use a binary search to find the optimal number of sets to pick. Thus, we have to limit the number of $\textup{Choice}_S$ variables that are allowed to be true and false. Z3 supports boolean cardinality constraints, apparently with a special solver to handle problems that have them. Otherwise, the process of encoding cardinality constraints as SAT formulas is not trivial (and the subject of active research). But the code is simple enough:

    args = [cs.variable for cs in choice_sets] + [parameters.num_choice_sets]
    choice_sets_at_most = z3.AtMost(*args)
    choice_sets_at_least = z3.AtLeast(*args)

Finally, we must assert that every $\textup{Hit}_T$ is true.

    solver = z3.Solver()
    for hit_set in hit_sets.values():
        solver.add(hit_set.variable)

    for impl in implications:
        solver.add(impl)

    solver.add(choice_sets_at_most)
    solver.add(choice_sets_at_least)
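The binary search over the number of choice sets is not shown in this post. A minimal sketch of it follows, assuming a hypothetical oracle `is_feasible(m)` that builds the model with `num_choice_sets=m` and reports whether the solver returned SAT. It also assumes feasibility is monotone in `m` and that the oracle never times out, neither of which is handled here.

```python
def smallest_feasible(lo, hi, is_feasible):
    """Return the smallest m in [lo, hi] for which is_feasible(m) is True.

    Assumes is_feasible(hi) is True and that feasibility is monotone:
    once some m is feasible, every larger m is feasible too.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid       # mid works, so the optimum is mid or smaller
        else:
            lo = mid + 1   # mid fails, so the optimum is larger
    return lo


# Stand-in oracle for illustration only; a real one would call the solver.
calls = []

def fake_oracle(m):
    calls.append(m)
    return m >= 7

optimum = smallest_feasible(1, 35, fake_oracle)
```

Each oracle call is a full solve, so the number of solver invocations is logarithmic in the width of the search range.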

Running it for $n = 7, k = 3, l = 2$, and seven choice sets (which is optimal), we get

    >>> SubsetCoverZ3BruteForce().solve(
        SubsetCoverParameters(
            num_elements=7,
            choice_set_size=3,
            hit_set_size=2,
            num_choice_sets=7))
    [(0, 1, 3), (0, 2, 4), (0, 5, 6), (1, 2, 6), (1, 4, 5), (2, 3, 5), (3, 4, 6)]
    SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.018305063247680664)

Interestingly, Z3 refuses to solve marginally larger instances. For instance, I tried the following and Z3 times out around (about 8k choice sets):

    from math import comb

    for n in range(8, 16):
        k = int(n / 2)
        l = 3
        max_num_sets = int(2 * comb(n, l) / comb(k, l))
        params = SubsetCoverParameters(
            num_elements=n,
            choice_set_size=k,
            hit_set_size=l,
            num_choice_sets=max_num_sets)

        print_table(
            params,
            SubsetCoverZ3BruteForce().solve(params),
            header=(n==8))

After taking a long time to generate the larger models, Z3 exceeds my 15 minute time limit, suggesting exponential growth:

    status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
    SolveStatus.SOLVED   0.0271              8             4                3             28
    SolveStatus.SOLVED   0.0346              9             4                3             42
    SolveStatus.SOLVED   0.0735              10            5                3             24
    SolveStatus.SOLVED   0.1725              11            5                3             33
    SolveStatus.SOLVED   386.7376            12            6                3             22
    SolveStatus.UNKNOWN  900.1419            13            6                3             28
    SolveStatus.UNKNOWN  900.0160            14            7                3             20
    SolveStatus.UNKNOWN  900.0794            15            7                3             26

Next we’ll see an ILP model for the same problem. Note there are two reasons I expect the ILP model to fall short. First, the best solver I have access to is SCIP, which, despite being quite good, is, in my experience, about an order of magnitude slower than commercial alternatives like Gurobi. Second, I think this sort of problem is not very well suited to ILPs. It would take quite a bit longer to explain why (maybe another post, if you’re interested), but in short, well-formed ILPs have easily found feasible solutions (this one does not), and the LP relaxation of the problem should be as tight as possible. I don’t think my formulation is very tight, but it’s possible there is a better formulation.

Anyway, the primary difference in my ILP model from brute force is that the number of choice sets is fixed in advance, and the members of the choice sets are model variables. This allows us to avoid enumerating all choice sets in the model.

In particular, $\textup{Member}_{S,i} \in \{0, 1\}$ is a binary variable that is 1 if and only if element $i$ is assigned to be in set $S$. And $\textup{Hit}_{T,S} \in \{0, 1\}$ is 1 if and only if the hit set $T$ is a subset of $S$. Here “$S$” is an index over the subsets, rather than the set itself, because we don’t know what elements are in $S$ while building the model.

For the constraints, each choice set $S$ must have size $k$:

$$\sum_{i} \textup{Member}_{S,i} = k$$

Each hit set $T$ must be hit by at least one choice set:

$$\sum_{S} \textup{Hit}_{T,S} \geq 1$$

Now the tricky constraint. If a hit set $T$ is hit by a specific choice set $S$ (i.e., $\textup{Hit}_{T,S} = 1$) then all the elements in $T$ must also be members of $S$.

$$\sum_{i \in T} \textup{Member}_{S,i} \geq l \cdot \textup{Hit}_{T,S}$$

This one works by the fact that the left-hand side (LHS) is bounded from below by 0 and bounded from above by $l = |T|$. Then $\textup{Hit}_{T,S}$ acts as a switch. If it is 0, then the constraint is vacuous since the LHS is always non-negative. If $\textup{Hit}_{T,S} = 1$, then the right-hand side (RHS) is $l$ and this forces all variables on the LHS to be 1 to achieve it.
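The switch argument can be checked exhaustively for a small hit set. The sketch below (helper names are mine) compares the linear constraint "sum of member variables is at least the hit set size times the hit variable" against the logical implication it is meant to encode, over every 0/1 assignment.

```python
from itertools import product


def linear_constraint_holds(members, hit):
    """The ILP form: sum of member variables >= |T| * hit."""
    return sum(members) >= len(members) * hit


def implication_holds(members, hit):
    """The logical form: hit = 1 implies every member variable is 1."""
    return (not hit) or all(members)


hit_set_size = 3
mismatches = [
    (members, hit)
    for members in product((0, 1), repeat=hit_set_size)
    for hit in (0, 1)
    if linear_constraint_holds(members, hit) != implication_holds(members, hit)
]
```

An empty `mismatches` list means the two forms agree on all sixteen assignments.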

Because we fixed the number of choice sets as a parameter, the objective is 1, and all we’re doing is looking for a feasible solution. The full code is here.

On the same simple example as the brute force

    >>> SubsetCoverILP().solve(
        SubsetCoverParameters(
            num_elements=7,
            choice_set_size=3,
            hit_set_size=2,
            num_choice_sets=7))
    [(0, 1, 3), (0, 2, 6), (0, 4, 5), (1, 2, 4), (1, 5, 6), (2, 3, 5), (3, 4, 6)]
    SubsetCoverSolution(status=<SolveStatus.SOLVED: 1>, solve_time_seconds=0.1065816879272461)

It finds a cover of the same size (though not the identical one) in about 6x the runtime of the brute force Z3 model, though still well under one second.

On the “scaling” example, it fares much worse. With a timeout of 15 minutes, it solves $n = 8$ decently fast, $n = 9, 12$ slowly, and times out on the rest.

    status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
    SolveStatus.SOLVED   1.9969              8             4                3             28
    SolveStatus.SOLVED   306.4089            9             4                3             42
    SolveStatus.UNKNOWN  899.8842            10            5                3             24
    SolveStatus.UNKNOWN  899.4849            11            5                3             33
    SolveStatus.SOLVED   406.9502            12            6                3             22
    SolveStatus.UNKNOWN  902.7807            13            6                3             28
    SolveStatus.UNKNOWN  900.0826            14            7                3             20
    SolveStatus.UNKNOWN  900.0731            15            7                3             26

The next model uses Z3. It keeps the concept of Member and Hit variables, but they are boolean instead of integer. It also replaces the linear constraints with implications. The constraint that forces a Hit set’s variable to be true when some Choice set contains all its elements is (for each $S, T$)

$$\bigwedge_{i \in T} \textup{Member}_{S,i} \implies \textup{Hit}_T$$

Conversely, a Hit set’s variable being true implies its members are in some choice set.

$$\textup{Hit}_T \implies \bigvee_{S} \bigwedge_{i \in T} \textup{Member}_{S,i}$$

Finally, we again use boolean cardinality constraints AtMost and AtLeast so that each choice set has the right size.

The results are much better than the ILP: it solves all of the instances in under 3 seconds.

    status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
    SolveStatus.SOLVED   0.0874              8             4                3             28
    SolveStatus.SOLVED   0.1861              9             4                3             42
    SolveStatus.SOLVED   0.1393              10            5                3             24
    SolveStatus.SOLVED   0.2845              11            5                3             33
    SolveStatus.SOLVED   0.2032              12            6                3             22
    SolveStatus.SOLVED   1.3661              13            6                3             28
    SolveStatus.SOLVED   0.8639              14            7                3             20
    SolveStatus.SOLVED   2.4877              15            7                3             26

Z3 supports implications between integer equalities, so we can try a model that leverages this by essentially converting the boolean model to one where the variables are 0-1 integers, and the constraints are implications on equality of integer formulas (all of the form “variable = 1”).

I expect this to perform worse than the boolean model, even though the formulation is almost identical. The details of the model are here, and it’s so similar to the boolean model above that it needs no extra explanation.

The runtime is much worse, but surprisingly it still does better than the ILP model.

    status               solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
    SolveStatus.SOLVED   2.1129              8             4                3             28
    SolveStatus.SOLVED   14.8728             9             4                3             42
    SolveStatus.SOLVED   7.6247              10            5                3             24
    SolveStatus.SOLVED   25.0607             11            5                3             33
    SolveStatus.SOLVED   30.5626             12            6                3             22
    SolveStatus.SOLVED   63.2780             13            6                3             28
    SolveStatus.SOLVED   57.0777             14            7                3             20
    SolveStatus.SOLVED   394.5060            15            7                3             26

So far all the instances we’ve been giving the solvers are “easy” in a sense. In particular, we’ve guaranteed there’s a feasible solution, and it’s easy to find. We’re giving roughly twice as many sets as are needed. There are two ways to make this problem harder. One is to test on *unsatisfiable* instances, which can be harder because the solver has to prove it can’t work. Another is to test on satisfiable instances that are *hard* to find, such as those satisfiable instances where the true optimal number of choice sets is given as the input parameter. The hardest unsatisfiable instances are also the ones where the number of choice sets allowed is one less than optimal.

Let’s test those situations. Since $M(7, 3, 2) = 7$ (the minimum for $n = 7$, $k = 3$, $l = 2$), we can try with 7 choice sets and 6 choice sets.

For 7 choice sets (the optimal value), all the solvers do relatively well

    method                    status      solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
    SubsetCoverILP            SOLVED      0.0843              7             3                2             7
    SubsetCoverZ3Integer      SOLVED      0.0938              7             3                2             7
    SubsetCoverZ3BruteForce   SOLVED      0.0197              7             3                2             7
    SubsetCoverZ3Cardinality  SOLVED      0.0208              7             3                2             7

For 6, the ILP struggles to prove it’s infeasible, and the others do comparatively much better (at least 15x faster).

    method                    status      solve_time_seconds  num_elements  choice_set_size  hit_set_size  num_choice_sets
    SubsetCoverILP            INFEASIBLE  120.8593            7             3                2             6
    SubsetCoverZ3Integer      INFEASIBLE  3.0792              7             3                2             6
    SubsetCoverZ3BruteForce   INFEASIBLE  0.3384              7             3                2             6
    SubsetCoverZ3Cardinality  INFEASIBLE  7.5781              7             3                2             6

This seems like hard evidence that Z3 is better than ILPs for this problem (and it is), but note that the same test on fails for all models. They can all quickly prove , but time out after twenty minutes when trying to determine if . Note that is the least complex choice for the other parameters, so it seems like there’s not much hope to find for any seriously large parameters, like, say, .

These experiments suggest what SMT solvers can offer above and beyond ILP solvers. Disjunctions and implications are notoriously hard to model in an ILP. You often need to add additional special variables, or do tricks like the one I did that only work in some situations and which can mess with the efficiency of the solver. With SMT, implications are trivial to model, and natively supported by the solver.

Aside from reading everything I could find on Z3, there seems to be little advice on modeling to help the solver run faster. There is a ton of literature for this in ILP solvers, but if any readers see obvious problems with my SMT models, please chime in! I’d love to hear from you. Even without that, I am pretty impressed by how fast the solves finish for this subset cover problem (which this experiment has shown me is apparently a very hard problem).

However, there’s an elephant in the room. These models are all satisfiability/feasibility checks on a given solution. What is not tested here is optimization, in the sense of having the number of choice sets used be minimized directly by the solver. In a few experiments on even simpler models, Z3 optimization is quite slow. And while I know how I’d model the ILP version of the optimization problem, given that it’s quite slow to find a feasible instance when the optimal number of sets is given as a parameter, it seems unlikely that it will be fast when asked to optimize. I will have to try that another time to be sure.

Also, I’d like to test the ILP models on Gurobi, but I don’t have a personal license. There’s also the possibility that I can come up with a much better ILP formulation, say, with a tighter LP relaxation. But these will have to wait for another time.

In the end, this experiment has given me some more food for thought, and concrete first-hand experience, on the use of SMT solvers.

In the last article, we improved our naive search from “try all positive integers” to enumerate a subset of integers (superabundant numbers), which RH counterexamples are guaranteed to be among. These numbers grow large, fast, and we quickly reached the limit of what 64-bit integers can store.

Unbounded integer arithmetic is possible on computers, but it requires a special software implementation. In brief, you represent numbers in base-N for some large N (say, $2^{32}$), and then use a 32-bit integer for each digit. Arithmetic on such quantities emulates a ripple-carry adder, which naturally requires linear time in the number of digits of each operand. Artem Golubin has a nice explanation of how Python does it internally.
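As a rough illustration of the idea (not how CPython actually lays out its integers), here is a minimal sketch of base-$2^{32}$ addition with a ripple carry. The little-endian list-of-digits representation and all function names are assumptions made for this example.

```python
BASE = 2 ** 32


def to_limbs(value):
    """Split a non-negative int into little-endian base-2^32 digits."""
    limbs = []
    while value:
        value, digit = divmod(value, BASE)
        limbs.append(digit)
    return limbs or [0]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(digit * BASE ** i for i, digit in enumerate(limbs))


def add_limbs(a, b):
    """Add two digit lists, propagating the carry from low to high digits.
    Runs in time linear in the number of digits."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        total = carry
        total += a[i] if i < len(a) else 0
        total += b[i] if i < len(b) else 0
        carry, digit = divmod(total, BASE)
        result.append(digit)
    if carry:
        result.append(carry)
    return result


x, y = 2 ** 513, 2 ** 64 - 1
total = from_limbs(add_limbs(to_limbs(x), to_limbs(y)))
```

Each digit fits in 32 bits, and the intermediate `total` never exceeds 33 bits, which is why fixed-width machine arithmetic suffices per digit.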

So Python can handle unbounded integer arithmetic, but neither numba nor our database engine does. Both crash when values exceed 64-bit integers. This is a problem because we won’t be able to store the results of our search without being able to put them in a database. This leaves us with a classic software engineering problem. What’s the path forward?

The impulse answer is to do as little as possible to make the damn thing work. In a situation where the software you’re writing is a prototype, and you expect it to be rewritten from scratch in the future, this is an acceptable attitude. That said, experienced engineers would caution you that, all too often, such “prototypes” are copy-pasted to become janky mission-critical systems for years.

In pretending this is the “real thing,” let’s do what real engineers would do and scope out some alternatives before diving in. The two aspects are our database and the use of numba for performance.

Let’s start with the database. A quick and dirty option: store all numbers as text strings in the database. There’s no limit on the size of the number in that case. The benefit: we don’t need to use a different database engine, and most of our code stays the same. The cost: we can’t use numeric operations in database queries, which would make further analysis and fetching awkward. In particular, we can’t even apply sorting operations, since text strings are sorted lexicographically (e.g., 100, 25) while numbers are sorted by magnitude (25, 100). Note, we applied this “numbers as text” idea to the problem of serializing the search state, and it was hacky there, too.
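The sorting problem with numbers stored as text is easy to reproduce in plain Python. This small sketch also shows the usual workaround, zero-padding to a fixed width, which only helps if you know an upper bound on the number of digits in advance.

```python
values = [100, 25, 3]
as_text = [str(v) for v in values]

# Text comparison goes character by character, so "100" sorts before "25".
lexicographic = sorted(as_text)

# Converting back to integers restores the order by magnitude.
by_magnitude = sorted(as_text, key=int)

# Zero-padding makes text order agree with numeric order, but requires
# fixing a maximum width up front.
width = max(len(t) for t in as_text)
padded = sorted(t.zfill(width) for t in as_text)
```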

A second option is to find a database engine with direct support for unbounded-integer arithmetic. The benefit: fast database queries and the confidence that it will support future use cases well. The cost: if our existing sqlite-based interface doesn’t work with the new database engine, we’d have to write another implementation of our database interface.

For numba, we have at least three options. First, fall back to native python arithmetic, which is slow. Second, implement arbitrary-precision arithmetic in Python in a way that numba can compile it. Third, find (or implement) a C-implementation of arbitrary precision integer arithmetic, provide Python bindings, and optionally see if it can work with (or replace) numba. As I write this I haven’t yet tried any of these options. My intuition tells me the best way to go would be to find “proper” support for arbitrary precision integers.

For the database, I recall that the Postgres database engine supports various extensions, for example this extension that adds support for geographic objects. Postgres’s extension framework demonstrates an important software engineering principle that many of the best projects follow: “closed for modification, open for extension.” That is, Postgres is designed so that others can contribute new features to Postgres without requiring the Postgres team to do anything special; specifically, they don’t have to change Postgres to accommodate it. This idea goes by various names: extensions, plug-ins, hooks, or (at a lower level) callbacks. GitHub Actions is a good example of this.

Geographic objects are almost certainly more complicated than arbitrary precision integers, so chances are good a Postgres extension exists for the latter. Incorporating it would involve migrating to Postgres, finding and installing that extension, and then converting the C library representation above to whatever representation Postgres accepts in a query.

A good route will also ensure that we need not change our tests too much, since all we’re doing here is modifying implementations. We’ll see how well that holds up.

After some digging, I found GMP (GNU Multiple Precision), a C library written by Torbjörn Granlund. It has a Python bindings library called gmpy that allows Python to use an “mpz” (“Multiple Precision $\mathbb{Z}$”) type as a drop-in replacement for Python integers. And I found a PostgreSQL extension called pgmp. The standard Python library for Postgres is psycopg2, which was written by the same person who wrote pgmp, Daniele Varrazzo.

To start, I ran a timing test of gmpy, which proves to be as fast as numba. This pull request has the details.

It took a small bit of kicking to get pgmp to install, but then I made a test database that uses the new column type mpz and stores the value $2^{513}$.

    postgres=# create database pgmp_test;
    CREATE DATABASE
    postgres=# \connect pgmp_test;
    You are now connected to database "pgmp_test" as user "jeremy".
    pgmp_test=# CREATE EXTENSION pgmp;
    CREATE EXTENSION
    pgmp_test=# create table test_table (id int4, value mpz);
    CREATE TABLE
    pgmp_test=# insert into test_table
    pgmp_test-# values (1, 2::mpz ^ 513);
    INSERT 0 1
    pgmp_test=# select * from test_table;
     id | value
    ----+-------------------------------------------------------------------------------------------------------------------------------------------------------------
      1 | 26815615859885194199148049996411692254958731641184786755447122887443528060147093953603748596333806855380063716372972101707507765623893139892867298012168192
    (1 row)

Now I’m pretty confident this approach will work.

This pull request includes the necessary commits to add a postgres implementation of our database interface and add tests (which is a minor nuisance).

Then this pull request converts the main divisor computation functions to use gmpy, and this final commit converts the main program to use the postgres database.

This exposed one bug, that I wasn’t converting the new mpz types properly in the postgres sql query. This commit fixes it, and this commit adds a regression test to catch that specific error going forward.

With all that work, I ran the counterexample search for a few hours.

When I stopped it, it had checked all possibly-superabundant numbers whose prime factorizations have at most 75 prime factors, including multiplicity. Since all possible counterexamples to the RH must be superabundant, and all superabundant numbers have the aforementioned special prime factorization, we can say it more simply. I ruled out *all* positive integers whose prime factorization has at most 75 factors.

The top 10 are:

    divisor=# select n, witness_value from RiemannDivisorSums where witness_value > 1.7 and n > 5040 order by witness_value desc limit 10;
     n | witness_value
    ---+--------------------
     7837096340441581730115353880089927210115664131849557062713735873563599357074419807246597145310377220030504976899588686851652680862494751024960000 | 1.7679071291526642
     49445402778811241199465955079431717413978953513246416799455746836363402883750282695562127099750014006501608687063651021146073696293342277760000 | 1.767864530684858
     24722701389405620599732977539715858706989476756623208399727873418181701441875141347781063549875007003250804343531825510573036848146671138880000 | 1.767645098171234
     157972532839652527793820942745788234549453525601426251755449670403716942120607931934703281468849885004797471843653837128262216282087355520000 | 1.7676163327497005
     2149800120817880052150693699105726844086041457097670295628510732015800125380447073720092482597826695934852551611463087875916247664927925120000 | 1.767592584103948
     340743319149633988265884951308257704787637570949980741857118951024504319872800861184634658491755531305674129430416899428332725254891076131520000 | 1.767582883432923
     23511289021324745190346061640269781630346992395548671188141207620690798071223259421739791435931131660091514930698766060554958042587484253074880000 | 1.7674462177172812
     507950266365442211555694349664913937458049921547994378634886400011951582381375986928306371282475514484879330686989829994412271003496320000 | 1.7674395010995763
     78986266419826263896910471372894117274726762800713125877724835201858471060303965967351640734424942502398735921826918564131108141043677760000 | 1.7674104158678667
     6868370993028370773644388815034271067367544591366358771976072626248562700895997040639273107341299348034672688854514657750531142699450240000 | 1.7674059308384011

This is new. We’ve found quite a few numbers that have a better witness value than $n = 10080$, which achieves ~1.7558. The best is

78370963404415817301153538800899272101156641318495

57062713735873563599357074419807246597145310377220

030504976899588686851652680862494751024960000

which achieves ~1.7679. Recall the 1.781 threshold needed to be a RH counterexample. We’re about 50% of the way toward disproving RH. How much more work could it take?
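For readers joining mid-series, here is a minimal sketch of how a witness value can be computed for a small number. It assumes the witness value is $\sigma(n) / (n \ln \ln n)$ and that the 1.781 threshold is $e^{\gamma}$ (with $\gamma$ the Euler–Mascheroni constant), as in Robin's criterion. The function names are mine, and the naive divisor loop is only practical for small $n$.

```python
import math

EULER_MASCHERONI = 0.5772156649015329


def divisor_sum_naive(n):
    """Sum of all divisors of n, pairing each divisor d with n // d."""
    total = 0
    for d in range(1, math.isqrt(n) + 1):
        if n % d == 0:
            total += d
            if d != n // d:
                total += n // d
    return total


def witness_value(n):
    """sigma(n) / (n * ln(ln(n))), assumed here to be the series' witness value."""
    return divisor_sum_naive(n) / (n * math.log(math.log(n)))


threshold = math.exp(EULER_MASCHERONI)
best_small = witness_value(10080)
```

For the enormous numbers in the table above this naive loop is hopeless, which is why the search computes divisor sums from prime factorizations instead.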

But seriously, what’s next with this project? For one, even though we have some monstrous numbers and their divisor sums and witness values, it’s hard to see the patterns in them through SQL queries. It would be nicer to make some pretty plots.

I could also take a step back and see what could be improved from a software engineering standpoint. For one, not all parts of the application are tested, and tests aren’t automatically run when I make changes. This enabled the bug above where I didn’t properly convert mpz types before passing them to SQL upsert statements. For two, while I have been using type annotations in some places, they aren’t checked, and the switch to mpz has almost certainly made many of the type hints incorrect. I could fix that and set up a test that type checks.

Finally, in the interest of completeness, I could set up a front end that displays some summary of the data, and then deploy the whole application so that it has a continuously-running background search, along with a website that shows how far along the search is. Based on how long the SQL statement to find the top 10 witness values took, this would also likely require some caching, which fits snugly in the class of typical software engineering problems.

Let me know what you’re interested in.

A *superabundant number* $n$ is one which has “maximal relative divisor sums” in the following sense: for all $m < n$,

$$\frac{\sigma(m)}{m} < \frac{\sigma(n)}{n}$$

where $\sigma(n)$ is the sum of the divisors of $n$.

Erdős and Alaoglu proved in 1944 (“On highly composite and similar numbers”) that superabundant numbers have a specific prime decomposition, in which all initial primes occur with non-increasing exponents

$$n = \prod_{i=1}^{k} (p_i)^{a_i},$$

where $p_i$ is the i-th prime, and $a_1 \geq a_2 \geq \dots \geq a_k \geq 1$. With two exceptions ($n = 4, 36$), $a_k = 1$.

Here’s a rough justification for why superabundant numbers should have a decomposition like this. If you want a number with many divisors (compared to the size of the number), you want to pack as many combinations of small primes into the decomposition of your number as possible. Using all 2’s leads to not enough combinations (only $a + 1$ divisors for $2^a$), but using 2’s and 3’s you get $(a+1)(b+1)$ for $2^a 3^b$. Using more 3’s trades off a larger number $n$ for the benefit of a larger $\sigma(n)$ (up to $b = a$). The balance between getting more distinct factor combinations and a larger $n$ favors packing the primes in there.

Though numbers of this form are not necessarily superabundant, this gives us an enumeration strategy better than trying all numbers. Enumerate over tuples corresponding to the exponents of the prime decomposition (non-increasing lists of integers), and save those primes to make it easier to compute the divisor sum.

Non-increasing lists of integers can be enumerated in the order of their sum, and for each sum $n$, the set of non-increasing lists of integers summing to $n$ is called the *partitions* of $n$. There is a simple algorithm to compute them, implemented in this commit. Note this does not enumerate them in order of the magnitude of the number $\prod_i p_i^{a_i}$.
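For reference, here is one simple way to enumerate partitions as non-increasing lists. This is a sketch of my own and not necessarily the algorithm in the linked commit. The helper `partition_to_number` and the hard-coded prime list are likewise hypothetical.

```python
def partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing lists of positive ints."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    # Choose the largest part first; the rest must not exceed it.
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield [first] + rest


SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]


def partition_to_number(exponents):
    """Interpret a partition as exponents on the first few primes."""
    result = 1
    for prime, exponent in zip(SMALL_PRIMES, exponents):
        result *= prime ** exponent
    return result


level_three = [partition_to_number(p) for p in partitions(3)]
level_four = [partition_to_number(p) for p in partitions(4)]
```

Comparing `level_three` with `level_four` shows the lack of monotonicity: the sum-3 partition `[1, 1, 1]` gives 30, which is larger than the sum-4 partition `[4]`, which gives 16.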

The implementation for the prime-factorization-based divisor sum computation is in this commit. In addition, to show some alternative methods of testing, we used the hypothesis library to autogenerate tests. It chooses a random (limited size) prime factorization, and compares the prime-factorization-based algorithm to the naive algorithm. There’s a bit of setup code involved, but as a result we get dozens of tests and more confidence it’s right.
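The idea behind the prime-factorization-based computation is that the divisor sum is multiplicative: each prime power $p^a$ contributes the geometric series $1 + p + \dots + p^a = (p^{a+1} - 1)/(p - 1)$. The sketch below is my own illustration rather than the code in the commit, and it compares the result against a naive loop on one example, much as the autogenerated tests do on random inputs.

```python
def divisor_sum_from_factorization(factorization):
    """Divisor sum of the number with the given (prime, exponent) pairs."""
    total = 1
    for prime, exponent in factorization:
        # Closed form of 1 + p + p^2 + ... + p^exponent (always an integer).
        total *= (prime ** (exponent + 1) - 1) // (prime - 1)
    return total


def divisor_sum_naive(n):
    """Reference implementation: test every candidate divisor."""
    return sum(d for d in range(1, n + 1) if n % d == 0)


# 10080 = 2^5 * 3^2 * 5 * 7
factorization = [(2, 5), (3, 2), (5, 1), (7, 1)]
fast = divisor_sum_from_factorization(factorization)
slow = divisor_sum_naive(10080)
```

The fast version does work proportional to the number of distinct primes, not to the size of the number itself.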

We now have two search strategies over the space of natural numbers, though one is obviously better. We may come up with a third, so it makes sense to separate the search strategy from the main application by an interface. Generally, if you have a hard-coded implementation, and you realize that you need to change it in a significant way, that’s a good opportunity to extract it and hide it behind an interface.

A good interface choice is a bit tricky here, however. In the original implementation, we could say, “process the batch of numbers (search for counterexamples) between 1 and 2 million.” When that batch is saved to the database, we would start on the next batch, and all the batches would be the same size, so (ignoring that computing $\sigma(n)$ the old way takes longer as $n$ grows) each batch required roughly the same time to run.

The new search strategy doesn’t have a sensible way to do this. You can’t say “start processing from K” because we don’t know how to easily get from K to the parameter of the enumeration corresponding to K (if one exists). This is partly because our enumeration isn’t monotonic increasing ($2^1 3^1 5^1 = 30$ comes before $2^4 = 16$). And partly because even if we did have a scheme, it would almost certainly require us to compute a prime factorization, which is slow. It would be better if we could save the data from the latest step of the enumeration, and load it up when starting the next batch of the search.

This scheme suggests a nicely generic interface for stopping and restarting a search from a particular spot. The definition of a “spot,” and how to start searching from that spot, are what’s hidden by the interface. Here’s a first pass.

    from abc import ABC, abstractmethod
    from typing import List, TypeVar

    SearchState = TypeVar('SearchState')

    class SearchStrategy(ABC):
        @abstractmethod
        def starting_from(self, search_state: SearchState) -> 'SearchStrategy':
            '''Reset the search strategy to search from a given state.'''
            pass

        @abstractmethod
        def search_state(self) -> SearchState:
            '''Get an object describing the current state of the enumeration.'''
            pass

        # RiemannDivisorSum is defined elsewhere in the repository.
        @abstractmethod
        def next_batch(self, batch_size: int) -> List['RiemannDivisorSum']:
            '''Process the next batch of Riemann Divisor Sums'''
            pass

Note that `SearchState` is defined as a generic type variable because we cannot say anything about its structure yet. The implementation class is responsible for defining what constitutes a search state, and getting the search strategy back to the correct step of the enumeration given the search state as input. Later I realized we do need some structure on the `SearchState` (the ability to serialize it for storage in the database), so we elevated it to an interface later.

Also note that we are making `SearchStrategy` own the job of computing the Riemann divisor sums. This is because the enumeration details and the algorithm to compute the divisor sums are now coupled. For the exhaustive search strategy it was “integers n, naively loop over smaller divisors.” In the new strategy it’s “prime factorizations, prime-factorization-based divisor sum.” We could decouple this, but there is little reason to now because the implementations are still in 1-1 correspondence.
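To show how stopping and resuming fits this shape, here is a toy, self-contained strategy with the same three methods. It is a sketch under simplifying assumptions: it does not subclass the real interface, it returns plain `(n, divisor_sum)` tuples instead of `RiemannDivisorSum` objects, and the names `ExhaustiveState` and `ExhaustiveStrategy` are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExhaustiveState:
    n: int  # the next integer the search will process


class ExhaustiveStrategy:
    def __init__(self) -> None:
        self._state = ExhaustiveState(n=1)

    def starting_from(self, search_state: ExhaustiveState) -> "ExhaustiveStrategy":
        """Reset the search to resume from a previously saved state."""
        self._state = search_state
        return self

    def search_state(self) -> ExhaustiveState:
        """Describe where the enumeration currently stands."""
        return self._state

    def next_batch(self, batch_size: int) -> List[Tuple[int, int]]:
        """Compute naive divisor sums for the next `batch_size` integers."""
        start = self._state.n
        batch = [
            (n, sum(d for d in range(1, n + 1) if n % d == 0))
            for n in range(start, start + batch_size)
        ]
        self._state = ExhaustiveState(n=start + batch_size)
        return batch


first_run = ExhaustiveStrategy()
first_batch = first_run.next_batch(3)
saved = first_run.search_state()

# A later program run picks up where the first one stopped.
resumed_batch = ExhaustiveStrategy().starting_from(saved).next_batch(2)
```

The caller never needs to know that the state is "the next integer"; a different strategy could store a partition index instead.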

This commit implements the old search strategy in terms of this interface, and this commit implements the new search strategy. In the latter, I use `pytest.mark.parametrize` to test against the interface and parameterize over the implementations.

The last needed bit is the ability to store and recover the search state in between executions of the main program. This requires a second database table. The minimal thing we could do is just store and update a single row for each search strategy, providing the search state as of the last time the program was run and stopped. This would do, but in my opinion an *append-only log* is a better design for such a table. That is, each batch computed will have a record containing the timestamp the batch started and finished, along with the starting and ending search state. We can use the largest timestamp for a given search strategy to pick up where we left off across program runs.
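A minimal sketch of such an append-only log, using Python's built-in `sqlite3` module. The table name, column names, text-serialized states, and timestamps are all assumptions for illustration, and the actual schema in the repository may differ.

```python
import sqlite3


def create_log(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS SearchLog (
            start_time TEXT NOT NULL,
            end_time TEXT NOT NULL,
            strategy TEXT NOT NULL,
            starting_state TEXT NOT NULL,
            ending_state TEXT NOT NULL
        )""")


def append_batch(conn, start_time, end_time, strategy, starting_state, ending_state):
    """Rows are only ever inserted, never updated or deleted."""
    conn.execute(
        "INSERT INTO SearchLog VALUES (?, ?, ?, ?, ?)",
        (start_time, end_time, strategy, starting_state, ending_state),
    )


def latest_state(conn, strategy):
    """Ending state of the most recently finished batch, or None."""
    row = conn.execute(
        "SELECT ending_state FROM SearchLog WHERE strategy = ? "
        "ORDER BY end_time DESC LIMIT 1",
        (strategy,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_log(conn)
# ISO-8601 timestamps sort correctly as text; these values are made up.
append_batch(conn, "2021-01-01T00:00:00", "2021-01-01T00:00:04",
             "SuperabundantSearchStrategy", "1,0", "10,4")
append_batch(conn, "2021-01-01T00:00:04", "2021-01-01T00:00:05",
             "SuperabundantSearchStrategy", "10,4", "12,6")
resume_from = latest_state(conn, "SuperabundantSearchStrategy")
```

Because old rows are never overwritten, the log doubles as a timing history for every batch, which a single updated row would throw away.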

One can imagine this being the basis for an application like folding@home or the BOINC family of projects, where a database stores chunks of a larger computation (ranges of a search space), clients can request chunks to complete, and they are assembled into a complete database. In this case we might want to associate the chunk metadata with the computed results (say, via a foreign key). That would require a bit of work from what we have now, but note that the interfaces would remain reusable for this. For now, we will just incorporate the basic table approach. It is completed in this pull request, and tying it into the main search routine is done in this commit.

However, when running it with the superabundant search strategy, we immediately run into a problem. Superabundant numbers grow too fast, and within a few small batches of size 100 we quickly exceed the 64 bits available to numba and sqlite to store the relevant data.

    >>> fac = partition_to_prime_factorization(partitions_of_n(16)[167])
    >>> fac2 = [p**d for (p, d) in fac]
    >>> fac2
    [16, 81, 625, 2401, 11, 13, 17, 19, 23, 29, 31, 37]
    >>> math.log2(reduce(lambda x,y: x*y, fac2))
    65.89743638933722

Running `populate_database.py` results in the error

    $ python -m riemann.populate_database db.sqlite3 SuperabundantSearchStrategy 100
    Searching with strategy SuperabundantSearchStrategy
    Starting from search state SuperabundantEnumerationIndex(level=1, index_in_level=0)
    Computed [1,0, 10,4] in 0:00:03.618798
    Computed [10,4, 12,6] in 0:00:00.031451
    Computed [12,6, 13,29] in 0:00:00.031518
    Computed [13,29, 14,28] in 0:00:00.041464
    Computed [14,28, 14,128] in 0:00:00.041674
    Computed [14,128, 15,93] in 0:00:00.034419
    ...
    OverflowError: Python int too large to convert to SQLite INTEGER
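The failure can be reproduced in isolation. The sketch below uses a throwaway in-memory table named `t` (a name made up for this demo) and the factor list from the interpreter session above; it only demonstrates the limit and does not attempt a fix.

```python
import math
import sqlite3

# The prime-power factors printed in the session above.
factors = [16, 81, 625, 2401, 11, 13, 17, 19, 23, 29, 31, 37]
n = math.prod(factors)
bits_needed = n.bit_length()  # more than the 63 value bits of a signed 64-bit int

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (n INTEGER)')

# The largest signed 64-bit integer is accepted.
conn.execute('INSERT INTO t VALUES (?)', (2**63 - 1,))

# One more than that is rejected by the sqlite3 module before reaching SQLite.
try:
    conn.execute('INSERT INTO t VALUES (?)', (2**63,))
    overflowed = False
except OverflowError:
    overflowed = True
```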

We’ll see what we can do about this in a future article, but meanwhile we do get some additional divisor sums for these large numbers, and 10080 is still the best.

    sqlite> select n, witness_value from RiemannDivisorSums where witness_value > 1.7 and n > 5040 order by witness_value desc limit 10;
    10080|1.7558143389253
    55440|1.75124651488749
    27720|1.74253672381383
    7560|1.73991651920276
    15120|1.73855867428903
    160626866400|1.73744669257158
    321253732800|1.73706925385011
    110880|1.73484901030336
    6983776800|1.73417642212953
    720720|1.73306535623807

As in the previous post, I’ll link to specific git commits in the final code repository to show how the project evolves. You can browse or checkout the repository at each commit to see how it works.

The approach we’ll take is one that highlights the principle of good testing and good software design: separate components by thin interfaces so that the implementations of those interfaces can change later without needing to update lots of client code.

We’ll take this to the extreme by implementing and testing the logic for our application before we ever decide what sort of database we plan to use! In other words, the choice of database will be our *last choice*, making it inherently flexible to change. That is, first we iron out a minimal interface that our application needs, and then choose the right database based on those needs. This is useful because software engineers often don’t understand how the choice of a dependency (especially a database dependency) will work out long term, particularly as a prototype starts to scale and hit application-specific bottlenecks. Couple this with the industry’s trend of chasing hot new fads, and eventually you realize no choice is sacred. Interface separation is the software engineer’s only defense, and their most potent tool for flexibility. As a side note, Tom Gamon summarizes this attitude well in a recent article, borrowing the analogy from a 1975 investment essay The Winner’s Game by Charles Ellis. Some of his other articles reinforce the idea that important decisions should be made as late as possible, since that is the only time you know enough to make those decisions well.

Our application has two parts so far: adding new divisor sums to the database, and loading divisor sums for analysis. Since we’ll be adding to this database over time, it may also be prudent to summarize the contents of the database, e.g. to say what’s the largest computed integer. This suggests the following first-pass interface, implemented in this commit.

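    from abc import ABC, abstractmethod
    from typing import List

    # RiemannDivisorSum and SummaryStats are dataclasses defined elsewhere
    # in the project.

    class DivisorDb(ABC):
        @abstractmethod
        def load(self) -> List[RiemannDivisorSum]:
            '''Load the entire database.'''
            pass

        @abstractmethod
        def upsert(self, data: List[RiemannDivisorSum]) -> None:
            '''Insert or update data.'''
            pass

        @abstractmethod
        def summarize(self) -> SummaryStats:
            '''Summarize the contents of the database.'''
            pass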

`RiemannDivisorSum` and `SummaryStats` are dataclasses. These are special classes that are intended to have restricted behavior: storing data and providing simple derivations on that data. For us this provides a stabler interface because the contents of the return values can change over time without interrupting other code. For example, we might want to eventually store the set of divisors alongside their sum. Compare this to returning a list or tuple, which is brittle when used with things like tuple assignment.
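As a rough sketch of what such dataclasses might look like: the fields `n`, `witness_value`, and `largest_computed_n` appear elsewhere in this article, while `divisor_sum`, `largest_witness_value`, and the choice of `frozen=True` are my assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiemannDivisorSum:
    n: int
    divisor_sum: int       # assumed field name
    witness_value: float


@dataclass(frozen=True)
class SummaryStats:
    largest_computed_n: Optional[int]
    largest_witness_value: Optional[float]  # assumed field


record = RiemannDivisorSum(n=72, divisor_sum=195, witness_value=0.0)

# Callers read fields by name, so adding a new field later does not break them.
n = record.n

# By contrast, tuple assignment breaks as soon as the tuple grows.
as_tuple = (72, 195)
n_from_tuple, divisor_sum_from_tuple = as_tuple
longer_tuple = (72, 195, 0.0)
try:
    n_from_tuple, divisor_sum_from_tuple = longer_tuple
    unpack_failed = False
except ValueError:
    unpack_failed = True
```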

The other interesting tidbit about the commit is the use of abstract base classes (“ABC”, an awful name choice). Python has limited support for declaring an “interface” as many other languages do. The pythonic convention was always to use its “duck-typing” feature, which meant to just call whatever methods you want on an object, and then any object that has those methods can be used in that spot. The mantra was, “if it walks like a duck and talks like a duck, then it’s a duck.” However, there was no way to say “a duck is any object that has a `waddle` and `quack` method, and those are the only allowed duck functions.” As a result, I often saw folks tie their code to one particular duck implementation. That said, there were some mildly cumbersome third party libraries that enabled interface declarations. Better, recent versions of Python introduced the abstract base class as a means to enforce interfaces, and structural subtyping (`typing.Protocol`) to interact with type hints when subtyping directly is not feasible (e.g., when the source is in different codebases).
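Here is a small sketch of the two mechanisms side by side, sticking with the duck analogy. All class and function names below are made up for the example.

```python
from abc import ABC, abstractmethod
from typing import Protocol  # available since Python 3.8


class Duck(ABC):
    '''Nominal interface: implementations must subclass Duck.'''

    @abstractmethod
    def waddle(self) -> str:
        pass

    @abstractmethod
    def quack(self) -> str:
        pass


class Mallard(Duck):
    def waddle(self) -> str:
        return 'waddle'

    def quack(self) -> str:
        return 'quack'


class IncompleteDuck(Duck):
    # Forgot to implement quack, so this class cannot be instantiated.
    def waddle(self) -> str:
        return 'waddle'


class Quacker(Protocol):
    '''Structural interface: anything with a matching quack() qualifies.'''

    def quack(self) -> str:
        ...


class Robot:
    # Does not subclass anything, e.g. because it lives in another codebase.
    def quack(self) -> str:
        return 'beep'


def make_noise(quacker: Quacker) -> str:
    return quacker.quack()


try:
    IncompleteDuck()
    enforced = False
except TypeError:
    enforced = True
```

The abstract base class is enforced at instantiation time, whereas the `Protocol` is checked only by a static type checker; at runtime `make_noise(Robot())` is ordinary duck typing.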

Moving on, we can implement an in-memory database that can be used for testing. This is done in this commit. One crucial aspect of these tests is that they do not rely on the knowledge that the in-memory database is secretly a dictionary. That is, the tests use only the `DivisorDb` interface and never inspect the underlying dict. This allows the same tests to run against all implementations, e.g., using `pytest.mark.parametrize`. Also note it’s not thread safe or atomic, but for us this doesn’t really matter.
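For a sense of scale, an in-memory implementation can be only a few lines. This is a self-contained sketch rather than the code from the commit: the dataclass fields are pared down to what the example needs, and `check_upsert_overwrites` is a hypothetical helper showing a check written purely against the interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RiemannDivisorSum:
    n: int
    divisor_sum: int  # assumed field name


@dataclass(frozen=True)
class SummaryStats:
    largest_computed_n: Optional[int]


class DivisorDb(ABC):
    @abstractmethod
    def load(self) -> List[RiemannDivisorSum]:
        pass

    @abstractmethod
    def upsert(self, data: List[RiemannDivisorSum]) -> None:
        pass

    @abstractmethod
    def summarize(self) -> SummaryStats:
        pass


class InMemoryDivisorDb(DivisorDb):
    def __init__(self) -> None:
        # Keyed by n so that upserting the same n twice overwrites.
        self._rows: Dict[int, RiemannDivisorSum] = {}

    def load(self) -> List[RiemannDivisorSum]:
        return list(self._rows.values())

    def upsert(self, data: List[RiemannDivisorSum]) -> None:
        for record in data:
            self._rows[record.n] = record

    def summarize(self) -> SummaryStats:
        return SummaryStats(largest_computed_n=max(self._rows, default=None))


def check_upsert_overwrites(db: DivisorDb) -> None:
    '''Touches only the DivisorDb interface, never the underlying dict.'''
    db.upsert([RiemannDivisorSum(n=72, divisor_sum=0)])
    db.upsert([RiemannDivisorSum(n=72, divisor_sum=195)])
    assert db.load() == [RiemannDivisorSum(n=72, divisor_sum=195)]
    assert db.summarize().largest_computed_n == 72
```

Since the check takes the database as an argument, it would run unchanged against any other `DivisorDb` implementation.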

With our first-pass database interface and implementation, we can write the part of the application that populates the database with data. A simple serial algorithm that computes divisor sums in batches of 100k until the user hits Ctrl-C is done in this commit.

    def populate_db(db: DivisorDb, batch_size: int = 100000) -> None:
        '''Populate the db in batches.'''
        starting_n = (db.summarize().largest_computed_n or 5040) + 1
        while True:
            ending_n = starting_n + batch_size
            db.upsert(compute_riemann_divisor_sums(starting_n, ending_n))
            starting_n = ending_n + 1

I only tested this code manually. The reason is that line 13 (highlighted in the abridged snippet above) is the only significant behavior not already covered by the `InMemoryDivisorDb` tests. (Of course, that line had a bug later fixed in this commit). I’m also expecting to change it soon, and spending time testing vs implementing features is a tradeoff that should not always fall on the side of testing.

Next let’s swap in a SQL database. We’ll add sqlite3, which comes prepackaged with Python, so it needs no dependency management. The implementation in this commit uses the same interface as the in-memory database, but the implementation is full of SQL queries. With this, we can upgrade our tests to run identically on both implementations. The commit looks large, but really I just indented all the existing tests, and added the pytest `parametrize` annotation to the class definition (and corresponding method arguments). This avoids adding a `parametrize` annotation to every individual test function. That wouldn’t be all that bad, but each new test would require the writer to remember to include the annotation, whereas this way systematically requires the extra method argument.
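To illustrate the class-level annotation, here is a toy sketch with two interchangeable stores. `DictStore`, `SqliteStore`, the `sums` table, and `TestStore` are simplified stand-ins invented for this example, not the project’s `DivisorDb` classes, and `INSERT OR REPLACE` is just one way to get upsert behavior in SQLite.

```python
import sqlite3

import pytest


class DictStore:
    def __init__(self):
        self._data = {}

    def upsert(self, n, value):
        self._data[n] = value

    def load(self):
        return sorted(self._data.items())


class SqliteStore:
    def __init__(self):
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute(
            'CREATE TABLE sums (n INTEGER PRIMARY KEY, value INTEGER)')

    def upsert(self, n, value):
        # The primary key on n makes a repeated insert replace the old row.
        self._conn.execute(
            'INSERT OR REPLACE INTO sums (n, value) VALUES (?, ?)', (n, value))

    def load(self):
        return self._conn.execute(
            'SELECT n, value FROM sums ORDER BY n').fetchall()


# One annotation on the class; every test method receives the argument.
@pytest.mark.parametrize('new_store', [DictStore, SqliteStore])
class TestStore:
    def test_upsert_then_load(self, new_store):
        store = new_store()
        store.upsert(72, 195)
        assert store.load() == [(72, 195)]

    def test_upsert_overwrites(self, new_store):
        store = new_store()
        store.upsert(72, 0)
        store.upsert(72, 195)
        assert store.load() == [(72, 195)]
```

A new test method added to the class that forgets the `new_store` argument fails immediately, which is the systematic reminder described above.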

And finally, we can switch the database population script to use the SQL database implementation. This is done in this commit. Notice how simple it is, and how it doesn’t require any extra testing.

After running it a few times and getting a database with about 20 million rows, we can apply the simplest possible analysis: showing the top few witness values.

    sqlite> select n, witness_value from RiemannDivisorSums where witness_value > 1.7 order by witness_value desc limit 100;
    10080|1.7558143389253
    55440|1.75124651488749
    27720|1.74253672381383
    7560|1.73991651920276
    15120|1.73855867428903
    110880|1.73484901030336
    720720|1.73306535623807
    1441440|1.72774021157846
    166320|1.7269287425473
    2162160|1.72557022852613
    4324320|1.72354665986337
    65520|1.71788900114772
    3603600|1.71646721405987
    332640|1.71609697536058
    10810800|1.71607328780293
    7207200|1.71577914933961
    30240|1.71395368739173
    20160|1.71381061514181
    25200|1.71248203640096
    83160|1.71210965310318
    360360|1.71187211014506
    277200|1.71124375582698
    2882880|1.7106690212765
    12252240|1.70971873843453
    12600|1.70953565488377
    8648640|1.70941081706371
    32760|1.708296575835
    221760|1.70824623791406
    14414400|1.70288499724944
    131040|1.70269370474016
    554400|1.70259313608473
    1081080|1.70080265951221

We can also confirm John’s claim that “the winners are all multiples of 2520,” as the best non-multiple-of-2520 up to 20 million is 18480, whose witness value is only about 1.69.

This multiple-of-2520 pattern is probably because 2520 is a highly composite number, i.e., it has more divisors than all smaller numbers, so its sum-of-divisors will tend to be large. Digging in a bit further, it seems the smallest counterexample, if it exists, is necessarily a superabundant number. Such numbers have a nice structure described here that suggests a search strategy better than trying every number.

Next time, we can introduce the concept of a search strategy as a new component to the application, and experiment with different search strategies. Other paths forward include building a front-end component, and deploying the system on a server so that the database can be populated continuously.

- There are too many choices with a blank slate.
- Making slightly wrong choices early on causes things to fail in unexpected ways.

I suspect the same concerns apply to general project organization and architecture. Because Python is popular for mathy-programmies, I’ll build a Python project that shows how I organize my projects and test my code, and how that shapes the design and evolution of my software. I will use Python 3.8 and pytest, and you can find the final code on Github.

For this project, we’ll take advice from John Baez and explore a question that glibly aims to disprove the Riemann Hypothesis:

A CHALLENGE:

Let σ(n) be the sum of divisors of n. There are infinitely many n with σ(n)/(n ln(ln(n)) > 1.781. Can you find one? If you can find n > 5040 with σ(n)/(n ln(ln(n)) > 1.782, you’ll have disproved the Riemann Hypothesis.

I don’t expect you can disprove the Riemann Hypothesis this way, but I’d like to see numbers that make σ(n)/(n ln(ln(n)) big. It seems the winners are all multiples of 2520, so try those. The best one between 5040 and a million is n = 10080, which only gives 1.755814.

https://twitter.com/johncarlosbaez/status/1149700802371608576
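To make the quantity in the challenge concrete, here is a deliberately naive sketch of it. The name `witness_value` mirrors a column name that appears in the database later in this series; the implementation here is only for illustration and is far too slow for a real search.

```python
import math


def divisor_sum(n: int) -> int:
    '''Sum of all positive divisors of n, by brute force.'''
    return sum(d for d in range(1, n + 1) if n % d == 0)


def witness_value(n: int) -> float:
    '''Compute sigma(n) / (n ln(ln(n))).'''
    return divisor_sum(n) / (n * math.log(math.log(n)))


best_known = witness_value(10080)  # about 1.7558, as in the tweet
exception = witness_value(5040)    # about 1.79, hence the n > 5040 condition
```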

One of the hardest parts of software is setting up your coding environment. If you use an integrated development environment (IDE), project setup is bespoke to each IDE. I dislike this approach, because what you learn when using the IDE is not useful outside the IDE. When I first learned to program (Java), I was shackled to Eclipse for years because I didn’t know how to compile and run Java programs without it. Instead, we’ll do everything from scratch, using only the terminal/shell and standard Python tools. I will also ignore random extra steps and minutiae I’ve built up over the years to deal with minor issues. If you’re interested in that and why I do them, leave a comment and I might follow up with a second article.

This article assumes you are familiar with the basics of Python syntax, and know how to open a terminal and enter basic commands (like `ls`, `cd`, `mkdir`, `rm`). Along the way, I will link to specific git commits that show the changes, so that you can see how the project unfolds with each twist and turn.

I’ll start by creating a fresh Python project that does nothing. We set up the base directory `riemann-divisor-sum`, initialize git, create a readme, and track it in git (`git add` + `git commit`).

    mkdir riemann-divisor-sum
    cd riemann-divisor-sum
    git init .
    echo "# Divisor Sums for the Riemann Hypothesis" > README.md
    git add README.md
    git commit -m "add empty README.md"

Next I create a Github project at https://github.com/j2kun/riemann-divisor-sum (the name `riemann-divisor-sum` does not need to be the same, but I think it’s good), and push the project up to Github.

    git remote add origin git@github.com:j2kun/riemann-divisor-sum.git
    # instead of "master", my default branch is really "main"
    git push -u origin master

Note, if you’re a new Github user, the “default branch name” when creating a new project may be “master.” I like “main” because it’s shorter, clearer, and nicer. If you want to change your default branch name, you can update to git version 2.28 and add the following to your `~/.gitconfig` file.

    [init]
      defaultBranch = main

Here is what the project looks like on Github as of this single commit.

Next I’ll install the pytest library which will run our project’s tests. First I’ll show what a failing test looks like, by setting up a trivial program with an un-implemented function, and a corresponding test. For ultimate simplicity, we’ll use Python’s built-in `assert` for the test lines. Here’s the commit.

    # in the terminal
    mkdir riemann
    mkdir tests

    # create riemann/divisor.py containing:
    '''Compute the sum of divisors of a number.'''

    def divisor_sum(n: int) -> int:
        raise ValueError("Not implemented.")

    # create tests/divisor_test.py containing:
    from riemann.divisor import divisor_sum

    def test_sum_of_divisors_of_72():
        assert 195 == divisor_sum(72)

Next we install and configure Pytest. At this point, since we’re introducing a dependency, we need a project-specific place to store that dependency. All dependencies related to a project should be explicitly declared and isolated. This page helps explain why. Python’s standard tool is the virtual environment. When you “activate” the virtual environment, it temporarily (for the duration of the shell session or until you run `deactivate`) points all Python tools and libraries to the virtual environment.

    virtualenv -p python3.8 venv
    source venv/bin/activate

    # shows the location of the overridden python binary path
    which python
    # outputs: /Users/jeremy/riemann-divisor-sum/venv/bin/python

Now we can use pip as normal and it will install to `venv`. To declare and isolate the dependency, we write the output of `pip freeze` to a file called `requirements.txt`, and it can be reinstalled using `pip install -r requirements.txt`. Try deleting your `venv` directory, recreating it, and reinstalling the dependencies this way.

    pip install pytest
    pip freeze > requirements.txt
    git add requirements.txt
    git commit -m "requirements: add pytest"

    # example to wipe and reinstall
    # deactivate
    # rm -rf venv
    # virtualenv -p python3.8 venv
    # source venv/bin/activate
    # pip install -r requirements.txt

As an aside, at this step you may notice git mentions `venv` is an untracked directory. You can ignore this, or add `venv` to a `.gitignore` file to tell git to ignore it, as in this commit. We will also have to configure pytest to ignore `venv` shortly.

When we run `pytest` (with no arguments) from the base directory, we see our first error:

        from riemann.divisor import divisor_sum
    E   ModuleNotFoundError: No module named 'riemann'

Module import issues are a common stumbling block for new Python users. In order to make a directory into a Python module, it needs an `__init__.py` file, even if it’s empty. Any code in this file will be run the first time the module is imported in a Python runtime. We add one to both the code and test directories in this commit.

When we run pytest (with no arguments), it recursively searches the directory tree looking for files named like `*_test.py` and `test_*.py`, loads them, and treats every function in those files whose name is prefixed with “test” as a test. Non-“test” functions can be defined and used as helpers to set up complex tests. Pytest then runs the tests and reports the failures.

Our implementation is intentionally wrong for demonstration purposes. When a test passes, pytest will report it quietly as a “.” by default. See these docs for more info on different ways to run the pytest binary and configure its output report.

In this basic pytest setup, you can put test files wherever you want, name the files and test methods appropriately, and use assert to implement the tests themselves. As long as your modules are set up properly, as long as imports are absolute (see this page for gory details on absolute vs. relative imports), and as long as you run pytest from the base directory, pytest will find the tests and run them.

Since pytest searches *all* directories for tests, this includes `venv` and `__pycache__`, which magically appears when you create python modules (I add `__pycache__` to gitignore). Sometimes package developers will include test code, and pytest will then run those tests, which often fail or clutter the output. A virtual environment also gets large as you install big dependencies (like numpy, scipy, pandas), so this makes pytest slow to search for tests to run. To alleviate this, the `--norecursedirs` command line flag tells pytest to skip directories. Since it’s tedious to type `--norecursedirs='venv __pycache__'` every time you run pytest, you can make this the default behavior by storing the option in a configuration file recognized by pytest, such as setup.cfg. I did it in this commit.

Some other command line options that I use all the time:

- `pytest test/dir` to test only files in that directory, or `pytest test/dir/test_file.py` to test only tests in that file.
- `pytest -k STR` to only run tests whose name contains “STR”.
- `pytest -s` to see any logs or print statements inside tested code.
- `pytest -s` to allow the pdb/ipdb debugger to function and step through a failing test.

Now let’s build up the project. My general flow is as follows:

- Decide what work to do next.
- Sketch out the interface for that work.
- Write some basic (failing, usually lightweight) tests that will pass when the work is done.
- Do the work.
- Add more nuanced tests if needed, based on what is learned during the work.
- Repeat until the work is done.

This strategy is sometimes called “the design recipe,” and I first heard about it from my undergraduate programming professor John Clements at Cal Poly, via the book “How to Design Programs.” Even if I don’t always use it, I find it’s a useful mental framework for getting things done.

For this project, I want to search through positive integers, and for each one I want to compute a divisor sum, do some other arithmetic, and compare that against some other number. I suspect divisor sum computations will be the hard/interesting part, but to start I will code up a slow/naive implementation with some working tests, confirm my understanding of the end-to-end problem, and then improve the pieces as needed.

In this commit, I implement the naive divisor sum code and tests. Note the commit also shows how to tell pytest to test for a raised exception. In this commit I implement the main search routine and confirm John’s claim about $n = 10080$ (thanks for the test case!).
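For reference, testing for a raised exception in pytest is done with the `pytest.raises` context manager. The sketch below is self-contained; the decision to raise `ValueError` on non-positive input is my assumption about how the naive implementation treats bad input.

```python
import pytest


def divisor_sum(n: int) -> int:
    '''Naive sum of divisors; rejecting n <= 0 is an assumed behavior.'''
    if n <= 0:
        raise ValueError('n must be a positive integer')
    return sum(d for d in range(1, n + 1) if n % d == 0)


def test_sum_of_divisors_of_72():
    assert 195 == divisor_sum(72)


def test_negative_input_raises():
    # The test passes only if the block raises ValueError.
    with pytest.raises(ValueError):
        divisor_sum(-5)
```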

These tests already showcase a few testing best practices:

- **Test only one behavior at a time.** Each test has exactly one assertion in it. This is good practice because when a test fails you won’t have to dig around to figure out exactly what went wrong.
- **Use the tests to help you define the interface, and then only test against that interface.** The hard part about writing clean and clear software is defining clean and clear interfaces that work together well and hide details. Math does this very well, because definitions like $\sigma(n)$ do not depend on how $n$ is represented. In fact, math *really doesn’t* have “representations” of its objects (or more precisely, switching representations is basically free, so we don’t dwell on it). In software, we have to choose excruciatingly detailed representations for everything, and so we rely on the software to hide those details as much as possible. The easiest way to tell if you did it well is to try to use the interface and *only* the interface, and tests are an excuse to do that, which is not wasted effort by virtue of being run to check your work.

Next, I want to confirm John’s claim that $n = 10080$ is the *best* example between 5041 and a million. However, my existing code is too slow. Running the tests added in this commit seems to take forever.

We profile to confirm our suspected hotspot:

    >>> import cProfile
    >>> from riemann.counterexample_search import best_witness
    >>> cProfile.run('best_witness(10000)')
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ...
        54826    3.669    0.000    3.669    0.000 divisor.py:10(<genexpr>)

As expected, computing divisor sums is the bottleneck. No surprise there because it makes the search take quadratic time. Before changing the implementation, I want to add a few more tests. I copied data for the first 50 integers from OEIS and used pytest’s `parametrize` feature since the test bodies are all the same. This commit does it.
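A shortened sketch of that kind of table-driven test follows, using only the first ten values of the sum-of-divisors sequence (OEIS A000203) rather than fifty. The constant name `FIRST_DIVISOR_SUMS` is made up, and the naive `divisor_sum` is inlined so the snippet stands alone.

```python
import pytest


def divisor_sum(n: int) -> int:
    return sum(d for d in range(1, n + 1) if n % d == 0)


# sigma(1), sigma(2), ..., sigma(10)
FIRST_DIVISOR_SUMS = [1, 3, 4, 7, 6, 12, 8, 15, 13, 18]

# Pair each expected value with its input: (1, 1), (2, 3), (3, 4), ...
CASES = list(enumerate(FIRST_DIVISOR_SUMS, start=1))


@pytest.mark.parametrize('n,expected', CASES)
def test_divisor_sum(n, expected):
    assert expected == divisor_sum(n)
```

Pytest reports each pair as a separate test, so a failure names the exact input that broke.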

Now I can work on improving the runtime of the divisor sum computation step. Originally, I thought I’d have to compute the prime factorization to use this trick that exploits the multiplicativity of $\sigma(n)$, but then I found this approach due to Euler in 1751 that provides a recursive formula for the sum and skips the prime factorization. Since we’re searching over all integers, this allows us to trade off the runtime of each computation against the storage cost of past computations. I tried it in this commit, using python’s built-in LRU-cache wrapper to memoize the computation. The nice thing about this is that our tests are already there, and the interface for `divisor_sum` doesn’t change. This is on purpose, so that the caller of `divisor_sum` (in this case tests, also client code in real life) need not update when we improve the implementation. I also ran into a couple of stumbling blocks implementing the algorithm (I swapped the order of the if statements here), and the tests made it clear I messed up.
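For the curious, here is a sketch of a memoized recursive divisor sum, assuming the recurrence in question is Euler’s pentagonal-number formula: $\sigma(n) = \sigma(n-1) + \sigma(n-2) - \sigma(n-5) - \sigma(n-7) + \dots$, where the offsets are the generalized pentagonal numbers and a term $\sigma(0)$, if it occurs, is replaced by $n$. This is my own reconstruction and not the code from the commit.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def divisor_sum(n: int) -> int:
    '''Sum of divisors via Euler's pentagonal-number recurrence.'''
    if n <= 0:
        raise ValueError('n must be a positive integer')
    total = 0
    k = 1
    while True:
        sign = 1 if k % 2 == 1 else -1
        # The two generalized pentagonal numbers for this k: 1, 2, 5, 7, 12, 15, ...
        for offset in (k * (3 * k - 1) // 2, k * (3 * k + 1) // 2):
            if offset > n:
                return total
            # By convention, sigma(0) is replaced by n itself.
            term = n if offset == n else divisor_sum(n - offset)
            total += sign * term
        k += 1
```

Each call recurses down through `divisor_sum(n - 1)` when the cache is cold, so calling it directly on a large `n` runs into Python’s recursion limit.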

However, there are two major problems with that implementation.

- The code is still too slow. `best_witness(100000)` takes about 50 seconds to run, almost all of which is in `divisor_sum`.
- Python hits its recursion depth limit, and so the client code needs to eagerly populate the `divisor_sum` cache, which violates encapsulation. The caller should not know anything about the implementation, nor need to act in a specific way to accommodate hidden implementation details.

I also realized after implementing it that despite the extra storage space, the runtime is still $O(n^{3/2})$, because each divisor-sum call requires $O(n^{1/2})$ iterations of the loop. This is just as slow as a naive loop that checks divisibility of integers up to $n^{1/2}$. Also, a naive loop allows me to plug in a cool project called numba that automatically speeds up simple Python code by compiling it in place. Incidentally, numba is known to not work with `lru_cache`, so I can’t tack it on my existing implementation.

So I added numba as a dependency and drastically simplified the implementation. Now the tests run in 8 seconds, and in a few minutes I can upgrade John’s claim that $n = 10080$ is the *best* example between 5041 and a million, to the best example between 5041 and ten million.
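A simplified loop of this kind might look like the following sketch, which pairs each small divisor with its cofactor so the loop stops at the square root. It is plain Python so that it runs without extra dependencies; in the project, a numba JIT decorator (for example `numba.njit`) would presumably sit on top of a function like this, though I am not asserting which decorator the commit uses.

```python
def divisor_sum(n: int) -> int:
    '''Sum of divisors using a loop up to the square root of n.'''
    if n <= 0:
        raise ValueError('n must be a positive integer')
    total = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            total += i
            # Add the paired divisor n // i, unless i is the exact square root.
            if i * i != n:
                total += n // i
        i += 1
    return total
```

The body uses only integer arithmetic and a `while` loop, which is the style of code that JIT compilers like numba handle well.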

This should get you started with a solid pytest setup for your own project, but there is a lot more to say about how to organize and run tests, what kinds of tests to write, and how that all changes as your project evolves.

For this project, we now know that the divisor-sum computation is the bottleneck. We also know that the interesting parts of this project are yet to come. We want to explore the patterns in what makes these numbers large. One way we could go about this is to split the project into two components: one that builds/manages a database of divisor sums, and another that analyzes the divisor sums in various ways. The next article will show how the database set up works. When we identify relevant patterns, we can modify the search strategy to optimize for that. As far as testing goes, this would prompt us to have an interface layer between the two systems, and to add fakes or mocks to test the components in isolation.

After that, there’s the process of automating test running, adding tests for code quality/style, computing code coverage, adding a type-hint checker test, writing tests that generate other tests, etc.

If you’re interested, let me know which topics to continue with. I do feel a bit silly putting so much pomp and circumstance around such a simple computation, but hopefully the simplicity of the core logic makes the design and testing aspects of the project clearer and easier to understand.
