We’re glibly searching for counterexamples to the Riemann Hypothesis, to trick you into learning about software engineering principles. In the first two articles we configured a testing framework and showed how to hide implementation choices behind an interface. Next, we’ll improve the algorithm’s core routine. As before, I’ll link to specific git commits in the final code repository to show how the project evolves.
A superabundant number is one which has “maximal relative divisor sums” in the following sense: for all m < n,

σ(m)/m < σ(n)/n,

where σ(n) is the sum of the divisors of n.
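To make the definition concrete, here is a naive sketch that checks superabundance directly from the definition (the function names are my own, not the project's):

```python
def divisor_sum(n: int) -> int:
    # Naive O(sqrt(n)) sum of the divisors of n.
    total = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            total += d
            if d != n // d:
                total += n // d
        d += 1
    return total


def is_superabundant(n: int) -> bool:
    # n is superabundant if sigma(m)/m < sigma(n)/n for all m < n.
    ratio = divisor_sum(n) / n
    return all(divisor_sum(m) / m < ratio for m in range(1, n))


print([n for n in range(1, 50) if is_superabundant(n)])
# → [1, 2, 4, 6, 12, 24, 36, 48]
```

This quadratic-time check is only useful for small n, but it makes a handy reference implementation to test faster methods against.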
Erdős and Alaoglu proved in 1944 (“On highly composite and similar numbers”) that superabundant numbers have a specific prime decomposition, in which all initial primes occur with non-increasing exponents:

n = 2^{a_1} · 3^{a_2} · 5^{a_3} ⋯ p_k^{a_k},

where p_i is the i-th prime, and a_1 ≥ a_2 ≥ ⋯ ≥ a_k ≥ 1. With two exceptions (n = 4 and n = 36), a_k = 1.
Here’s a rough justification for why superabundant numbers should have a decomposition like this. If you want a number with many divisors (compared to the size of the number), you want to pack as many combinations of small primes into the decomposition of your number as possible. Using all 2’s leads to not enough combinations—only m + 1 divisors for 2^m—but using 2’s and 3’s you get (r + 1)(s + 1) divisors for 2^r · 3^s. Using more 3’s trades off a larger number for the benefit of a larger σ(n) (up to a point). The balance between getting more distinct factor combinations and a larger σ(n) favors packing the smallest primes in there.
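The divisor-count bookkeeping above can be checked in a few lines (a toy illustration, not code from the project):

```python
from functools import reduce


def num_divisors_from_exponents(exponents):
    # A number 2^a1 * 3^a2 * ... * pk^ak has prod(ai + 1) divisors,
    # one for each choice of exponent for each prime.
    return reduce(lambda acc, a: acc * (a + 1), exponents, 1)


# All 2's vs. mixing in a 3: 2^5 = 32 has 6 divisors,
# while 2^3 * 3 = 24 (a smaller number!) has 8.
print(num_divisors_from_exponents([5]))     # → 6
print(num_divisors_from_exponents([3, 1]))  # → 8
```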
Though numbers of this form are not necessarily superabundant, this gives us an enumeration strategy better than trying all numbers. Enumerate over tuples corresponding to the exponents of the prime decomposition (non-increasing lists of integers), and save those primes to make it easier to compute the divisor sum.
Non-increasing lists of integers can be enumerated in the order of their sum, and for each sum n, the set of non-increasing lists of positive integers summing to n is called the partitions of n. There is a simple algorithm to compute them, implemented in this commit. Note this does not enumerate them in order of the magnitude of the corresponding number 2^{a_1} · 3^{a_2} ⋯ p_k^{a_k}.
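The linked commit has the project's implementation; for illustration, here is one standard recursive sketch of the same enumeration, yielding each partition as a non-increasing list:

```python
def partitions_of_n(n: int, max_part: int = None):
    # Generate the partitions of n as non-increasing lists of positive
    # integers, largest first.
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for first in range(min(n, max_part), 0, -1):
        # Every part after `first` is capped at `first`, which is what
        # keeps the lists non-increasing.
        for rest in partitions_of_n(n - first, first):
            yield [first] + rest


print(list(partitions_of_n(4)))
# → [[4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]]
```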
The implementation for the prime-factorization-based divisor sum computation is in this commit. In addition, to show some alternative methods of testing, we used the hypothesis library to autogenerate tests. It chooses a random (limited size) prime factorization, and compares the prime-factorization-based algorithm to the naive algorithm. There’s a bit of setup code involved, but as a result we get dozens of tests and more confidence it’s right.
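The prime-factorization-based computation rests on σ being multiplicative: for n = ∏ p_i^{a_i}, we have σ(n) = ∏ (p_i^{a_i + 1} − 1)/(p_i − 1), each factor being the geometric series 1 + p_i + ⋯ + p_i^{a_i}. A minimal sketch (the function name is mine, not necessarily the repo's):

```python
def divisor_sum_from_factorization(factorization):
    # factorization is a list of (prime, exponent) pairs.
    # sigma is multiplicative, and sigma(p^a) = (p^(a+1) - 1) / (p - 1).
    total = 1
    for (p, a) in factorization:
        total *= (p ** (a + 1) - 1) // (p - 1)
    return total


# 10080 = 2^5 * 3^2 * 5 * 7
print(divisor_sum_from_factorization([(2, 5), (3, 2), (5, 1), (7, 1)]))
# → 39312
```

No trial division is needed, which is what makes this so much faster than the naive loop over divisors.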
We now have two search strategies over the space of natural numbers, though one is obviously better. We may come up with a third, so it makes sense to separate the search strategy from the main application by an interface. Generally, if you have a hard-coded implementation, and you realize that you need to change it in a significant way, that’s a good opportunity to extract it and hide it behind an interface.
A good interface choice is a bit tricky here, however. In the original implementation, we could say, “process the batch of numbers (search for counterexamples) between 1 and 2 million.” When that batch is saved to the database, we would start on the next batch, and all the batches would be the same size, so (ignoring that computing the old way takes longer as n grows) each batch required roughly the same time to run.
The new search strategy doesn’t have a sensible way to do this. You can’t say “start processing from K” because we don’t know how to easily get from K to the parameter of the enumeration corresponding to K (if one exists). This is partly because our enumeration isn’t monotonically increasing in the magnitude of the number (for example, 2 · 3 · 5 = 30 occurs at level 3, before 2^4 = 16 at level 4). And partly because even if we did have a scheme, it would almost certainly require us to compute a prime factorization, which is slow. It would be better if we could save the data from the latest step of the enumeration, and load it up when starting the next batch of the search.
This scheme suggests a nicely generic interface for stopping and restarting a search from a particular spot. The definition of a “spot,” and how to start searching from that spot, are what’s hidden by the interface. Here’s a first pass.
SearchState = TypeVar('SearchState')


class SearchStrategy(ABC):
    @abstractmethod
    def starting_from(self, search_state: SearchState) -> SearchStrategy:
        '''Reset the search strategy to search from a given state.'''
        pass

    @abstractmethod
    def search_state(self) -> SearchState:
        '''Get an object describing the current state of the enumeration.'''
        pass

    @abstractmethod
    def next_batch(self, batch_size: int) -> List[RiemannDivisorSum]:
        '''Process the next batch of Riemann Divisor Sums'''
        pass
SearchState is defined as a generic type variable because we cannot say anything about its structure yet. The implementation class is responsible for defining what constitutes a search state, and getting the search strategy back to the correct step of the enumeration given the search state as input. Later I realized we do need some structure on the SearchState—the ability to serialize it for storage in the database—so we eventually elevated it to an interface.
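As a sketch of what that strengthened interface might look like—the method names here are my guesses, though SuperabundantEnumerationIndex does appear in the program's output later in the article:

```python
from abc import ABC, abstractmethod


class SerializableSearchState(ABC):
    @abstractmethod
    def serialize(self) -> str:
        '''Convert the state to a string for database storage.'''

    @classmethod
    @abstractmethod
    def deserialize(cls, serialized: str) -> 'SerializableSearchState':
        '''Reconstruct a state from its stored string form.'''


class SuperabundantEnumerationIndex(SerializableSearchState):
    # The state of the partition enumeration: which sum ("level") we are
    # on, and how far into that level's list of partitions we are.
    def __init__(self, level: int, index_in_level: int):
        self.level = level
        self.index_in_level = index_in_level

    def serialize(self) -> str:
        return f"{self.level},{self.index_in_level}"

    @classmethod
    def deserialize(cls, serialized: str) -> 'SuperabundantEnumerationIndex':
        level, index_in_level = serialized.split(",")
        return cls(int(level), int(index_in_level))
```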
Also note that we are making SearchStrategy own the job of computing the Riemann divisor sums. This is because the enumeration details and the algorithm to compute the divisor sums are now coupled. For the exhaustive search strategy it was “integers n, naively loop over smaller divisors.” In the new strategy it’s “prime factorizations, prime-factorization-based divisor sum.” We could decouple this, but there is little reason to now because the implementations are still in 1-1 correspondence.
This commit implements the old search strategy in terms of this interface, and this commit implements the new search strategy. In the latter, I use pytest.mark.parametrize to test against the interface and parametrize over the implementations.
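The pattern looks roughly like the following self-contained sketch, with stub strategies standing in for the real implementations (the stubs and the test are mine, not the repo's):

```python
import pytest


# Stubs standing in for ExhaustiveSearchStrategy and
# SuperabundantSearchStrategy; each just returns a batch of integers.
class StubStrategyA:
    def next_batch(self, batch_size):
        return list(range(1, batch_size + 1))


class StubStrategyB:
    def next_batch(self, batch_size):
        return list(range(1, batch_size + 1))


# The same test body runs once per implementation, so any behavior
# required by the interface is checked uniformly across strategies.
@pytest.mark.parametrize("strategy_factory", [StubStrategyA, StubStrategyB])
def test_next_batch_has_requested_size(strategy_factory):
    batch = strategy_factory().next_batch(batch_size=10)
    assert len(batch) == 10
```

Testing against the interface, rather than each implementation separately, means a third strategy only needs to be added to the parametrize list.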
The last needed bit is the ability to store and recover the search state in between executions of the main program. This requires a second database table. The minimal thing we could do is just store and update a single row for each search strategy, providing the search state as of the last time the program was run and stopped. This would do, but in my opinion an append-only log is a better design for such a table. That is, each batch computed will have a record containing the timestamp the batch started and finished, along with the starting and ending search state. We can use the largest timestamp for a given search strategy to pick up where we left off across program runs.
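A minimal sqlite3 sketch of the append-only approach (the table and column names are my guesses, not the project's schema; the example states mirror the program output shown later in the article):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per completed batch; rows are only ever appended.
db.execute("""
    CREATE TABLE SearchMetadata (
        start_time TEXT,
        end_time TEXT,
        search_state_type TEXT,
        starting_search_state TEXT,
        ending_search_state TEXT
    )
""")
db.execute(
    "INSERT INTO SearchMetadata VALUES (?, ?, ?, ?, ?)",
    ("2021-01-01T00:00:00", "2021-01-01T00:01:00",
     "SuperabundantEnumerationIndex", "1,0", "10,4"),
)
db.execute(
    "INSERT INTO SearchMetadata VALUES (?, ?, ?, ?, ?)",
    ("2021-01-01T00:01:00", "2021-01-01T00:02:00",
     "SuperabundantEnumerationIndex", "10,4", "12,6"),
)

# To pick up where we left off: the ending state of the most
# recently finished batch for this search strategy.
row = db.execute("""
    SELECT ending_search_state FROM SearchMetadata
    WHERE search_state_type = ?
    ORDER BY end_time DESC LIMIT 1
""", ("SuperabundantEnumerationIndex",)).fetchone()
print(row[0])  # → 12,6
```

Because old rows are never overwritten, the log doubles as a record of how long each batch took, which is handy for spotting performance regressions.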
One can imagine this being the basis for an application like folding@home or the BOINC family of projects, where a database stores chunks of a larger computation (ranges of a search space), clients can request chunks to complete, and they are assembled into a complete database. In this case we might want to associate the chunk metadata with the computed results (say, via a foreign key). That would require a bit of work from what we have now, but note that the interfaces would remain reusable for this. For now, we will just incorporate the basic table approach. It is completed in this pull request, and tying it into the main search routine is done in this commit.
However, when running it with the superabundant search strategy, we immediately run into a problem. Superabundant numbers grow too fast, and within a few small batches of size 100 we quickly exceed the 64 bits available to numba and sqlite to store the relevant data.
>>> fac = partition_to_prime_factorization(partitions_of_n(16))
>>> fac2 = [p**d for (p, d) in fac]
>>> fac2
[16, 81, 625, 2401, 11, 13, 17, 19, 23, 29, 31, 37]
>>> math.log2(reduce(lambda x, y: x*y, fac2))
65.89743638933722
Running populate_database.py results in the error
$ python -m riemann.populate_database db.sqlite3 SuperabundantSearchStrategy 100
Searching with strategy SuperabundantSearchStrategy
Starting from search state SuperabundantEnumerationIndex(level=1, index_in_level=0)
Computed [1,0, 10,4] in 0:00:03.618798
Computed [10,4, 12,6] in 0:00:00.031451
Computed [12,6, 13,29] in 0:00:00.031518
Computed [13,29, 14,28] in 0:00:00.041464
Computed [14,28, 14,128] in 0:00:00.041674
Computed [14,128, 15,93] in 0:00:00.034419
...
OverflowError: Python int too large to convert to SQLite INTEGER
We’ll see what we can do about this in a future article, but meanwhile we do get some additional divisor sums for these large numbers, and 10080 is still the best.
sqlite> select n, witness_value from RiemannDivisorSums
   ...> where witness_value > 1.7 and n > 5040
   ...> order by witness_value desc limit 10;
10080|1.7558143389253
55440|1.75124651488749
27720|1.74253672381383
7560|1.73991651920276
15120|1.73855867428903
160626866400|1.73744669257158
321253732800|1.73706925385011
110880|1.73484901030336
6983776800|1.73417642212953
720720|1.73306535623807