Searching for RH Counterexamples — Adding a Database

In the last article we set up pytest for a simple application that computes divisor sums \sigma(n) and tries to disprove the Riemann Hypothesis. In this post we’ll show how to extend the application as we add a database dependency. The database stores the computed sums so we can analyze them after our application finishes.

As in the previous post, I’ll link to specific git commits in the final code repository to show how the project evolves. You can browse or checkout the repository at each commit to see how it works.

Interface before implementation

The approach we’ll take is one that highlights the principle of good testing and good software design: separate components by thin interfaces so that the implementations of those interfaces can change later without needing to update lots of client code.

We’ll take this to the extreme by implementing and testing the logic for our application before we ever decide what sort of database we plan to use! In other words, the choice of database will be our last choice, making it inherently flexible to change. That is, first we iron out a minimal interface that our application needs, and then choose the right database based on those needs. This is useful because software engineers often don’t understand how the choice of a dependency (especially a database dependency) will work out long term, particularly as a prototype starts to scale and hit application-specific bottlenecks. Couple this with the industry’s trend of chasing hot new fads, and eventually you realize no choice is sacred. Interface separation is the software engineer’s only defense, and their most potent tool for flexibility. As a side note, Tom Gamon summarizes this attitude well in a recent article, borrowing the analogy from a 1975 investment essay The Winner’s Game by Charles Ellis. Some of his other articles reinforce the idea that important decisions should be made as late as possible, since that is the only time you know enough to make those decisions well.

Our application has two parts so far: adding new divisor sums to the database, and loading divisor sums for analysis. Since we’ll be adding to this database over time, it may also be prudent to summarize the contents of the database, e.g. to say what’s the largest computed integer. This suggests the following first-pass interface, implemented in this commit.

class DivisorDb(ABC):
    @abstractmethod
    def load() -> List[RiemannDivisorSum]:
        '''Load the entire database.'''
        pass

    @abstractmethod
    def upsert(data: List[RiemannDivisorSum]) -> None:
        '''Insert or update data.'''
        pass

    @abstractmethod
    def summarize() -> SummaryStats:
        '''Summarize the contents of the database.'''
        pass

RiemannDivisorSum and SummaryStats are dataclasses. These are special classes that are intended to have restricted behavior: storing data and providing simple derivations on that data. For us this provides a stabler interface because the contents of the return values can change over time without interrupting other code. For example, we might want to eventually store the set of divisors alongside their sum. Compare this to returning a list or tuple, which is brittle when used with things like tuple assignment.

The other interesting tidbit about the commit is the use of abstract base classes (“ABC”, an awful name choice). Python has limited support for declaring an “interface” as many other languages do. The pythonic convention was always to use its “duck-typing” feature, which meant to just call whatever methods you want on an object, and then any object that supports has those methods can be used in that spot. The mantra was, “if it walks like a duck and talks like a duck, then it’s a duck.” However, there was no way to say “a duck is any object that has a waddle and quack method, and those are the only allowed duck functions.” As a result, I often saw folks tie their code to one particular duck implementation. That said, there were some mildly cumbersome third party libraries that enabled interface declarations. Better, recent versions of Python introduced the abstract base class as a means to enforce interfaces, and structural subtyping (typing.Protocol) to interact with type hints when subtyping directly is not feasible (e.g., when the source is in different codebases).

Moving on, we can implement an in-memory database that can be used for testing. This is done in this commit. One crucial aspect of these tests is that they do not rely on the knowledge that the in-memory database is secretly a dictionary. That is, the tests use only the DivisorDb interface and never inspect the underlying dict. This allows the same tests to run against all implementations, e.g., using pytest.parameterize. Also note it’s not thread safe or atomic, but for us this doesn’t really matter.

Injecting the Interface

With our first-pass database interface and implementation, we can write the part of the application that populates the database with data. A simple serial algorithm that computes divisor sums in batches of 100k until the user hits Ctrl-C is done in this commit.

def populate_db(db: DivisorDb, batch_size: int = 100000) -> None:
    '''Populate the db in batches.'''
    starting_n = (db.summarize().largest_computed_n or 5040) + 1
    while True:
        ending_n = starting_n + batch_size
        db.upsert(compute_riemann_divisor_sums(starting_n, ending_n))
        starting_n = ending_n + 1

I only tested this code manually. The reason is that line 13 (highlighted in the abridged snippet above) is the only significant behavior not already covered by the InMemoryDivisorDb tests. (Of course, that line had a bug later fixed in this commit). I’m also expecting to change it soon, and spending time testing vs implementing features is a tradeoff that should not always fall on the side of testing.

Next let’s swap in a SQL database. We’ll add sqlite3, which comes prepackaged with python, so needs no dependency management. The implementation in this commit uses the same interface as the in-memory database, but the implementation is full of SQL queries. With this, we can upgrade our tests to run identically on both implementations. The commit looks large, but really I just indented all the existing tests, and added the pytest parameterize annotation to the class definition (and corresponding method arguments). This avoids adding a parameterize annotation to every individual test function—which wouldn’t be all that bad, but each new test would require the writer to remember to include the annotation, and this way systematically requires the extra method argument.

And finally, we can switch the database population script to use the SQL database implementation. This is done in this commit. Notice how simple it is, and how it doesn’t require any extra testing.

After running it a few times and getting a database with about 20 million rows, we can apply the simplest possible analysis: showing the top few witness values.

sqlite> select n, witness_value from RiemannDivisorSums where witness_value > 1.7 order by witness_value desc limit 100;
10080|1.7558143389253
55440|1.75124651488749
27720|1.74253672381383
7560|1.73991651920276
15120|1.73855867428903
110880|1.73484901030336
720720|1.73306535623807
1441440|1.72774021157846
166320|1.7269287425473
2162160|1.72557022852613
4324320|1.72354665986337
65520|1.71788900114772
3603600|1.71646721405987
332640|1.71609697536058
10810800|1.71607328780293
7207200|1.71577914933961
30240|1.71395368739173
20160|1.71381061514181
25200|1.71248203640096
83160|1.71210965310318
360360|1.71187211014506
277200|1.71124375582698
2882880|1.7106690212765
12252240|1.70971873843453
12600|1.70953565488377
8648640|1.70941081706371
32760|1.708296575835
221760|1.70824623791406
14414400|1.70288499724944
131040|1.70269370474016
554400|1.70259313608473
1081080|1.70080265951221

We can also confirm John’s claim that “the winners are all multiples of 2520,” as the best non-multiple-of-2520 up to 20 million is 18480, whose witness value is only about 1.69.

This multiple-of-2520 pattern is probably because 2520 is a highly composite number, i.e., it has more divisors than all smaller numbers, so its sum-of-divisors will tend to be large. Digging in a bit further, it seems the smallest counterexample, if it exists, is necessarily a superabundant number. Such numbers have a nice structure described here that suggests a search strategy better than trying every number.

Next time, we can introduce the concept of a search strategy as a new component to the application, and experiment with different search strategies. Other paths forward include building a front-end component, and deploying the system on a server so that the database can be populated continuously.

Searching for RH Counterexamples — Setting up Pytest

Some mathy-programmy people tell me they want to test their code, but struggle to get set up with a testing framework. I suspect it’s due to a mix of:

  • There are too many choices with a blank slate.
  • Making slightly wrong choices early on causes things to fail in unexpected ways.

I suspect the same concerns apply to general project organization and architecture. Because Python is popular for mathy-programmies, I’ll build a Python project that shows how I organize my projects and and test my code, and how that shapes the design and evolution of my software. I will use Python 3.8 and pytest, and you can find the final code on Github.

For this project, we’ll take advice from John Baez and explore a question that glibly aims to disprove the Riemann Hypothesis:

A CHALLENGE:

Let σ(n) be the sum of divisors of n. There are infinitely many n with σ(n)/(n ln(ln(n)) > 1.781. Can you find one? If you can find n > 5040 with σ(n)/(n ln(ln(n)) > 1.782, you’ll have disproved the Riemann Hypothesis.

I don’t expect you can disprove the Riemann Hypothesis this way, but I’d like to see numbers that make σ(n)/(n ln(ln(n)) big. It seems the winners are all multiples of 2520, so try those. The best one between 5040 and a million is n = 10080, which only gives 1.755814.

https://twitter.com/johncarlosbaez/status/1149700802371608576

Initializing the Project

One of the hardest parts of software is setting up your coding environment. If you use an integrated development environment (IDE), project setup is bespoke to each IDE. I dislike this approach, because what you learn when using the IDE is not useful outside the IDE. When I first learned to program (Java), I was shackled to Eclipse for years because I didn’t know how to compile and run Java programs without it. Instead, we’ll do everything from scratch, using only the terminal/shell and standard Python tools. I will also ignore random extra steps and minutiae I’ve built up over the years to deal with minor issues. If you’re interested in that and why I do them, leave a comment and I might follow up with a second article.

This article assumes you are familiar with the basics of Python syntax, and know how to open a terminal and enter basic commands (like ls, cd, mkdir, rm). Along the way, I will link to specific git commits that show the changes, so that you can see how the project unfolds with each twist and turn.

I’ll start by creating a fresh Python project that does nothing. We set up the base directory riemann-divisor-sum, initialize git, create a readme, and track it in git (git add + git commit).

mkdir riemann-divisor-sum
cd riemann-divisor-sum
git init .
echo "# Divisor Sums for the Riemann Hypothesis" > README.md
git add README.md
git commit -m "add empty README.md"

Next I create a Github project at https://github.com/j2kun/riemann-divisor-sum (the name riemann-divisor-sum does not need to be the same, but I think it’s good), and push the project up to Github.

git remote add origin git@github.com:j2kun/riemann-divisor-sum.git
# instead of "master", my default branch is really "main"
git push -u origin master   

Note, if you’re a new Github user, the “default branch name” when creating a new project may be “master.” I like “main” because it’s shorter, clearer, and nicer. If you want to change your default branch name, you can update to git version 2.28 and add the following to your ~/.gitconfig file.

[init]
    defaultBranch = main

Here is what the project looks like on Github as of this single commit.

Pytest

Next I’ll install the pytest library which will run our project’s tests. First I’ll show what a failing test looks like, by setting up a trivial program with an un-implemented function, and a corresponding test. For ultimate simplicity, we’ll use Python’s built-in assert for the test lines. Here’s the commit.

# in the terminal
mkdir riemann
mkdir tests


# create riemann/divisor.py containing:
'''Compute the sum of divisors of a number.'''

def divisor_sum(n: int) -> int:
    raise ValueError("Not implemented.")


# create tests/divisor_test.py containing:
from riemann.divisor import divisor_sum

def test_sum_of_divisors_of_72():
    assert 195 == divisor_sum(72)

Next we install and configure Pytest. At this point, since we’re introducing a dependency, we need a project-specific place to store that dependency. All dependencies related to a project should be explicitly declared and isolated. This page helps explain why. Python’s standard tool is the virtual environment. When you “activate” the virtual environment, it temporarily (for the duration of the shell session or until you run deactivate) points all Python tools and libraries to the virtual environment.

virtualenv -p python3.8 venv
source venv/bin/activate

# shows the location of the overridden python binary path
which python
# outputs: /Users/jeremy/riemann-divisor-sum/venv/bin/python

Now we can use pip as normal and it will install to venv. To declare and isolate the dependency, we write the output of pip freeze to a file called requirements.txt, and it can be reinstalled using pip install -r requirements.txt. Try deleting your venv directory, recreating it, and reinstalling the dependencies this way.

pip install pytest
pip freeze > requirements.txt
git add requirements.txt
git commit -m "requirements: add pytest"

# example to wipe and reinstall
# deactivate
# rm -rf venv
# virtualenv -p python3.8 venv
# source venv/bin/activate
# pip install -r requirements.txt

As an aside, at this step you may notice git mentions venv is an untracked directory. You can ignore this, or add venv to a .gitignore file to tell git to ignore it, as in this commit. We will also have to configure pytest to ignore venv shortly.

When we run pytest (with no arguments) from the base directory, we see our first error:

    from riemann.divisor import divisor_sum
E   ModuleNotFoundError: No module named 'riemann'

Module import issues are a common stumbling block for new Python users. In order to make a directory into a Python module, it needs an __init__.py file, even if it’s empty. Any code in this file will be run the first time the module is imported in a Python runtime. We add one to both the code and test directories in this commit.

When we run pytest (with no arguments), it recursively searches the directory tree looking for files like *_test.py and test_*.py loads them, and treats every method inside those files that are prefixed with “test” as a test. Non-“test” methods can be defined and used as helpers to set up complex tests. Pytest then runs the tests, and reports the failures. For me this looks like

Our first test failure.

Our implementation is intentionally wrong for demonstration purposes. When a test passes, pytest will report it quietly as a “.” by default. See these docs for more info on different ways to run the pytest binary and configure its output report.

In this basic pytest setup, you can put test files wherever you want, name the files and test methods appropriately, and use assert to implement the tests themselves. As long as your modules are set up properly, as long as imports are absolute (see this page for gory details on absolute vs. relative imports), and as long as you run pytest from the base directory, pytest will find the tests and run them.

Since pytest searches all directories for tests, this includes venv and __pycache__, which magically appears when you create python modules (I add __pycache__ to gitignore). Sometimes package developers will include test code, and pytest will then run those tests, which often fail or clutter the output. A virtual environment also gets large as you install big dependencies (like numpy, scipy, pandas), so this makes pytest slow to search for tests to run. To alleviate, the --norecursedirs command line flag tells pytest to skip directories. Since it’s tedious to type --norecursedirs='venv __pycache__' every time you run pytest, you can make this the default behavior by storing the option in a configuration file recognized by pytest, such as setup.cfg. I did it in this commit.

Some other command line options that I use all the time:

  • pytest test/dir to test only files in that directory, or pytest test/dir/test_file.py to test only tests in that file.
  • pytest -k STR to only run tests whose name contains “STR”
  • pytest -s to see see any logs or print statements inside tested code
  • pytest -s to allow the pdb/ipdb debugger to function and step through a failing test.

Building up the project

Now let’s build up the project. My general flow is as follows:

  1. Decide what work to do next.
  2. Sketch out the interface for that work.
  3. Write some basic (failing, usually lightweight) tests that will pass when the work is done.
  4. Do the work.
  5. Add more nuanced tests if needed, based on what is learned during the work.
  6. Repeat until the work is done.

This strategy is sometimes called “the design recipe,” and I first heard about it from my undergraduate programming professor John Clements at Cal Poly, via the book “How to Design Programs.” Even if I don’t always use it, I find it’s a useful mental framework for getting things done.

For this project, I want to search through positive integers, and for each one I want to compute a divisor sum, do some other arithmetic, and compare that against some other number. I suspect divisor sum computations will be the hard/interesting part, but to start I will code up a slow/naive implementation with some working tests, confirm my understanding of the end-to-end problem, and then improve the pieces as needed.

In this commit, I implement the naive divisor sum code and tests. Note the commit also shows how to tell pytest to test for a raised exception. In this commit I implement the main search routine and confirm John’s claim about n=10080 (thanks for the test case!).

These tests already showcase a few testing best practices:

  • Test only one behavior at a time. Each test has exactly one assertion in it. This is good practice because when a test fails you won’t have to dig around to figure out exactly what went wrong.
  • Use the tests to help you define the interface, and then only test against that interface. The hard part about writing clean and clear software is defining clean and clear interfaces that work together well and hide details. Math does this very well, because definitions like \sigma(n) do not depend on how n is represented. In fact, math really doesn’t have “representations” of its objects—or more precisely, switching representations is basically free, so we don’t dwell on it. In software, we have to choose excruciatingly detailed representations for everything, and so we rely on the software to hide those details as much as possible. The easiest way to tell if you did it well is to try to use the interface and only the interface, and tests are an excuse to do that, which is not wasted effort by virtue of being run to check your work.

Improving Efficiency

Next, I want to confirm John’s claim that n=10080 is the best example between 5041 and a million. However, my existing code is too slow. Running the tests added in this commit seems to take forever.

We profile to confirm our suspected hotspot:

>>> import cProfile
>>> from riemann.counterexample_search import best_witness
>>> cProfile.run('best_witness(10000)')
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
54826    3.669    0.000    3.669    0.000 divisor.py:10(<genexpr>)

As expected, computing divisor sums is the bottleneck. No surprise there because it makes the search take quadratic time. Before changing the implementation, I want to add a few more tests. I copied data for the first 50 integers from OEIS and used pytest’s parameterize feature since the test bodies are all the same. This commit does it.

Now I can work on improving the runtime of the divisor sum computation step. Originally, I thought I’d have to compute the prime factorization to use this trick that exploits the multiplicativity of \sigma(n), but then I found this approach due to Euler in 1751 that provides a recursive formula for the sum and skips the prime factorization. Since we’re searching over all integers, this allows us to trade off the runtime of each \sigma(n) computation against the storage cost of past \sigma(n) computations. I tried it in this commit, using python’s built-in LRU-cache wrapper to memoize the computation. The nice thing about this is that our tests are already there, and the interface for divisor_sum doesn’t change. This is on purpose, so that the caller of divisor_sum (in this case tests, also client code in real life) need not update when we improve the implementation. I also ran into a couple of stumbling blocks implementing the algorithm (I swapped the order of the if statements here), and the tests made it clear I messed up.

However, there are two major problems with that implementation.

  1. The code is still too slow. best_witness(100000) takes about 50 seconds to run, almost all of which is in divisor_sum.
  2. Python hits its recursion depth limit, and so the client code needs to eagerly populate the divisor_sum cache, which is violates encapsulation. The caller should not know anything about the implementation, nor need to act in a specific way to accommodate hidden implementation details.

I also realized after implementing it that despite the extra storage space, the runtime is still O(n^{3/2}), because each divisor-sum call requires O(n^{1/2}) iterations of the loop. This is just as slow as a naive loop that checks divisibility of integers up to \sqrt{n}. Also, a naive loop allows me to plug in a cool project called numba that automatically speeds up simple Python code by compiling it in place. Incidentally, numba is known to not work with lru_cache, so I can’t tack it on my existing implementation.

So I added numba as a dependency and drastically simplified the implementation. Now the tests run in 8 seconds, and in a few minutes I can upgrade John’s claim that n=10080 is the best example between 5041 and a million, to the best example between 5041 and ten million.

Next up

This should get you started with a solid pytest setup for your own project, but there is a lot more to say about how to organize and run tests, what kinds of tests to write, and how that all changes as your project evolves.

For this project, we now know that the divisor-sum computation is the bottleneck. We also know that the interesting parts of this project are yet to come. We want to explore the patterns in what makes these numbers large. One way we could go about this is to split the project into two components: one that builds/manages a database of divisor sums, and another that analyzes the divisor sums in various ways. The next article will show how the database set up works. When we identify relevant patterns, we can modify the search strategy to optimize for that. As far as testing goes, this would prompt us to have an interface layer between the two systems, and to add fakes or mocks to test the components in isolation.

After that, there’s the process of automating test running, adding tests for code quality/style, computing code coverage, adding a type-hint checker test, writing tests that generate other tests, etc.

If you’re interested, let me know which topics to continue with. I do feel a bit silly putting so much pomp and circumstance around such a simple computation, but hopefully the simplicity of the core logic makes the design and testing aspects of the project clearer and easier to understand.

Taylor Series and Accelerometers

In my book, A Programmer’s Introduction to Mathematics, I describe the Taylor Series as a “hammer for every nail.” I learned about another nail in the design of modern smartphone accelerometers from “Eight Amazing Engineering Stories” by Hammack, Ryan, and Ziech, which I’ll share here.

These accelerometers are designed using a system involving three plates, which correspond to two capacitors. A quick recap on my (limited) understanding of how capacitors work. A capacitor involving two conductive plates looks like this:

A two-plate capacitor hooked up to a battery. (source)

The voltage provided by the battery pushes electrons along the negative direction, or equivalently pushing “charge” along the positive direction (see the difference between charge flow and election flow). These elections build up in the plate labeled -Q, and the difference in charge across the two plates generates an electric field. If that electric field is strong enough, the electrons can jump the gap to the positive plate and complete the circuit. Otherwise, the plate reaches “capacity” and current stops flowing. Whether the jump happens or the current stops depends on the area of the plate A, the distance between the plates d, and the properties of the material between the plates, the last one is called the “dielectric constant” \varepsilon. (Nb., I’m not sure why it doesn’t depend on the material the plate is composed of, but I imagine it’s smooshed into the dielectric constant if necessary) This relationship is summarized by the formula

\displaystyle C = \frac{\varepsilon A}{d}

Then, an external event can cause the plates to move close enough together so that the electrons can jump the gap and current can begin to flow. This discharges the negatively charged plate.

A naive, Taylor-series-free accelerometer could work as follows:

  1. Allow the negatively charged plate to wobble a little bit by fixing just one end of the plate, pictured like a diving board (a cantilever).
  2. The amount of wobble will be proportional to the force of acceleration due to Hooke’s law for springs.
  3. When displaced by a distance of \delta, the capacitance in the plate changes to C = \frac{\varepsilon A}{d - \delta}.
  4. Use the amount of discharge to tell how much the plate displaced.

This is able to measure the force of acceleration in one dimension, and so thee of these devices are arranged in perpendicular axes to allow one to measure acceleration in 3-dimensional space.

The problem with this design is that C = \frac{\varepsilon A}{d - \delta} is a nonlinear change in capacitance with respect to the amount of displacement. To see how nonlinear, expand this as a Taylor series:

\displaystyle \begin{aligned} C &= \frac{\varepsilon A}{d - \delta} \\ &= \frac{\varepsilon A}{d} \left ( \frac{1}{1 -  \frac{\delta}{d}} \right ) \\ &= \frac{\varepsilon A}{d} \left ( 1 + \frac{\delta}{d} + \left ( \frac{\delta}{d} \right )^2 + O_{\delta \to 0}(\delta^3) \right ) \end{aligned}

I’m using the big-O notation O_{\delta \to 0}(\delta^3) to more rigorously say that I’m “ignoring” all cubic and higher terms. I can do this because in these engineering systems (I’m taking Hammack at his word here), the quantity (\delta / d)^2 is meaningfully large, but later terms like (\delta / d)^3 are negligibly small. Of course, this is only true when the displacement \delta is very small compared to d, which is why the big-O has a subscript \delta \to 0.

Apparently, working backwards through the nonlinearity in the capacitance change is difficult enough to warrant changing the design of the system. (I don’t know why this is difficult, but I imagine it has to do with the engineering constraints of measurement devices; please do chime in if you know more)

The system design that avoids this is a three-plate system instead of a two-plate system.

A three-plate system design. The middle plate wobbles between two charged plates. (source)

In this system, the middle plate moves back and forth between two stationary plates that are connected to a voltage source. As it moves away from one and closer to the other, the increased capacitance on one side is balanced by the decreased capacitance on the other. The Taylor series shows how these two changes cancel out on the squared term only.

If C_1 = \frac{\varepsilon A}{d - \delta} represents the changed capacitance of the left plate (the middle plate moves closer to it), and C_2 = \frac{\varepsilon A}{d + \delta} represents the right plate (the middle plate moves farther from it), then we expand the difference in capacitances via Taylor series (using the Taylor series for 1/(1-x) for both, but in the 1 + \delta/d case it’s 1 / (1 - (-x))).

\displaystyle \begin{aligned} C_1 - C_2 &= \frac{\varepsilon A}{d - \delta} - \frac{\varepsilon A}{d + \delta} \\ &= \frac{\varepsilon A}{d} \left ( \frac{1}{1 - \frac{\delta}{d}} - \frac{1}{1 + \frac{\delta}{d}} \right ) \\ &= \frac{\varepsilon A}{d} \left ( 1 + \frac{\delta}{d} + \left ( \frac{\delta}{d} \right )^2 + O_{\delta \to 0}(\delta^3) - 1 + \frac{\delta}{d} - \left ( \frac{\delta}{d} \right )^2 + O_{\delta \to 0}(\delta^3) \right ) \\ &= \frac{\varepsilon A}{d} \left ( \frac{2\delta}{d} + O_{\delta \to 0}(\delta^3) \right ) \end{aligned}

Again, since the cubic and higher terms are negligibly small, we can “ignore” those parts. What remains is a linear response to the change in the middle plate’s displacement. This makes it significantly easier to measure. Because we’re measuring the difference in capacitance, this design is called a “differential capacitor.”

Though the math is tidy in retrospect, I marvel at how one might have conceived of this design from scratch. Did the inventor notice the symmetries in the Taylor series approximations could be arranged to negate each other? Was there some other sort of “physical intuition” at play?

Until next time!

Silent Duels—Parsing the Construction

Last time we discussed the setup for the silent duel problem: two players taking actions in [0,1], player 1 gets n chances to act, player 2 gets m, and each knows their probability of success when they act.

The solution is in a paper of Rodrigo Restrepo from the 1950s. In this post I’ll start detailing how I study this paper, and talk through my thought process for approaching a bag of theorems and proofs. If you want to follow along, I re-typeset the paper on Github.

Game Theory Basics

The Introduction starts with a summary of the setting of game theory. I remember most of this so I will just summarize the basics of the field. Skip ahead if you already know what the minimax theorem is, and what I mean when I say the “value” of a game.

A two-player game consists of a set of actions for each player—which may be finite or infinite, and need not be the same for both players—and a payoff function for each possible choice of actions. The payoff function is interpreted as the “utility” that player 1 gains and player 2 loses. If the payoff is negative, you interpret it as player 1 losing utility to player 2. Utility is just a fancy way of picking a common set of units for what each player treasures in their heart of hearts. Often it’s stated as money and we assume both players value cash the same way. Games in which the utility is always “one player gains exactly the utility lost by the other player” are called zero-sum.

With a finite set of actions, the payoff function is a table. For rock-paper-scissors the table is:

Rock, paper: -1
Rock, scissors: 1
Rock, rock: 0
Paper, paper: 0
Paper, scissors: -1
Paper, rock: 1
Scissors, paper: 1
Scissors, scissors: 0
Scissors, rock: -1

You could arrange this in a matrix and analyze the structure of the matrix, but we won’t. It doesn’t apply to our forthcoming setting where the players have infinitely many strategies.

A strategy is a possibly-randomized algorithm (whose inputs are just the data of the game, not including any past history of play) that outputs an action. In some games, the optimal strategy is to choose a single action no matter what your opponent does. This is sometimes called a pure, dominating strategy, not because it dominates your opponent, but because it’s better than all of your other options no matter what your opponent does. The output action is deterministic.

However, as with rock-paper-scissors, the optimal strategy for most interesting games requires each player to act randomly according to a fixed distribution. Such strategies are called mixed or randomized. For rock-paper-scissors, the optimal strategy is to choose rock, paper, and scissors with equal probability.  Computers are only better than humans at rock-paper-scissors because humans are bad at behaving consistently and uniformly random.

The famous minimax theorem says that every two-player zero-sum game has an optimal strategy for each player, which is possibly randomized. This strategy is optimal in the sense that it maximizes your expected winnings no matter what your opponent does. However, if your opponent is playing a particularly suboptimal strategy, the minimax solution might not be as good as a solution that takes advantage of the opponent’s dumb choices. A uniform random rock-paper-scissors strategy is not optimal if your opponent always plays “rock.”  However, the optimal strategy doesn’t need special knowledge or space to store information about past play. If you played against God, you would blindly use the minimax strategy and God would have no upper hand. I wonder if the pope would have excommunicated me for saying that in the 1600’s.

The expected winnings for player 1 when both players play a minimax-optimal strategy is called the value of the game, and this number is unique (even if there are possibly multiple optimal strategies). If a game is symmetric—meaning both players have the same actions and the payoff function is symmetric—then the value is guaranteed to be zero. The game is fair.

The version of the minimax theorem that most people use (in particular, the version that often comes up in theoretical computer science) shows that finding an optimal strategy is equivalent to solving a linear program. This is great because it means that any such (finite) game is easy to solve. You don’t need insight; just compile and run. The minimax theorem is also true for sufficiently well-behaved continuous action spaces. The silent duel is well-behaved, so our goal is to compute an explicit, easy-to-implement strategy that the minimax theorem guarantees exists. As a side note, here is an example of a poorly-behaved game with no minimax optimum.

While the minimax theorem guarantees optimal strategies and a value, the concept of the “value” of the game has an independent definition:

Let X, Y be finite sets of actions for players 1, 2 respectively, and p(x), q(y) be strategies, i.e., probability distributions over X and Y so that p(x) is the probability that x is chosen. Let \Psi(x, y) be the payoff function for the game. The value of the game is a real number v such that there exist two strategies p, q with the two following properties. First, for every fixed y \in Y,

\displaystyle \sum_{x \in X} p(x) \Psi(x, y) \geq v

(no matter what player 2 does, player 1’s strategy guarantees at least v payoff), and for every fixed x \in X,

\displaystyle \sum_{y \in Y} q(y) \Psi(x, y) \leq v

(no matter what player 1 does, player 2’s strategy prevents a loss of more than v).

Since silent duels are continuous, Restrepo opens the paper with the corresponding definition for continuous games. Here a probability distribution is the same thing as a “positive measure with total measure 1.” Restrepo uses F and G for the strategies, and the corresponding statement of expected payoff for player 1 is that, for all fixed actions y \in Y,

\displaystyle \int \Psi(x, y) dF(x) \geq v

And likewise, for all x \in X,

\displaystyle \int \Psi(x, y) dG(y) \leq v

All of this background gets us through the very first paragraph of the Restrepo paper. As I elaborate in my book, this is par for the course for math papers, because written math is optimized for experts already steeped in the context. Restrepo assumes the reader knows basic game theory so we can get on to the details of his construction, at which point he slows down considerably to focus on the details.

Description of the Optimal Strategies

Starting in section 2, Restrepo describes the construction of the optimal strategy, but first he explains the formal details of the setting of the game. We already know the two players are taking n and m actions between 0 \leq t \leq 1, but we also fix the probability of success. Player 1 knows a distribution P(t) on [0,1] for which P(t) is the probability of success when acting at time t. Likewise, player 2 has a possibly different distribution Q(t), and (crucially) P(t), Q(t) both increase continuously on [0,1]. (In section 3 he clarifies further that P satisfies P(0) = 0, P(1) = 1, and P'(t) > 0, likewise for Q(t).) Moreover, both players know both P, Q. One could say that each player has an estimate of their opponent’s firing accuracy, and wants to be optimal compared to that estimate.

The payoff function \Psi(x, y) is defined informally as: 1 if Player one succeeds before Player 2, -1 if Player 2 succeeds first, and 0 if both players exhaust their actions before the end and none succeed. Though Restrepo does not state it, if the players act and succeed at the same time—say both players fire at time t=1—the payoff should also be zero. We’ll see how this is converted to a more formal (and cumbersome!) mathematical definition in a future post.

Next we’ll describe the statement of the fully general optimal strategy (which will be essentially meaningless, but have some notable features we can infer information from), and get a sneak peek at how to build this strategy algorithmically. Then we’ll see a simplified example of the optimal strategy.

The optimal strategy presented depends only on the values n, m (the number of actions each player gets) and their success probability distributions P, Q. For player 1, the strategy splits up [0,1] into subintervals

\displaystyle [a_i, a_{i+1}] \qquad 0 < a_1 < a_2, < \cdots < a_n < a_{n+1} = 1

Crucially, this strategy ignores the initial interval [0, a_1]. In each other subinterval Player 1 attempts an action at a time chosen by a probability distribution specific to that interval, independently of previous attempts. But no matter what, there is some initial wait time during which no action will ever be taken. This makes sense: if player 1 fired at time 0, it is a guaranteed wasted shot. Likewise, firing at time 0.000001 is basically wasted (due to continuity, unless P(t) is obnoxiously steep early on).

Likewise for player 2, the optimal strategy is determined by numbers b_1, \dots, b_m resulting in m intervals [b_j, b_{j+1}] with b_{m+1} = 1.

The difficult part of the construction is describing the distributions dictating when a player should act during an interval. It’s difficult because an interval for player 1 and player 2 can overlap partially. Maybe a_2 = 0.5, a_3 = 0.75 and b_1 = 0.25, b_2 = 0.6. Player 1 knows that Player 2 (using their corresponding minimax strategy) must act before time t = 0.6, and gets another chance after that time. This suggests that the distribution determining when Player 1 should act within [a_2, a_3] may have a discontinuous jump at t = 0.6.

Call F_i the distribution for Player 1 to act in the interval [a_i, a_{i+1}]. Since it is a continuous distribution, Restrepo uses F_i for the cumulative distribution function and dF_i for the probability density function. Then these functions are defined by (note this should be mostly meaningless for the moment)

\displaystyle dF_i(x_i) = \begin{cases} h_i f^*(x_i) dx_i & \textup{ if } a_i < x_i < a_{i+1} \\ 0 & \textup{ if } x_i \not \in [a_i, a_{i+1}] \\ \end{cases}

where f^* is defined as

\displaystyle f^*(t) = \prod_{b_j > t} \left [ 1 - Q(b_j) \right ] \frac{Q'(t)}{Q^2(t) P(t)}.

The constants h_i and h_{i+1} are related by the equation

\displaystyle h_i = [1 - D_i] h_{i+1},

where

\displaystyle D_i = \int_{a_i}^{a_{i+1}} P(t) dF_i(t)

What can we glean from this mashup of symbols? The first is that (obviously) the distribution is zero outside the interval [a_i, a_{i+1}]. Within it, there is this mysterious h_i that is related to the h_{i+1} used to define the next interval’s probability. This suggests we will likely build up the strategy in reverse starting with F_n as the “base case” (if n=1, then it is the only one).

Next, we notice the curious definition of f^*. It unsurprisingly requires knowledge of both P and Q, but the coefficient is strangely chosen: it’s a product over all failure probabilities (1 - Q(b_j)) of all interval-starts happening later for the opponent.

[Side note: it’s very important that this is a constant; when I first read this, I thought that it was \prod_{b_j > t}[1 - Q(t)], which makes the eventual task of integrating f^* much harder.]

Finally, the last interval (the one ending at t=1) may include the option to simply “wait for a guaranteed hit,” which Restrepo calls a “discrete mass of \alpha at t=1.” That is, F_n may have a different representation than the rest. Indeed, at the end of the paper we will find that Restrepo gives a base-case definition for h_n that allows us to bootstrap the construction.

Player 2’s strategy is the same as Player 1’s, but replacing the roles of P, Q, n, m, a_i, b_j in the obvious way.

The symmetric example

As with most math research, the best way to parse a complicated definition or construction is to simplify the different aspects of the problem until they become tractable. One way to do this is to have only a single action for both players, with P = Q. Restrepo provides a more general example to demonstrate, which results in the five most helpful lines in the paper. I’ll reproduce them here verbatim:

EXAMPLE. Symmetric Game: P(t) = Q(t), and n = m. In this case the two
players have the same optimal strategies; \alpha = 0, and a_k = b_k, k=1, \dots, n. Furthermore

\displaystyle \begin{aligned} P(a_{n-k}) &= \frac{1}{2k+3} & k = 0, 1, \dots, n-1, \\ dF_{n-k}(t) &= \frac{1}{4(k+1)} \frac{P'(t)}{P^3(t)} dt & a_{n-k} < t < a_{n-k+1}. \end{aligned}

Saying \alpha = 0 means there is no “wait until t=1 to guarantee a hit”, which makes intuitive sense. You’d only want to do that if your opponent has exhausted all their actions before the end, which is only likely to happen if they have fewer actions than you do.

When Restrepo writes P(a_{n-k}) = \frac{1}{2k+3}, there are a few things happening. First, we confirm that we’re working backwards from a_n. Second, he’s implicitly saying “choose a_{n-k} such that P(a_{n-k}) has the desired cumulative density.” After a bit of reflection, there’s no other way to specify the a_i except implicitly: we don’t have a formula for P to lean on.

Finally, the definition of the density function dF_{n-k}(t) helps us understand under what conditions the probability function would be increasing or decreasing from the start of the interval to the end. Looking at the expression P'(t) / P^3(t), we can see that polynomials will result in an expression dominated by 1/t^k for some k, which is decreasing. By taking the derivative, an increasing density would have to be built from a P satisfying P''(t) P(t) - 3(P'(t))^2 > 0. However, I wasn’t able to find any examples that satisfy this. Polynomials, square roots, logs and exponentials, all seem to result in decreasing density functions.

Finally, we’ll plot two examples. The first is the most reductive: P(t) = Q(t) = t, and n = m = 1. In this case n=1, and there is only one term k=0, for which a_n = 1/3. Then dF_1(t) = 1/4t^3. (For verification, note the integral of dF_1 on [1/3, 1] is indeed 1).

restrepo-1-over-4tcubed.png

With just one action and P(t) = Q(t) = t, the region before t=1/3 has zero probability, and the probability decreases from 6.75 to 1/4.

Note that the reason a_n = 1/3 is so nice is that P(t) is so simple. If P(t) were, say, t^2, then a_n should shift to being \sqrt{1/3}. If P(t) were more complicated, we’d have to invert it (or use an approximate search) to find the location a_n for which P(a_n) = 1/3.

Next, we loosen the example to let n=m=4, still with P(t) = Q(t) = t. In this case, we have the same final interval [1/3,1]. The new actions all occur in the time before t=1/3, in the intervals [1/5, 1/3], [1/7, 1/5], [1/9,1/7]. If there were more actions, we’d get smaller inverse-of-odd-spaced intervals approaching zero. The probability densities are now steeper versions of the same 1/4t^3, with the constant getting smaller to compensate for the fact that 1/t^3 gets larger and maintain the normalized distribution. For example, the earliest interval results in \int_{1/9}^{1/7} \frac{1}{16t^3} dt = 1. Closer to zero the densities are somewhat shallower compared to the size of the interval; for example in [1/9, 1/7], the density toward the beginning of the interval is only about twice as large as the density toward the end.

restrepo-four-actions.png

The combination of the four F_i’s for the four intervals in which actions are taken. This is a complete description of the optimal strategy for our simple symmetric version of the silent duel.

Since the early intervals are getting smaller and smaller as we add more actions, the optimal strategy will resemble a burst of action at the beginning, gradually tapering off as the accuracy increases and we work through our budget. This is an explicit tradeoff between the value of winning (lots of early, low probability attempts) and keeping some actions around for the end where you’re likely to succeed.

Next step: get to the example from the general theorem

At this point, we’ve parsed the general statement of the theorem, and while much of it is still mysterious, we extracted some useful qualitative information from the statement, and tinkered with some simple examples.

At this point, I have confidence that the simple symmetric example Restrepo provided is correct; it passed some basic unit tests, like that each dF_i is normalized. My next task in fully understanding the paper is to be able to derive the symmetric example from the general construction. We’ll do this next time, and include a program that constructs the optimal solution for any input.

Until then!