Last time we discussed the setup for the silent duel problem: two players taking actions in $[0,1]$, player 1 gets $n$ chances to act, player 2 gets $m$, and each knows their probability of success when they act.
The solution is in a paper of Rodrigo Restrepo from the 1950s. In this post I’ll start detailing how I study this paper, and talk through my thought process for approaching a bag of theorems and proofs. If you want to follow along, I re-typeset the paper on GitHub.
Game Theory Basics
The Introduction starts with a summary of the setting of game theory. I remember most of this so I will just summarize the basics of the field. Skip ahead if you already know what the minimax theorem is, and what I mean when I say the “value” of a game.
A two-player game consists of a set of actions for each player—which may be finite or infinite, and need not be the same for both players—and a payoff function for each possible choice of actions. The payoff function is interpreted as the “utility” that player 1 gains and player 2 loses. If the payoff is negative, you interpret it as player 1 losing utility to player 2. Utility is just a fancy way of picking a common set of units for what each player treasures in their heart of hearts. Often it’s stated as money and we assume both players value cash the same way. Games in which the utility is always “one player gains exactly the utility lost by the other player” are called zero-sum.
With a finite set of actions, the payoff function is a table. For rock-paper-scissors the table is:
Rock, paper: -1
Rock, scissors: 1
Rock, rock: 0
Paper, paper: 0
Paper, scissors: -1
Paper, rock: 1
Scissors, paper: 1
Scissors, scissors: 0
Scissors, rock: -1
You could arrange this in a matrix and analyze the structure of the matrix, but we won’t. It doesn’t apply to our forthcoming setting where the players have infinitely many strategies.
A strategy is a possibly-randomized algorithm (whose inputs are just the data of the game, not including any past history of play) that outputs an action. In some games, the optimal strategy is to choose a single action no matter what your opponent does. This is sometimes called a pure, dominating strategy, not because it dominates your opponent, but because it’s better than all of your other options no matter what your opponent does. The output action is deterministic.
However, as with rock-paper-scissors, the optimal strategy for most interesting games requires each player to act randomly according to a fixed distribution. Such strategies are called mixed or randomized. For rock-paper-scissors, the optimal strategy is to choose rock, paper, and scissors with equal probability. Computers are only better than humans at rock-paper-scissors because humans are bad at choosing consistently and uniformly at random.
The famous minimax theorem says that every finite two-player zero-sum game has an optimal strategy for each player, which is possibly randomized. This strategy is optimal in the sense that it maximizes the expected winnings you can guarantee no matter what your opponent does. However, if your opponent is playing a particularly suboptimal strategy, the minimax solution might not be as good as a solution that takes advantage of the opponent’s dumb choices. A uniform random rock-paper-scissors strategy is not optimal if your opponent always plays “rock.” However, the optimal strategy doesn’t need special knowledge or space to store information about past play. If you played against God, you would blindly use the minimax strategy and God would have no upper hand. I wonder if the pope would have excommunicated me for saying that in the 1600s.
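To make the “always plays rock” remark concrete, here is a minimal sketch that computes expected payoffs for rock-paper-scissors using the payoff table above. The helper names (`payoff`, `expected_payoff`) are my own, chosen for illustration.

```python
from fractions import Fraction

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(a1, a2):
    """Payoff to player 1: +1 for a win, -1 for a loss, 0 for a tie."""
    if a1 == a2:
        return 0
    return 1 if BEATS[a1] == a2 else -1


def expected_payoff(strategy1, strategy2):
    """Expected payoff to player 1 when both players mix independently."""
    return sum(
        p1 * p2 * payoff(a1, a2)
        for a1, p1 in strategy1.items()
        for a2, p2 in strategy2.items()
    )


uniform = {a: Fraction(1, 3) for a in ACTIONS}
always_rock = {"rock": Fraction(1)}
always_paper = {"paper": Fraction(1)}

print(expected_payoff(uniform, always_rock))       # 0
print(expected_payoff(always_paper, always_rock))  # 1
```

The uniform strategy earns exactly zero against every opponent, which is why it is safe, while “always paper” earns more against this particular opponent but can be exploited by anyone who notices.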
The expected winnings for player 1 when both players play a minimax-optimal strategy is called the value of the game, and this number is unique (even if there are possibly multiple optimal strategies). If a game is symmetric (meaning both players have the same actions, and swapping the players’ roles negates the payoff), then the value is guaranteed to be zero. The game is fair.
The version of the minimax theorem that most people use (in particular, the version that often comes up in theoretical computer science) shows that finding an optimal strategy is equivalent to solving a linear program. This is great because it means that any such (finite) game is easy to solve. You don’t need insight; just compile and run. The minimax theorem is also true for sufficiently well-behaved continuous action spaces. The silent duel is well-behaved, so our goal is to compute an explicit, easy-to-implement strategy that the minimax theorem guarantees exists. As a side note, here is an example of a poorly-behaved game with no minimax optimum.
While the minimax theorem guarantees optimal strategies and a value, the concept of the “value” of the game has an independent definition:
Let $X, Y$ be finite sets of actions for players 1, 2 respectively, and $\sigma, \gamma$ be strategies, i.e., probability distributions over $X$ and $Y$ so that $\sigma(x)$ is the probability that $x$ is chosen. Let $\Psi(x, y)$ be the payoff function for the game. The value of the game is a real number $v$ such that there exist two strategies $\sigma, \gamma$ with the two following properties. First, for every fixed $y \in Y$,
$$\sum_{x \in X} \sigma(x) \Psi(x, y) \geq v$$
(no matter what player 2 does, player 1’s strategy guarantees at least $v$ payoff), and for every fixed $x \in X$,
$$\sum_{y \in Y} \gamma(y) \Psi(x, y) \leq v$$
(no matter what player 1 does, player 2’s strategy prevents a loss of more than $v$).
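For a finite game, the two properties in this definition can be checked mechanically. Below is a small sketch, assuming the payoff is stored as a nested dictionary and each strategy is a dictionary of probabilities; the function name `certifies_value` and the data layout are my own choices, not anything from Restrepo.

```python
from fractions import Fraction


def certifies_value(payoff, sigma, gamma, v):
    """Check both inequalities of the definition for a finite game.

    payoff[x][y] is the payoff to player 1; sigma and gamma map actions
    to probabilities for players 1 and 2 respectively.
    """
    rows = list(payoff)
    cols = list(payoff[rows[0]])
    # Player 1's guarantee: against every fixed y, expected payoff >= v.
    lower_ok = all(
        sum(sigma[x] * payoff[x][y] for x in rows) >= v for y in cols
    )
    # Player 2's guarantee: against every fixed x, expected payoff <= v.
    upper_ok = all(
        sum(gamma[y] * payoff[x][y] for y in cols) <= v for x in rows
    )
    return lower_ok and upper_ok


rps = {
    "rock": {"rock": 0, "paper": -1, "scissors": 1},
    "paper": {"rock": 1, "paper": 0, "scissors": -1},
    "scissors": {"rock": -1, "paper": 1, "scissors": 0},
}
third = Fraction(1, 3)
uniform = {"rock": third, "paper": third, "scissors": third}
print(certifies_value(rps, uniform, uniform, 0))  # True
```

Note that this only verifies a proposed value and pair of strategies; finding them in the first place is the job of the linear program mentioned earlier.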
Since silent duels are continuous, Restrepo opens the paper with the corresponding definition for continuous games. Here a probability distribution is the same thing as a “positive measure with total measure 1.” Restrepo uses $F$ and $G$ for the strategies, and the corresponding statement of expected payoff for player 1 is that, for all fixed actions $y$,
$$\int \Psi(x, y) \, dF(x) \geq v.$$
And likewise, for all $x$,
$$\int \Psi(x, y) \, dG(y) \leq v.$$
All of this background gets us through the very first paragraph of the Restrepo paper. As I elaborate in my book, this is par for the course for math papers, because written math is optimized for experts already steeped in the context. Restrepo assumes the reader knows basic game theory so we can get on to the details of his construction, at which point he slows down considerably to focus on the details.
Description of the Optimal Strategies
Starting in section 2, Restrepo describes the construction of the optimal strategy, but first he explains the formal details of the setting of the game. We already know the two players are taking $n$ and $m$ actions between $0 \leq t \leq 1$, but we also fix the probability of success. Player 1 knows a distribution $P(t)$ on $[0,1]$ for which $P(t)$ is the probability of success when acting at time $t$. Likewise, player 2 has a possibly different distribution $Q(t)$, and (crucially) $P(t), Q(t)$ both increase continuously on $[0,1]$. (In section 3 he clarifies further that $P$ satisfies $P(0) = 0$, $P(1) = 1$, and $P'(t) > 0$, likewise for $Q(t)$.) Moreover, both players know both $P, Q$. One could say that each player has an estimate of their opponent’s firing accuracy, and wants to be optimal compared to that estimate.
The payoff function is defined informally as: 1 if Player 1 succeeds before Player 2, -1 if Player 2 succeeds first, and 0 if both players exhaust their actions before the end and neither succeeds. Though Restrepo does not state it, if the players act and succeed at the same time (say both players fire at some common time $t$ and both hit), the payoff should also be zero. We’ll see how this is converted to a more formal (and cumbersome!) mathematical definition in a future post.
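As a warm-up for that formal definition, here is a hedged sketch of the expected payoff in the simplest case, where each player has exactly one action. The names `x`, `y` (the two firing times) and `P`, `Q` (the two success-probability functions) are my labels, and the handling of simultaneous fire is the assumption described above, not something Restrepo states.

```python
def expected_payoff(x, y, P, Q):
    """Expected payoff to player 1 when each player has one action.

    Player 1 acts at time x, player 2 at time y; P and Q give each
    player's success probability as a function of time.
    """
    if x < y:
        # Player 1 acts first; player 2 only matters if player 1 misses.
        return P(x) - (1 - P(x)) * Q(y)
    if x > y:
        # Player 2 acts first; player 1 only matters if player 2 misses.
        return -Q(y) + (1 - Q(y)) * P(x)
    # Simultaneous action (assumed rule): both hitting or both missing
    # pays 0, so only the one-sided outcomes contribute.
    return P(x) * (1 - Q(x)) - Q(x) * (1 - P(x))


def linear(t):
    """Hypothetical accuracy function: success probability equals time."""
    return t


print(expected_payoff(0.25, 0.75, linear, linear))  # -0.3125
print(expected_payoff(0.75, 0.25, linear, linear))  # 0.3125
```

With identical accuracy functions, swapping the two firing times negates the payoff, which is the symmetry we expect from a fair game.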
Next we’ll describe the statement of the fully general optimal strategy (which will be essentially meaningless, but have some notable features we can infer information from), and get a sneak peek at how to build this strategy algorithmically. Then we’ll see a simplified example of the optimal strategy.
The optimal strategy presented depends only on the values $n, m$ (the number of actions each player gets) and their success probability distributions $P, Q$. For player 1, the strategy splits up $[0,1]$ into subintervals
$$[a_1, a_2], \; [a_2, a_3], \; \dots, \; [a_n, a_{n+1}], \qquad 0 < a_1 < a_2 < \cdots < a_n < a_{n+1} = 1.$$
Crucially, this strategy ignores the initial interval $[0, a_1]$. In each other subinterval $[a_i, a_{i+1}]$ Player 1 attempts an action at a time chosen by a probability distribution specific to that interval, independently of previous attempts. But no matter what, there is some initial wait time during which no action will ever be taken. This makes sense: if player 1 fired at time 0, it is a guaranteed wasted shot. Likewise, firing at time 0.000001 is basically wasted (due to continuity, unless $P(t)$ is obnoxiously steep early on).
Likewise for player 2, the optimal strategy is determined by numbers $b_1, \dots, b_m$ resulting in $m$ intervals $[b_j, b_{j+1}]$ with $b_{m+1} = 1$.
The difficult part of the construction is describing the distributions dictating when a player should act during an interval. It’s difficult because an interval for player 1 and player 2 can overlap partially. Maybe one of Player 2’s endpoints $b_j$ falls strictly inside Player 1’s interval, so that $a_i < b_j < a_{i+1}$. Player 1 knows that Player 2 (using their corresponding minimax strategy) must act before time $b_j$, and gets another chance after that time. This suggests that the distribution determining when Player 1 should act within $[a_i, a_{i+1}]$ may have a discontinuous jump at $b_j$.
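Call $F^i$ the distribution for Player 1 to act in the interval $[a_i, a_{i+1}]$. Since it is a continuous distribution, Restrepo uses $F^i$ for the cumulative distribution function and $dF^i$ for the probability density function. Then these functions are defined by (note this should be mostly meaningless for the moment)
$$dF^i(x_i) = \begin{cases} h_i f^*(x_i) \, dx_i & \text{if } a_i < x_i < a_{i+1} \\ 0 & \text{if } x_i \notin [a_i, a_{i+1}] \end{cases}$$
where $f^*$ is defined as
$$f^*(t) = \prod_{b_j > t} [1 - Q(b_j)] \frac{Q'(t)}{Q^2(t) P(t)}.$$
The constants $h_i$ and $h_{i+1}$ are related by the equation
$$h_i = [1 - D^i] h_{i+1}, \qquad \text{where } D^i = \int_{a_i}^{a_{i+1}} P(t) \, dF^i(t).$$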
What can we glean from this mashup of symbols? The first is that (obviously) the distribution is zero outside the interval $[a_i, a_{i+1}]$. Within it, there is this mysterious $h_i$ that is related to the $h_{i+1}$ used to define the next interval’s probability. This suggests we will likely build up the strategy in reverse starting with $F^n$ as the “base case” (if $n = 1$, then it is the only one).
Next, we notice the curious definition of $f^*$. It unsurprisingly requires knowledge of both $P$ and $Q$, but the coefficient is strangely chosen: it’s a product over all failure probabilities ($1 - Q(b_j)$) of all interval-starts happening later for the opponent.
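To see how that coefficient behaves, here is a rough sketch of evaluating $f^*$ numerically. It assumes $f^*(t)$ is the product of $1 - Q(b_j)$ over all $b_j > t$, multiplied by $Q'(t) / (Q(t)^2 P(t))$; the function signature, the argument `dQ` for the derivative of $Q$, and the sample interval starts are all hypothetical choices of mine.

```python
def f_star(t, P, Q, dQ, b):
    """Evaluate f*(t) for player 1 under the assumed formula.

    P, Q are the success probabilities, dQ is the derivative of Q, and
    b is the list of player 2's interval starts b_1 < ... < b_m.
    """
    coefficient = 1.0
    for b_j in b:
        if b_j > t:
            # Probability the opponent misses at a later interval start.
            coefficient *= 1 - Q(b_j)
    return coefficient * dQ(t) / (Q(t) ** 2 * P(t))


def identity(t):
    return t


def one(t):
    return 1.0


starts = [1 / 5, 1 / 3]  # made-up interval starts for player 2
print(f_star(0.5, identity, identity, one, starts))   # no later starts
print(f_star(0.25, identity, identity, one, starts))  # one later start
```

The coefficient only changes when `t` crosses one of the `b_j`, so between consecutive interval starts it is a plain constant multiplying a smooth function.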
[Side note: it’s very important that this coefficient is a constant between consecutive $b_j$; when I first read this, I thought the factors were $1 - Q(t)$, which makes the eventual task of integrating $f^*$ much harder.]
Finally, the last interval (the one ending at $t = 1$) may include the option to simply “wait for a guaranteed hit,” which Restrepo calls a “discrete mass of $\alpha$ at $t = 1$.” That is, $F^n$ may have a different representation than the rest. Indeed, at the end of the paper we will find that Restrepo gives a base-case definition for $h_n$ that allows us to bootstrap the construction.
Player 2’s strategy is the same as Player 1’s, but swapping the roles of $P$ and $Q$, $n$ and $m$, and $a_i$ and $b_j$ in the obvious way.
The symmetric example
As with most math research, the best way to parse a complicated definition or construction is to simplify the different aspects of the problem until they become tractable. One way to do this is to have only a single action for both players, with $P(t) = Q(t) = t$. Restrepo provides a more general example to demonstrate, which results in the five most helpful lines in the paper. I’ll reproduce them here verbatim:
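EXAMPLE. Symmetric Game: $P(t) = Q(t)$ and $m = n$. In this case the two
players have the same optimal strategies; $\alpha = \beta = 0$, and $a_k = b_k$, $k = 1, \dots, n$. Furthermore
$$P(a_{n-k}) = \frac{1}{2k+3}, \qquad k = 0, 1, \dots, n-1,$$
$$dF^{n-k}(t) = \frac{1}{4(k+1)} \frac{P'(t)}{P^3(t)} \, dt \qquad \text{for } a_{n-k} < t < a_{n-k+1}.$$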
Saying $\alpha = \beta = 0$ means there is no “wait until $t = 1$ to guarantee a hit”, which makes intuitive sense. You’d only want to do that if your opponent has exhausted all their actions before the end, which is only likely to happen if they have fewer actions than you do.
When Restrepo writes $P(a_{n-k}) = \frac{1}{2k+3}$, there are a few things happening. First, we confirm that we’re working backwards from $a_n$. Second, he’s implicitly saying “choose $a_{n-k}$ such that $P(a_{n-k})$ has the desired cumulative density.” After a bit of reflection, there’s no other way to specify the $a_i$ except implicitly: we don’t have a formula for $P$ to lean on.
The definition of the density function also helps us understand under what conditions the density would be increasing or decreasing from the start of the interval to the end. Looking at the expression $P'(t) / P^3(t)$, we can see that polynomials will result in an expression dominated by $1/t^k$ for some $k$, which is decreasing. By taking the derivative, an increasing density would have to be built from a $P$ satisfying $P''(t) P(t) > 3 (P'(t))^2$. However, I wasn’t able to find any examples that satisfy this. Polynomials, square roots, logs and exponentials, all seem to result in decreasing density functions.
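Finally, we’ll plot two examples. The first is the most reductive: $n = m = 1$, and $P(t) = Q(t) = t$. In this case the only index is $k = 0$, and there is only one term $a_1$, for which $P(a_1) = a_1 = 1/3$. Then $dF^1(t) = \frac{1}{4t^3} \, dt$. (For verification, note the integral of $\frac{1}{4t^3}$ on $[1/3, 1]$ is indeed 1.)
Note that the reason $a_1$ is so nice is that $P(t)$ is so simple. If $P(t)$ were, say, $t^2$, then $a_1$ should shift to being $\sqrt{1/3}$. If $P(t)$ were more complicated, we’d have to invert it (or use an approximate search) to find the location $a_1$ for which $P(a_1) = 1/3$.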
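The “approximate search” can be as simple as bisection, since the accuracy function is continuous and increasing. Here is a minimal sketch; the function name `invert_increasing` and the tolerance are arbitrary choices of mine, and the target of one third comes from the symmetric example with a single action.

```python
def invert_increasing(P, target, lo=0.0, hi=1.0, tolerance=1e-12):
    """Find t in [lo, hi] with P(t) = target, for continuous increasing P."""
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if P(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def squared(t):
    return t ** 2


a_1 = invert_increasing(squared, 1 / 3)
print(a_1)  # close to the square root of 1/3
```

Bisection is slow compared to Newton’s method, but it needs no derivative and cannot escape the interval, which seems like the right tradeoff for a quantity computed once per interval.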
Next, we loosen the example to let $m = n = 4$, still with $P(t) = Q(t) = t$. In this case, we have the same final interval $[1/3, 1]$. The new actions all occur in the time before $1/3$, in the intervals $[1/5, 1/3], [1/7, 1/5], [1/9, 1/7]$. If there were more actions, we’d get smaller inverse-of-odd-spaced intervals approaching zero. The probability densities are now steeper versions of the same $\frac{1}{4t^3}$, with the constant $\frac{1}{4(k+1)}$ getting smaller to compensate for the fact that $\frac{1}{t^3}$ gets larger and maintain the normalized distribution. For example, the earliest interval results in $\int_{1/9}^{1/7} \frac{1}{16 t^3} \, dt = 1$. Closer to zero the densities are somewhat shallower compared to the size of the interval; for example in $[1/9, 1/7]$ the density toward the beginning of the interval is only about twice as large as the density toward the end.
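These normalization claims are easy to check exactly with rational arithmetic. The sketch below assumes the symmetric example’s formulas with linear accuracy, so that the interval indexed by $k$ runs from $1/(2k+3)$ to $1/(2k+1)$ and carries the density $\frac{1}{4(k+1)} \cdot \frac{1}{t^3}$; the function names are mine.

```python
from fractions import Fraction


def symmetric_intervals(n):
    """Intervals and density constants for P(t) = Q(t) = t with n actions.

    Returns (start, end, c) triples, earliest interval first, where the
    density on [start, end] is c / t^3.
    """
    result = []
    for k in range(n - 1, -1, -1):
        start = Fraction(1, 2 * k + 3)
        end = Fraction(1, 2 * k + 1)
        c = Fraction(1, 4 * (k + 1))
        result.append((start, end, c))
    return result


def total_mass(start, end, c):
    """Exact integral of c / t^3 over [start, end]."""
    return c / 2 * (1 / start ** 2 - 1 / end ** 2)


for start, end, c in symmetric_intervals(4):
    print(start, end, c, total_mass(start, end, c))
```

Every interval integrates to exactly 1 because $(2k+3)^2 - (2k+1)^2 = 8(k+1)$, which cancels the constant.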
Since the early intervals are getting smaller and smaller as we add more actions, the optimal strategy will resemble a burst of action at the beginning, gradually tapering off as the accuracy increases and we work through our budget. This is an explicit tradeoff between the value of winning (lots of early, low probability attempts) and keeping some actions around for the end where you’re likely to succeed.
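To get a feel for that burst-then-taper behavior, one can sample action times from the symmetric strategy. This is only a sketch for the special case of linear accuracy with equal budgets, assuming the densities from Restrepo’s example and independent draws in each interval; it is not the general construction. It uses inverse-CDF sampling, and the function name is mine.

```python
import random


def sample_action_times(n, rng=random):
    """Draw one action time per interval for P(t) = Q(t) = t, n actions each.

    Uses inverse-CDF sampling on the density c / t^3 with c = 1/(4(k+1))
    on the interval [1/(2k+3), 1/(2k+1)].
    """
    times = []
    for k in range(n - 1, -1, -1):
        u = rng.random()
        # Solving CDF(t) = u gives 1/t^2 = (2k+3)^2 - 8(k+1)u.
        t = ((2 * k + 3) ** 2 - 8 * (k + 1) * u) ** -0.5
        times.append(t)
    return times


print(sample_action_times(4))
```

Running this a few times shows three of the four actions landing before time $1/3$, with the last one spread over the remaining two thirds of the game.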
Next step: get to the example from the general theorem
At this point, we’ve parsed the general statement of the theorem, and while much of it is still mysterious, we extracted some useful qualitative information from the statement, and tinkered with some simple examples.
I now have confidence that the simple symmetric example Restrepo provided is correct; it passed some basic unit tests, like that each $dF^i$ is normalized. My next task in fully understanding the paper is to be able to derive the symmetric example from the general construction. We’ll do this next time, and include a program that constructs the optimal solution for any input.