Silent Duels and an Old Paper of Restrepo

Silent Duels—Parsing the Construction

Last time we waded into Restrepo’s silent duel paper. You can see the original and my re-typeset version on Github along with all of the code in this series. We digested Section 2 and a bit further, plotting some simplified examples of the symmetric version of the game.

I admit, this paper is not an easy read. Sections 3-6 present the central definitions, lemmas, and proofs. They’re a slog of dense notation with no guiding intuition. The symbols don’t even get catchy names or “think of this as” explanations! I think this disparity in communication has something to do with the period in which the paper was written, and something to do with the infancy of computing at the time. The paper was published in 1957, which marked the year IBM switched from vacuum tubes to transistors, long before Moore’s Law was even a twinkle in Gordon Moore’s eye. We can’t blame Restrepo for not appealing to our modern sensibilities.

I spent an embarrassing amount of time struggling through Sections 3-6 when I still didn’t really understand the form of the optimal strategy. It’s not until the very *end* of the paper (Section 8, the proof of Theorem 1) that we get a construction. See the last post for a detailed description of the data that constitutes the optimal strategy. In brief, it’s a partition of $[0,1]$ into subintervals $[a_i, a_{i+1}]$ with a probability distribution $F_i$ on each interval, and the time you take your $i$-th action is chosen randomly according to $F_i$. Optionally, the last distribution $F_n$ can include a point-mass $\alpha$ at time $t=1$, i.e., “wait until the last moment for a guaranteed hit.”

Section 8 describes how to choose the $a_i$’s and $b_j$’s, with the distributions $F_i$ and $G_j$ built according to the formulas built up in the previous sections.

Since our goal is still to understand how to construct the solution, even if *why* it works is still a mystery, we’ll write a program that implements this algorithm in two posts. First, we’ll work with a simplified symmetric game, where the answer is provided for us as a test case. In a followup post, we’ll rework the code to construct the generic solution, and pick nits about code quality and the finer points of the algorithm Restrepo leaves out.

Ultimately, if what the program outputs matches up with Restrepo’s examples (in lieu of understanding enough of the paper to construct our own), we will declare victory—we’ll have successfully sanity-checked Restrepo’s construction. Then we can move on to studying why this solution works and what caveats hide beneath the math.

The input to the game is a choice of $n$ actions for player 1, $m$ actions for player 2, and probability distributions $P, Q$ for the two players’ success probabilities (respectively). Here’s the algorithm as stated by Restrepo, with the referenced equations following. If you’re following along with my “work through the paper organically” shtick, I recommend you try parsing the text below before reading on. Recall $\alpha$ is the “wait until the end” probability for player 1’s last action, and $\beta$ is the analogous probability for player 2.

Let’s sort through this mess.

First, the broad picture. The whole construction depends on $\alpha, \beta$, these point masses for the players’ final actions. However, there’s this condition that $\alpha \beta = 0$, i.e., at most one can be nonzero. This makes some vaguely intuitive sense: a player with more actions will have extra “to spare,” and so it may make sense for them to wait until the very end to get a guaranteed hit. But only one player can have such an advantage over the other, so only one of the two parameters may be nonzero. That’s my informal justification for $\alpha \beta = 0$.

If we don’t know $\alpha, \beta$ at the beginning, Restrepo’s construction (the choice of $a_i$’s and $b_j$’s) is a *deterministic* function of $\alpha, \beta$, and the other fixed inputs.

The construction asserts that the optimal solution has and we need to find an input such that and they produce as output. We’re doing a search for the “right” output parameters, and using knowledge about the chained relationship of equations to look at the output, and use it to tweak the input to get the output closer to what we want. It’s not gradient descent, but it could probably be rephrased that way.

In particular, consider the case when we get , and the other should be clear. Suppose that starting from we construct all our ‘s and ‘s and get . Then we can try again with , but since is illegal we’ll use for as small a as we need (to make the next sentence true). Restrepo claims that picking something close enough to 1 will *reverse* the output, i.e. will make . He then finishes with (my paraphrase), “obviously are continuous in terms of , so a solution exists with for some choice of ; that’s the optimal solution.” Restrepo is relying on the intermediate value theorem from calculus, but to *find* that value the simplest option is binary search. We have the upper and lower bounds, and , and we know when we found our target: when the output has .
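To make the search idea concrete, here is a minimal bisection sketch on a single parameter. The `mismatch` callable is hypothetical: I’m assuming it runs the construction for a given point mass and returns a signed number that is zero exactly when the outputs line up, that it is continuous, and that it has opposite signs at the two ends of the search range. None of these names come from Restrepo.

```python
def bisect_parameter(mismatch, lo, hi, tol=1e-9, max_iter=200):
    """Find x in [lo, hi] where mismatch(x) is (nearly) zero.

    Assumes mismatch is continuous and has opposite signs at lo and hi,
    which is what the intermediate value theorem argument needs.
    """
    f_lo, f_hi = mismatch(lo), mismatch(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError("mismatch must change sign between lo and hi")

    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = mismatch(mid)
        if abs(f_mid) < tol or hi - lo < tol:
            return mid
        if (f_mid > 0) == (f_lo > 0):
            # The sign change is in the upper half.
            lo, f_lo = mid, f_mid
        else:
            # The sign change is in the lower half.
            hi = mid
    return (lo + hi) / 2


# Toy stand-in for "run the construction and compare the outputs".
# The upper end stops just short of 1, since a point mass of 1 is illegal.
print(bisect_parameter(lambda x: x - 0.3, 0.0, 1.0 - 1e-9))
```

The upper bound sits slightly below 1 for the same reason as in the text: the endpoint itself is not a legal input, so the search only ever evaluates values strictly inside the range.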

This binary search will come back in full swing in the next post, since we already know that the symmetric silent duel starts with $\alpha = \beta = 0$. No search for $\alpha, \beta$ is needed, and we can fix them both to zero for now (or rather, assume the right values are known).

What remains is to determine how to compute the $a_i$’s and $b_j$’s from a starting $\alpha, \beta$. We’ll go through the algorithm step by step using the symmetric game where $P = Q$ (same action success probability) and $n = m$ (same action count) to ground our study. A followup post will revisit these formulas in full generality.

The basic idea of the construction is that we start from a computation of the last action parameters $a_n, b_m$, and use those inductively to compute the parameters of earlier actions via a few integrals and substitutions. In other words, the construction is a recursion, and the interval in which the players take their last action is the base case. As I started writing the programs below, I wanted to give a name to these $a_i$ values. Restrepo seems to refer to them as “parameters” in the paper. I call them *transition times*, since they mark the instants at which a player “transitions” from one action interval to the next.

For a simple probability function , the end of the algorithm results in equations similar to: choose such that .

Recall, Player 1 has a special function used in each step to construct their optimal strategy, called $f^*$ by Restrepo. It’s defined for the non-symmetric game as follows, where recall $Q$ is the opponent’s action probability:

$$f^*(t) = \prod_{b_j > t} \left(1 - Q(b_j)\right) \frac{Q'(t)}{Q^2(t) P(t)}$$

[Note the $Q^2(t)$ is a product $Q(t)Q(t)$; not an iterated function application.]

Here the $\prod_{b_j > t}$ asks us to look at all the transition times computed in previous recursive steps, and compute the product of an action failure at those instants. This is the product $\prod_{b_j > t} (1 - Q(b_j))$. This is multiplied by a mysterious fraction involving $P, Q$, which in the symmetric case reduces to $P'(t) / P^3(t)$. In Python code, computing $f^*$ is given below, called simply “f_star” because I don’t yet understand how to interpret it in a meaningful way. I chose to use SymPy to compute symbolic integrals, derivatives, and solve equations, so in the function below, `prob_fun` is a SymPy `Lambda` and `prob_fun_var` is the SymPy symbol it is evaluated at.

```python
from sympy import diff


def f_star(prob_fun, prob_fun_var, larger_transition_times):
    '''Compute f* as in Restrepo '57.

    In this implementation, we're working in the simplified example
    where P = Q (both players probabilities of success are the same).
    '''
    x = prob_fun_var
    P = prob_fun

    product = 1
    for a in larger_transition_times:
        product *= (1 - P(a))

    return product * diff(P(x), x) / P(x)**3
```

In this symmetric instance of the problem we already know that $\alpha = \beta = 0$ is the optimal solution (Restrepo states that in section 2), so we can fix $\alpha = 0$, and compute $a_n$, which we do next.

In the paper, Restrepo says “compute $a_n$ without parameters in the definition of $f^*$” and this I take to mean, because there are no larger action instants, the product $\prod_{b_j > t} (1 - Q(b_j))$ is empty, i.e., we pass an empty list of `larger_transition_times`. Restrepo does violate this by occasionally referring to $a_{n+1} = 1$ and $b_{m+1} = 1$, but if we included either of these, $1 - P(a_{n+1}) = 1 - P(1) = 0$, and this would make the definition of $f^*$ zero, which would produce a non-distribution, so that can’t be right. This is one of those cases where, when reading a math paper, you have to infer the interpretation that is most sensible, and give the author the benefit of the doubt.

Following the rest of the equations is trivial, except in that we are solving for $a_n$, which is a limit of integration. Since SymPy works symbolically, however, we can simply tell it to integrate without telling it $a_n$, and ask it to solve for $a_n$.

```python
from sympy import Integral
from sympy import Symbol
from sympy.solvers import solve
from sympy.functions.elementary.miscellaneous import Max


def compute_a_n(prob_fun, alpha=0):
    P = prob_fun
    t = Symbol('t0', positive=True)
    a_n = Symbol('a_n', positive=True)

    # Phase 1: integrate with a_n left symbolic, then solve for P(a_n).
    a_n_integral = Integral(
        ((1 + alpha) - (1 - alpha) * P(t)) * f_star(P, t, []), (t, a_n, 1))
    a_n_integrated = a_n_integral.doit()   # yes now "do it"
    P_a_n_solutions = solve(a_n_integrated - 2 * (1 - alpha), P(a_n))
    P_a_n = Max(*P_a_n_solutions)
    print("P(a_n) = %s" % P_a_n)

    # Phase 2: invert P to recover a_n, keeping the solution in (0, 1].
    a_n_solutions = solve(P(a_n) - P_a_n, a_n)
    a_n_solutions_in_range = [soln for soln in a_n_solutions if 0 < soln <= 1]
    assert len(a_n_solutions_in_range) == 1
    a_n = a_n_solutions_in_range[0]
    print("a_n = %s" % a_n)

    # Phase 3: compute the normalizing constant h_n.
    h_n_integral = Integral(f_star(P, t, []), (t, a_n, 1))
    h_n_integrated = h_n_integral.doit()
    h_n = (1 - alpha) / h_n_integrated
    print("h_n = %s" % h_n)

    return (a_n, h_n)
```

There are three phases here. First, we integrate and solve for $a_n$ (blindly according to equation 27 in the paper). If you work out this integral by hand (expanding $f^*$), you’ll notice the integrand is a function of $P(t)$ multiplied by $P'(t)$, which suggests a natural substitution, $u = P(t)$. After computing the integral (entering phase 2), we can maintain that substitution to first solve for $P(a_n)$, say the output of that is some known number $z$ which in the code we call `P_a_n`, and then solve $P(a_n) = z$ for $a_n$. Since that last equation can have multiple solutions, we pick the one between 0 and 1. Since $P(t)$ must be increasing in $[0,1]$, that guarantees uniqueness.
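As a sanity check on the first two phases, here is the computation by hand for the simplest case. I’m assuming $P(t) = t$ and $\alpha = 0$, so the empty product gives $f^*(t) = 1/t^3$, and I’m reading the equation off the code above (the integrand `((1 + alpha) - (1 - alpha) * P(t)) * f_star(...)`, set equal to `2 * (1 - alpha)`) rather than quoting the paper:

```latex
\begin{aligned}
\int_{a_n}^{1} \frac{1 - t}{t^3} \, dt
  &= \left[ -\frac{1}{2t^2} + \frac{1}{t} \right]_{a_n}^{1}
   = \frac{1}{2} + \frac{1}{2a_n^2} - \frac{1}{a_n} = 2 \\
\Longrightarrow \quad 0 &= 3a_n^2 + 2a_n - 1 = (3a_n - 1)(a_n + 1)
\end{aligned}
```

The only root in $(0, 1]$ is $a_n = 1/3$, which is the value the program should report for this choice of $P$.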

Note, we didn’t *need* to maintain the substitution in the integral, and perform a second solve. We could just tell SymPy to solve directly for $a_n$, and it would solve for $P(a_n)$ along the way in addition to computing the integral. But as I was writing, it was helpful for me to check my work in terms of the substitution. In the next post we’ll clean that code up a bit.

Finally, in the third phase we compute the $h_n$, which is a normalizing factor that ultimately ensures the probability distribution for the action in this interval has total probability mass 1.
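To see the normalization in the simplest case without SymPy, here is a small exact-arithmetic check. It assumes $P(t) = t$, $\alpha = 0$, and the last transition time $1/3$, so that f* is $1/t^3$ with antiderivative $-1/(2t^2)$; the helper name `integral_of_f_star` is mine, not something from the paper or the code above.

```python
from fractions import Fraction


def integral_of_f_star(lower, upper):
    """Exact integral of 1/t^3 from lower to upper.

    This is f* in the special case P(t) = t with an empty product.
    """
    def antiderivative(t):
        return Fraction(-1, 2) / (t * t)

    return antiderivative(upper) - antiderivative(lower)


a_n = Fraction(1, 3)
h_n = 1 / integral_of_f_star(a_n, Fraction(1))

# The density on the last interval is h_n * f*(t); its total mass should be 1.
total_mass = h_n * integral_of_f_star(a_n, Fraction(1))
print(h_n, total_mass)  # 1/4 1
```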

The steps to compute the iterated lower action parameters ($a_i$ for $1 \leq i < n$) are similar, but the formulas are slightly different:

Note that the last action instant $a_{i+1}$ and its normalizing constant $h_{i+1}$ show up in the equation to compute $a_i$. In code, this is largely the same as the `compute_a_n` function, but in a loop. Along the way, we print some helpful diagnostics for demonstration purposes. These should end up as unit tests, but as I write the code for the first time I prefer to debug this way. I’m not even sure if I understand the construction well enough to do the math myself and write down unit tests that make sense; the first time I tried I misread the definition of $f^*$ and I filled pages with confounding and confusing integrals!

```python
from collections import deque


def compute_as_and_bs(prob_fun, n, alpha=0):
    '''Compute the a's and b's for the symmetric silent duel.'''
    P = prob_fun
    t = Symbol('t0', positive=True)

    a_n, h_n = compute_a_n(prob_fun, alpha=alpha)

    normalizing_constants = deque([h_n])
    transitions = deque([a_n])
    f_star_products = deque([1, 1 - P(a_n)])

    for step in range(n):
        # prepending new a's and h's to the front of the list
        last_a = transitions[0]
        last_h = normalizing_constants[0]
        next_a = Symbol('a', positive=True)

        next_a_integral = Integral(
            (1 - P(t)) * f_star(P, t, transitions), (t, next_a, last_a))
        next_a_integrated = next_a_integral.doit()
        # print("%s" % next_a_integrated)
        P_next_a_solutions = solve(next_a_integrated - 1 / last_h, P(next_a))
        print("P(a_{n-%d}) is one of %s" % (step + 1, P_next_a_solutions))
        P_next_a = Max(*P_next_a_solutions)

        next_a_solutions = solve(P(next_a) - P_next_a, next_a)
        next_a_solutions_in_range = [
            soln for soln in next_a_solutions if 0 < soln <= 1]
        assert len(next_a_solutions_in_range) == 1
        next_a_soln = next_a_solutions_in_range[0]
        print("a_{n-%d} = %s" % (step + 1, next_a_soln))

        next_h_integral = Integral(
            f_star(P, t, transitions), (t, next_a_soln, last_a))
        next_h = 1 / next_h_integral.doit()
        print("h_{n-%d} = %s" % (step + 1, next_h))

        print("dF_{n-%d} coeff = %s" % (
            step + 1, next_h * f_star_products[-1]))

        f_star_products.append(f_star_products[-1] * (1 - P_next_a))
        transitions.appendleft(next_a_soln)
        normalizing_constants.appendleft(next_h)

    return transitions
```

Finally, we can run it with the simplest possible probability function:

```python
from sympy import Lambda

x = Symbol('x')
compute_as_and_bs(Lambda((x,), x), 3)
```

The output is

```
P(a_n) = 1/3
a_n = 1/3
h_n = 1/4
P(a_{n-1}) is one of [1/5]
a_{n-1} = 1/5
h_{n-1} = 3/16
dF_{n-1} coeff = 1/8
P(a_{n-2}) is one of [1/7]
a_{n-2} = 1/7
h_{n-2} = 5/32
dF_{n-2} coeff = 1/12
P(a_{n-3}) is one of [1/9]
a_{n-3} = 1/9
h_{n-3} = 35/256
dF_{n-3} coeff = 1/16
```

This matches up so far with Restrepo’s example, since $P(t) = t$ gives $a_{n-k} = \frac{1}{2k+3}$, i.e., $\frac{1}{3}, \frac{1}{5}, \frac{1}{7}, \frac{1}{9}$. Since we have the normalizing constants $h_i$, we can also verify the probability distribution for each action aligns with Restrepo’s example. The constant in the point mass function is supposed to be $h_{n-k} \prod_{j > n-k} (1 - P(a_j))$. This is what I printed out as `dF_{n-k}`. In Restrepo’s example, this is expected to be $\frac{1}{4(k+1)}$, which is exactly what’s printed out.

Another example using $P(t) = t^3$:

```
P(a_n) = 1/3
a_n = 3**(2/3)/3
h_n = 1/4
P(a_{n-1}) is one of [-1/3, 1/5]
a_{n-1} = 5**(2/3)/5
h_{n-1} = 3/16
dF_{n-1} coeff = 1/8
P(a_{n-2}) is one of [-1/5, 1/7]
a_{n-2} = 7**(2/3)/7
h_{n-2} = 5/32
dF_{n-2} coeff = 1/12
P(a_{n-3}) is one of [-1/7, 1/9]
a_{n-3} = 3**(1/3)/3
h_{n-3} = 35/256
dF_{n-3} coeff = 1/16
```

One thing to notice here is that the normalizing constants don’t appear to depend on the distribution. Is this a coincidence for this example, or a pattern? I’m not entirely sure.
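One way to see why this might be a pattern rather than a coincidence (a sketch, assuming $P$ is differentiable and increasing with $P(1) = 1$, and using the symmetric form of `f_star` from the code above): every integrand in the construction is a function of $P(t)$ times $P'(t)$, so substituting $u = P(t)$ removes $P$ from the equations entirely.

```latex
\int_{a}^{b} \bigl(1 - P(t)\bigr) \frac{P'(t)}{P(t)^3} \, dt
  = \int_{P(a)}^{P(b)} \frac{1 - u}{u^3} \, du,
\qquad
\int_{a}^{b} \frac{P'(t)}{P(t)^3} \, dt
  = \int_{P(a)}^{P(b)} \frac{du}{u^3}
```

Under those assumptions the equations pin down the values $P(a_i)$ (here $1/3, 1/5, 1/7, 1/9$ in both runs), and each normalizing constant is computed from those values alone. Only the final step, inverting $P$ to recover $a_i$, depends on the choice of $P$.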

Next time we’ll rewrite the code from this post so that it can be used to compute the generic (non-symmetric) solution, see what they can tell us, and from there we’ll start diving into the propositions and proofs.

Until next time!

One might think, “Large optimization problems? That sounds hard!” But it’s actually far from the hardest part of the job. In fact, it’s one of the few exciting parts of the job. The real hard part is getting data. Really, it’s that you get promised data that never materializes, and then you architect your system for features that rot before they ripen.

There’s a classic image of a human acting as if they’re throwing a ball for a dog, and the dog sprints off, only soon to realize the ball was never thrown. The ball is the promise of freshly maintained data, and recently I’ve been the dog.

When you don’t have good data, or you have data that’s bad in a known way, you can always try to design your model to accommodate for the deficiencies. As long as it’s clearly defined, it’s not beyond our reach. The math is fun and challenging, and I don’t *want* to shy away from it. My mathematician’s instinct pulls me left.

My instincts as an engineer pull me right: data issues will ultimately cause unexpected edge cases at the worst moment, and it will fall on me to spend all day debugging for a deadline tomorrow. Data issues lead to more complicated modeling features which further interact with other parts of the model and the broader system in confounding ways. Worst of all, it’s nearly impossible to signal problems to customers who depend on your output. When technical debt is baked into an optimization model as features, it makes explanation much harder. Accepting bad data also requires you write the code in a way that is easy to audit, since you need to audit literally everything. Transparency is good, but it’s tedious to do it generically well, and the returns are not worth it if the end result is, “well we can’t fix the data for two years anyway.”

Though a lot of this technical debt was introduced by predecessors who left the team, I’ve fallen for the mathematical siren’s call a few times. *Go on, just add that slick new constraint. Just mask that misbehavior with a heuristic.* It’s bit back hard and caused months of drag.

These days I’m swinging hard right on the pendulum. Delete half-implemented features that don’t have data to support them. Delete features that don’t have a clear connection to business needs (even if they work). Push back on new feature requests until the data exists. Require a point of contact and an SLO for any data you don’t own. Make speculative features easy to turn on/off (or remove!) without having to redesign the architecture. If it can’t be made easy to remove, don’t add it until you’re sure it will survive.

If you can’t evade bad data, err on the side of strict initial validation, and doing nothing or gracefully degrading service when validation fails. Expose the failures as alerts to the people who own the data (not you), and give the data owners a tool that repeats the validation logic in your system verbatim, so there is no confusion on the criteria for success. When you have this view, almost all of the complexity in your system lies in enabling this generic auditing, alerting, and managing of intricate (but ultimately arbitrary) policy.
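As a rough illustration of that shape, here is a minimal sketch in which one validation function is the single source of truth, and the serving path falls back to the last known good input when validation fails. Everything here is hypothetical: the record fields (`id`, `capacity`), the rule itself, and the `alert` callback are stand-ins, not a description of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationError:
    record_id: str
    message: str


def validate_capacity_records(records):
    """Return a list of ValidationError; an empty list means the data is usable.

    The same function can be handed to the data owners as a standalone
    tool, so both sides check against identical criteria.
    """
    errors = []
    for record in records:
        record_id = str(record.get("id", "<missing id>"))
        capacity = record.get("capacity")
        is_number = isinstance(capacity, (int, float)) and not isinstance(capacity, bool)
        if not is_number:
            errors.append(ValidationError(record_id, "capacity must be a number"))
        elif capacity < 0:
            errors.append(ValidationError(record_id, "capacity must be non-negative"))
    return errors


def choose_input(fresh_records, last_known_good, alert):
    """Use fresh data only if it validates; otherwise alert and degrade."""
    errors = validate_capacity_records(fresh_records)
    if errors:
        alert(errors)           # goes to the people who own the data
        return last_known_good  # graceful degradation: keep serving old data
    return fresh_records
```

The design choice worth noting is that `choose_input` never tries to repair bad records. It either accepts the whole batch or refuses it, which keeps the failure mode easy to explain.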

I like to joke that I don’t have data-intensive applications or problems of scale, but rather *policy*-intensive applications. I haven’t found much insight from other software engineers about how to design and maintain policy-intensive software. Let me know if you have some! The obvious first step is to turn policy code into data. To the extent that we’ve done this, I adore that aspect of our systems. Still, you can’t avoid it when policies need to be encoded in an optimization model.

I do get sad that so much of my time is spent poop-smithing, as I like to say, even though we’re gradually getting better. Our systems need maintenance and care, and strong principles to keep the thicket from overwhelming us. For one, I track net lines of code added, with the goal to have it be net *negative* month over month, new features and all. We’ve kept it up for six months so far. Even our fixit week this week seems unnecessary, given how well our team has internalized paying off technical debt.

Though I do wonder what it’s all for. So Google can funnel the money it saves on datacenter costs into ~~informing people the Earth is flat~~ cat videos? If I didn’t have two particular internal side projects to look forward to—they involve topics I’m very interested in—I’d be bored, and I might succumb to jaded feelings, and I’d need a change. Certain perks and particularly enjoyable colleagues help avoid that. But still, I rarely have time to work on the stimulating projects, and even the teammates I’ve been delegating to often defer it to other priorities.

We let dirty data interfere with our design and architecture, now we’re paying back all that technical debt, and as a consequence there’s no time for our human flourishing.

I should open a math cafe.

If you want to skip ahead to a working solution, and you can run basic python scripts, see the Github repository for this post with details on how to run it and what the output looks like. It’s about 30 lines of code, and maybe 10 of those lines are not obvious. Alternatively, copy/paste the code posted inline in the section “First attempt: regular expressions.”

In this post I’ll guide our hero through the world of text manipulation, explain the options for solving a problem like this, and finally explain how to build the program in the repository from scratch. This article assumes you have access to a machine that has basic programming tools pre-installed on it, such as python and perl.

LaTeX is great, don’t get me wrong, but people who don’t have experience writing computer programs that operate on human input tend to write sloppy LaTeX. They don’t anticipate that they might need to programmatically modify the file because the need was never there before. The fact that many LaTeX compilers are relatively forgiving with syntax errors exacerbates the issue.

The most common way to enter **math mode** is with dollar signs, as in

Now let $\varepsilon > 0$ and set $\delta = 1/\varepsilon$.

For math equations that must be **offset**, one often first learns to use double-dollar-signs, as in

First we claim that $$0 \to a \to b \to c \to 0$$ is a short exact sequence

The specific details that make it hard to find and convert from this delimiter type to another are:

1. Math mode can be broken across lines, but need not be.
2. A simple search and replace for $ would conflict with $$.
3. The fact that the start and end are symmetric means a simple search and replace for $$ fails: you can’t tell whether to replace it with \[ or \] without knowing the context of where it occurs in the document.
4. You can insert a dollar sign in LaTeX using \$ and it will not enter math mode. (I do not solve this problem in my program, but leave it as an exercise to the reader to modify each solution to support this.)
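To see the second and third problems in action, here is a tiny sketch of what a naive search and replace does. The sample sentence is made up for illustration.

```python
text = r"Let $x$ and $$y$$."

# Replace every dollar sign with an opening inline delimiter.
naive = text.replace("$", r"\(")
print(naive)  # Let \(x\( and \(\(y\(\(.
```

Every delimiter becomes an opener, including the ones that should close, and each `$$` is split into two inline openers.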

The first thing most programmers will tell you when you have a text manipulation problem is to use regular expressions (or regex). Regular expressions are text patterns that a program called a regular expression engine uses to find subsets of text in a document. This can often be with the goal of modifying the matched text somehow, but also just to find places where the text occurs to generate a report.

In their basic form, regular expressions are based on a very clean theory called regular languages, which is a kind of grammar equivalent to “structure that can be recognized by a finite state machine.”

[Aside: some folks prefer to distinguish between regular expressions as implemented by software systems (regex) and regular expressions as a representation of a regular language; as it turns out, features added to regex engines make them strictly stronger than what can be represented by the theory of regular languages. In this post I will use “regex” and “regular expressions” both for practical implementations, because programmers and software don’t talk about the theory, and use “regular languages” for the CS theory concept]

The problem is that practical regular expressions are difficult and nit-picky, especially when there are exceptional cases to consider. Even matching something like a date can require a finicky expression that’s hard for humans to read and debug when it is incorrect. A regular expression for a line in a file that contains a date by itself:

^\s*(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\s*$
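For a feel of how this pattern behaves, here is a small sketch using Python’s built-in `re` module. The helper name `is_date_line` and the sample strings are mine.

```python
import re

# The same pattern as above: optional whitespace, a year 19xx or 20xx,
# a month 01-12, and a day 01-31, separated by -, space, / or .
DATE_LINE = re.compile(
    r"^\s*(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\s*$")


def is_date_line(line):
    """True if the whole line is a date in the shape the pattern accepts."""
    return DATE_LINE.match(line) is not None


for sample in ["  2019-03-17 ", "2019-13-17", "Due 2019-03-17", "2019-02-31"]:
    print(repr(sample), is_date_line(sample))
```

The last sample is accepted even though February 31st does not exist: the pattern checks the shape of the text, not the calendar, which is a typical example of the exceptional cases that make these expressions finicky.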

Worse, regular expressions are rife with “gotchas” having to do with escape characters. For example, parentheses are used for something called “capturing”, so you have to use \( to insert a literal parenthesis. If having to use “\$” in LaTeX bothers you, you’ll hate regular expressions.

Another issue comes from history. There was a time when computers only allowed you to edit a file one line at a time. Many programming tools and frameworks were invented during this time that continue to be used today (you may have heard of sed which is a popular regular expression find/replace program—one I use almost daily). These tools struggle to operate on problems that span many lines of a file, because they simply weren’t designed for that. Problem (1) above suggests this might be a problem.

Yet another issue is in slight discrepancies between regex engines. Perl, python, sed, etc., all have slight variations and “nonstandard” features. As all programmers know, every visible behavior of a system will eventually be depended on by some other system.

But the real *core* problem is that regular expressions weren’t really designed for knowing about the context where a match occurs. Regular expressions are designed to be character-at-a-time pattern matching. *[edit: removed an incorrect example]* Over the years, regular expression engines have added features to do such things (which makes them more powerful than the original, formal definition of a regular language, and even more powerful than what parsers can handle!), but the more complicated you make a regular expression, the more likely it’s going to misbehave on odd inputs, and the less likely others can use it without bugs or modification for their particular use case. Software engineers care very much about such things, though mathematicians needing a one-off solution may not.

One redeeming feature of regular expressions is that—by virtue of being so widely used in industry—there are many tools to work with them. Every major programming language has a regular expression engine built in. And many websites help explain how regexes work. regexr.com is one I like to use. Here is an example of using that website to replace offset mathmode delimiters. Note the “Explain” button, which traces the regex engine as it looks for matches.

So applying this to our problem: we can use two regular expressions to solve the problem. I’m using the perl programming language because its regex engine supports multiline matches. All MacOS and linux systems come with perl pre-installed.

` perl -0777 -pe 's/\$\$(.*?)\$\$/\\[\1\\]/gs' < test.tex | perl -0777 -pe 's/\$(.*?)\$/\\(\1\\)/gs' > output.tex`

Now let’s explain, starting with the core expressions being matched and replaced.

s/X/Y/ tells a regex engine to “substitute regex matches of X with Y”. In the first regex X is **\$\$(.*?)\$\$**, which breaks down as

- **\$\$** match two literal dollar signs
- **(** capture a group of characters signified by the regex between here and the matching closing parenthesis
- **.*** zero or more of any character
- **?** looking at the “zero or more” in the previous step, try to match as few possible characters as you can while still making this pattern successfully match something
- **)** stop the capture group, and save it as group **1**
- **\$\$** match two more literal dollar signs

Then Y is the chosen replacement. We’re processing offset mathmode, so we want \[ \]. Y is **\\[\1\\]**, which means

- **\\** a literal backslash
- **[** a literal open bracket
- **\1** the first capture group from the matched expression
- **\\** a literal backslash
- **]** a literal close bracket

All together we have **s/\$\$(.*?)\$\$/\\[\1\\]/**, but then we add the final **s** and **g** characters, which act as configuration. The “s” tells the regex engine to allow the dot **.** to match newlines (so a pattern can span multiple lines) and the “g” tells the regex to apply the substitution globally to every match it sees, as opposed to just the first.

Finally, the full first command is

`perl -0777 -pe 's/\$\$(.*?)\$\$/\\[\1\\]/gs' < test.tex `

This tells perl to read in the entire test.tex file and apply the regex to it. Broken down

- **perl** run perl
- **-0777** read the entire file into one string. If you omit it, perl will apply the regex to each line separately.
- **-p** will make perl automatically “read input and print output” without having to tell it to with a “print” statement
- **e** tells perl to run the following command line argument as a program.
- **< test.tex** tells perl to use the file test.tex as input to the program (as input to the regex engine, in this case).

Then we pipe the output of this first perl command to a second one that does a very similar replacement for inline math mode.

```
<first_perl_command> | perl -0777 -pe 's/\$(.*?)\$/\\(\1\\)/gs'
```

The vertical bar | tells the shell executing the commands to take the output of the first program and feed it as input to the second program, allowing us to chain together sequences of programs. The second command does the same thing as the first, but replacing $ with \( and \). Note, it was crucial we had this second program occur after the offset mathmode regex, since $ would match $$.
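For readers more comfortable in Python, the same two-pass idea can be sketched with the built-in `re` module. This is my translation of the perl one-liners, not the code from the repository; `re.DOTALL` plays the role of the trailing “s” flag, and `re.sub` replaces every match by default, like “g”.

```python
import re


def convert_delimiters(tex):
    """Convert $$...$$ to \\[...\\] and then $...$ to \\(...\\).

    Like the perl version, this ignores escaped dollar signs (\\$).
    """
    # Offset math first, so the inline pass never sees a "$$".
    tex = re.sub(r"\$\$(.*?)\$\$", r"\\[\1\\]", tex, flags=re.DOTALL)
    tex = re.sub(r"\$(.*?)\$", r"\\(\1\\)", tex, flags=re.DOTALL)
    return tex


sample = "Let $x > 0$. Then $$a \\to b\n\\to c$$ is exact."
print(convert_delimiters(sample))
```

The order of the two substitutions matters here for the same reason as in the shell pipeline.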

**Exercise:** Adapt this solution to support Problem (4), support for literal \$ dollar signs. *Hint:* you can either try to upgrade the regular expression to not be tricked into thinking \$ is a delimiter, or you can add extra programs before that prevent \$ from being a problem. *Warning:* this exercise may cause a fit.

It can feel like a herculean accomplishment to successfully apply regular expressions to a problem. You can appreciate the now-classic programmer joke from the webcomic xkcd:

However, as you can tell, getting regular expressions right is hard and takes practice. It’s great when someone else solves your problem exactly, and you can copy/paste for a one-time fix. But debugging regular expressions that *don’t quite work* can be excruciating. There is another way!

While regular expressions have a clean theory of “character by character stateless processing”, they are limited. It’s not possible to express the concept of “memory” in a regular expression, and the simplest example of this is the problem of counting. Suppose you want to find strings that constitute valid, balanced parentheticals. E.g., this is balanced:

(hello (()there)() wat)

But this is not

(hello ((there )() wat)

This is impossible for regexes to handle because counting the opening parentheses is required to match the closing parens, and regexes can’t count arbitrarily high. If you want to parse and manipulate structures like this, that have balance and nesting, regex will only bring you heartache.
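By contrast, a few lines of ordinary code with a counter (a degenerate stack) handle this easily. A minimal sketch, with a function name of my choosing:

```python
def is_balanced(text):
    """Check that every ")" closes an earlier "(" and none are left open."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared before its opener.
                return False
    return depth == 0


print(is_balanced("(hello (()there)() wat)"))  # True
print(is_balanced("(hello ((there )() wat)"))  # False
```

The counter is exactly the “memory” that the formal definition of a regular expression lacks: it can grow as large as the input requires.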

The next level up from regular expressions and regular languages are the two equivalent theories of context-free grammars and pushdown automata. A pushdown automaton is literally a regular expression (a finite state machine) equipped with a simple kind of memory called a stack. Rather than dwell on the mechanics, we’ll see how context-free grammars work, since if you can express your document as a context free grammar, a tool called a *parser generator* will give you a parsing program for free. Then a few simple lines of code allow you to manipulate the parsed representation, and produce the output document.

The standard (abstract) notation of a context-free grammar is called Extended Backus-Naur Form (EBNF). It’s a “metasyntax”, i.e., a syntax for describing syntax. In EBNF, you describe *rules* and *terminals*. Terminals are sequences of constant patterns, like

```
OFFSET_DOLLAR_DELIMITER = $$
OFFSET_LEFT_DELIMITER = \[
OFFSET_RIGHT_DELIMITER = \]
```

A rule is an “or” of sequences of other rules or terminals. It’s much easier to show an example:

```
char = "a" | "b" | "c"
offset = OFFSET_DOLLAR_DELIMITER char OFFSET_DOLLAR_DELIMITER
       | OFFSET_LEFT_DELIMITER char OFFSET_RIGHT_DELIMITER
```

The above describes the structure of any string that looks like offset math mode, but with a single “a” or a single “b” or a single “c” inside, e.g, “\[b\]”. You can see some more complete examples on Wikipedia, though they use a slightly different notation.

With some help from a practical library’s built-in identifiers for things like “arbitrary text” we can build a grammar that covers all of the ways to do latex math mode.

```
latex = content
content = content mathmode content | TEXT | EMPTY
mathmode = OFFSETDOLLAR TEXT OFFSETDOLLAR
         | OFFSETOPEN TEXT OFFSETCLOSE
         | INLINEOPEN TEXT INLINECLOSE
         | INLINE TEXT INLINE

INLINE = $
INLINEOPEN = \(
INLINECLOSE = \)
OFFSETDOLLAR = $$
OFFSETOPEN = \[
OFFSETCLOSE = \]
```

Here we’re taking advantage of the fact that we can’t nest mathmode inside of mathmode in LaTeX (you probably can, but I’ve never seen it), by defining the mathmode rule to contain only text, and not other instances of the “content” rule. This rules out some ambiguities, such as whether “$x$ y $z$” is a nested mathmode or not.

We may not need the counting powers of context-free grammars, yet EBNF is easier to manage than regular expressions. You can apply context-sensitive rules to matches, whereas with regexes that would require coordination between separate passes. The order of operations is less sensitive; because the parser generator knows about all patterns you want to match in advance, it will match longer terminals before shorter—more ambiguous—terminals. And if we wanted to do operations on all four kinds of math mode, this allows us to do so without complicated chains of regular expressions.
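To illustrate the “longer terminals before shorter” point without any library, here is a hand-rolled sketch: a tiny tokenizer that tries delimiters longest-first, plus a converter that tracks whether it is inside math mode. This is not how a parser generator works internally and it is not the code in the repository; the function names are mine, and like the regex version it ignores escaped dollar signs.

```python
# Longest delimiters first, so "$$" is never mistaken for two "$".
DELIMITERS = ["$$", "\\[", "\\]", "\\(", "\\)", "$"]
REPLACEMENTS = {"$$": ("\\[", "\\]"), "$": ("\\(", "\\)")}


def tokenize(tex):
    """Split tex into ("DELIM", s) and ("TEXT", s) tokens."""
    tokens, text, i = [], [], 0
    while i < len(tex):
        for delim in DELIMITERS:
            if tex.startswith(delim, i):
                if text:
                    tokens.append(("TEXT", "".join(text)))
                    text = []
                tokens.append(("DELIM", delim))
                i += len(delim)
                break
        else:
            # No delimiter starts here, so this is ordinary text.
            text.append(tex[i])
            i += 1
    if text:
        tokens.append(("TEXT", "".join(text)))
    return tokens


def convert(tex):
    """Rewrite dollar delimiters, using context to pick opener vs closer."""
    out, inside_math = [], False
    for kind, value in tokenize(tex):
        if kind == "DELIM" and value in REPLACEMENTS:
            opener, closer = REPLACEMENTS[value]
            out.append(closer if inside_math else opener)
            inside_math = not inside_math
        else:
            out.append(value)
    return "".join(out)


print(convert("a $x$ b $$y$$ c \\[z\\]"))
```

Because all the delimiters are known up front, a single pass handles both dollar styles and leaves the bracket styles untouched, with no ordering trick between separate passes.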

The history of parsers is long and storied, and the theory of generating parsing programs from specifications like EBNF is basically considered a solved problem. However, there are a *lot* of parser generators out there. And, like regular expression engines, they each have their own flavor of EBNF—or, as is more popular nowadays, they have you write your EBNF using the features of the language the parser generator is written in. And finally, a downside of using a parser generator is that you have to then write a program to operate on the parsed representation (which also differs by implementation).

We’ll demonstrate this process by using a Python library that, in my opinion, stays pretty faithful to the EBNF heritage. It’s called lark and you can pip-install it as

pip install lark-parser

Note: the hard-core industry standard parser generators are antlr, lex, and yacc. I would not recommend them for small parsing jobs, but if you’re going to do this as part of a company, they are weighty, weathered, well-documented—unlike lark.

Lark is used entirely inside python, and you specify the EBNF-like grammar as a string. For example, ours is

    tex: content+

    ?content: mathmode | text+

    mathmode: OFFSETDOLLAR text+ OFFSETDOLLAR
        | OFFSETOPEN text+ OFFSETCLOSE
        | INLINEOPEN text+ INLINECLOSE
        | INLINE text+ INLINE

    INLINE: "$"
    INLINEOPEN: "\\("
    INLINECLOSE: "\\)"
    OFFSETDOLLAR: "$$"
    OFFSETOPEN: "\\["
    OFFSETCLOSE: "\\]"

    ?text: /./s

You can see the similarities with our “raw” EBNF. The main differences are the use of + for matching “one or more” of a rule, and the use of a regular expression to define the “text” rule as any character (here again the trailing “s” means: allow the dot character to match newline characters). The backslashes are needed because backslash is an escape character in Python. And finally, the question mark tells lark to try to compress the tree if it only matches one item (you can see what the difference is by playing with our display-parsed-tree.py script, which shows the parsed representation of the input document). You can read more in lark’s documentation about the structure of the parsed tree as Python objects (Tree for rule/terminal matches and Token for individual characters).

For the input “Let $x=0$”, the parsed tree is as follows (note that the ? makes lark collapse the many “text” matches):

```
Tree(tex,
     [Tree(content,
           [Token(__ANON_0, 'L'),
            Token(__ANON_0, 'e'),
            Token(__ANON_0, 't'),
            Token(__ANON_0, ' ')]),
      Tree(mathmode,
           [Token(INLINE, '$'),
            Token(__ANON_0, 'x'),
            Token(__ANON_0, '='),
            Token(__ANON_0, '0'),
            Token(INLINE, '$')]),
      Token(__ANON_0, '\n')])
```

So now we can write a simple python program that traverses this tree and converts the delimiters. The entire program is on Github, but the core is

    def join_tokens(tokens):
        return ''.join(x.value for x in tokens)


    def handle_mathmode(tree_node):
        '''Switch on the different types of math mode, and convert
        the delimiters to the desired output, and then concatenate the
        text between.'''
        starting_delimiter = tree_node.children[0].type

        if starting_delimiter in ['INLINE', 'INLINEOPEN']:
            return '\\(' + join_tokens(tree_node.children[1:-1]) + '\\)'
        elif starting_delimiter in ['OFFSETDOLLAR', 'OFFSETOPEN']:
            return '\\[' + join_tokens(tree_node.children[1:-1]) + '\\]'
        else:
            raise Exception("Unsupported mathmode type %s" % starting_delimiter)


    def handle_content(tree_node):
        '''Each child is a Token node whose text we'd like to concatenate
        together.'''
        return join_tokens(tree_node.children)
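As a quick check of the traversal logic, here is a sketch that runs the same delimiter conversion on a hand-built tree for “Let $x=0$”. `Token` and `Tree` below are hypothetical namedtuple stand-ins for lark’s classes (so the snippet runs without lark installed), and `convert` is an illustrative driver I wrote for this example, not the script from the repository.

```python
from collections import namedtuple

# Hypothetical stand-ins for lark's Tree and Token objects, so that this
# sketch runs without lark installed. Only the attributes used are modeled.
Token = namedtuple("Token", ["type", "value"])
Tree = namedtuple("Tree", ["data", "children"])


def join_tokens(tokens):
    return "".join(x.value for x in tokens)


def handle_mathmode(tree_node):
    """Convert the delimiters of one mathmode node to \\( \\) or \\[ \\]."""
    starting_delimiter = tree_node.children[0].type
    body = join_tokens(tree_node.children[1:-1])
    if starting_delimiter in ["INLINE", "INLINEOPEN"]:
        return "\\(" + body + "\\)"
    elif starting_delimiter in ["OFFSETDOLLAR", "OFFSETOPEN"]:
        return "\\[" + body + "\\]"
    raise ValueError("Unsupported mathmode type %s" % starting_delimiter)


def convert(tree):
    """Walk the top-level children of the 'tex' node and rebuild the text."""
    pieces = []
    for child in tree.children:
        if isinstance(child, Token):
            pieces.append(child.value)  # a lone, collapsed text token
        elif child.data == "mathmode":
            pieces.append(handle_mathmode(child))
        else:
            pieces.append(join_tokens(child.children))  # a content node
    return "".join(pieces)


# The parsed tree for "Let $x=0$" followed by a newline, built by hand.
example = Tree("tex", [
    Tree("content", [Token("__ANON_0", c) for c in "Let "]),
    Tree("mathmode", [
        Token("INLINE", "$"),
        Token("__ANON_0", "x"),
        Token("__ANON_0", "="),
        Token("__ANON_0", "0"),
        Token("INLINE", "$"),
    ]),
    Token("__ANON_0", "\n"),
])

print(convert(example), end="")
```

The lone newline token at the top level is the case the question mark creates: a “content” match with a single child is collapsed into that child, so the driver has to handle bare tokens as well as subtrees.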

The rest of the program uses lark to create the parser, reads the file from standard input, processes the parsed representation, and outputs the converted document to standard output. You can use the program like this:

python convert-delimiters.py < input.tex > output.tex

**Exercise:** extend this grammar to support literal dollar signs using \$, and pass them through to the output document unchanged.

I personally prefer regular expressions when the job is quick. If my text manipulation rule fits on one line, or can be expressed without requiring “look ahead” or “look behind” rules, regex is a winner. It’s also a winner when I only expect it to fail in a few exceptional cases that can easily be detected and fixed by hand. It’s faster to write a scrappy regex, and then open the output in a text editor and manually fix one or two mishaps, than it is to write a parser.

However, the longer I spend on a regular expression problem, and the more frustrated I get wrestling with it, the more I start to think I should have used a parser all along. This is especially true when dealing with massive jobs, such as converting delimiters in hundreds of blog articles, each thousands of words long, or making changes across all chapter files of a book.

When I need something in between rigid structure and quick-and-dirty, I actually turn to vim. Vim has this fantastic philosophy of “act, repeat, rewind” wherein you find an edit that applies to the thing you want to change, then you search for the next occurrence of the start, try to apply the change again, visually confirm it does the right thing, and if not go back and correct it manually. Learning vim is a major endeavor (for me it feels lifelong, as I’m always learning new things), but since I spend most of my working hours editing structured text the investment and philosophy have paid off.

Until next time!

But being able to track and understand your habits is a *good* thing. It encourages you to be healthier, more financially responsible, or to do more ultimately gratifying activities outside of staring at your phone or computer. If a tracker app is the difference between an alcoholic sticking to their AA plan and a relapse, you shouldn’t have to give up your privacy for it.

Rather than sell my data for convenience, I’ve recently started to make my own tracking apps. Here’s how:

- Make a Google Form for entering data.
- Analyze that data in the linked spreadsheet.

Here’s an example I made as a demo, but which has a real analogue that I use to track bullshit work I have to do at my job, and how long it would take to avoid it. When the amount of time wasted exceeds the time for a permanent fix, I can justify delaying other work. It’s called the Churn Log.

And the linked spreadsheet with the raw response data looks like:

These are super fast to make, and have a number of important benefits:

- I can make them for whatever purpose I want; I don’t need to wait for some software engineers to happen to make an app that fits my needs. One other example I made is a “gift idea log.” I don’t think anyone will ever make this app.
- It lives on my phone just like other apps, since (on Android) you can save a link to a webpage as an icon as if it were a native app.
- It’s fast and uses minimal data.
- You can use it trivially with family members and friends.
- I get an incentive to become a spreadsheet wizard, which makes me better at my job.

The downside is that it’s not *as* convenient as being completely automated. For instance, a finance tracker app can connect to your credit card account to automatically extract purchase history and group it into food, bills, etc. But then again, if I just want a tracker for my food purchases I have to give up data on my book purchases, my alcohol purchases (so many expensive liqueurs), and my obsession with bowties. With the Google Form method, I can quickly enter some data when I’m checking out at the grocery store or paying a check when dining out, and then when I’m interested I can go into the spreadsheet, make a chart or compute some averages, and I have 90% of the insight I care about.

But wait, doesn’t Google then have all your data? Can’t it sell it and send you unwanted magazines?

You’re right that technically Google gets all the data you enter. But with Mobile Legends: Health Tracker! (not a real app) they get to pick the structure of your entered data, so they know exactly what you’re entering. Since Google Forms lets you build a form with arbitrary semantics, it’s virtually impossible that enough people will choose the exact same structure for Google to feasibly make sense of it.

And even if Google *wanted* to be evil and sell your self-tracked data, it wouldn’t be cost effective for Google to do so. The amount of work required to construct a lucrative interpretation of the random choices that humans make in building their own custom tracker app would far outweigh the gains from selling the data. The only reason that little apps like Mobile Legends Health Tracker can make money selling your data is that they suck up system metrics in a structured format whose semantics are known in advance. *Disclosure:* I work for Google. They aren’t paying me to write this (I honestly believe it’s a good idea), and having seen Google’s project management and incentive structure from the inside, I feel confident that custom tracker app data isn’t worthwhile enough to invest in parsing and exploiting. Not to mention how much additional scrutiny Google gets from regulators.

While making apps like this I’ve actually learned a *ton* about spreadsheets that I never knew. For example, you can select an infinite range—e.g., an entire column—and make a chart that will auto-update as the empty cells get filled with new data. You can also create static references to specific cells using dollar signs (absolute references such as $B$1), or give cells names, so that they act as configuration constants.

Even better, Google Sheets has two ways to interact with it externally. You can write Google Apps Script, which is a flavor of JavaScript that allows you to do things like send email alerts on certain conditions. E.g., if you tracked your dining budget you could get an email alert when you’re getting close to the limit. Or you could go full engineer and use the Google Sheets Python API to write whatever program you want to analyze your data. I sketched out a prototype scheduler app where the people involved entered their preferences via a Google Form, and I ran a Python script to pull the data and find a good arrangement that respected people’s preferences. That’s not a tracker app, but you can imagine arbitrarily complicated analysis of your own tracked data.
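As a rough idea of what the “full engineer” end looks like, here is a minimal sketch that analyzes Churn Log style responses. To stay self-contained it reads a CSV export of the response sheet rather than calling the Sheets API, and the column names and rows are entirely made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export of the form responses. The column names and rows
# are invented for this example.
RESPONSES_CSV = """\
Timestamp,Task,Minutes wasted,Minutes to fix permanently
2019-03-01 09:12:00,Restart flaky build,15,240
2019-03-04 14:30:00,Restart flaky build,20,240
2019-03-05 10:05:00,Manually rotate logs,10,60
2019-03-08 16:45:00,Manually rotate logs,25,60
2019-03-11 11:20:00,Manually rotate logs,30,60
"""


def summarize_churn(csv_text):
    """Return {task: (total minutes wasted, estimated minutes to fix)}."""
    wasted = defaultdict(int)
    fix_cost = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        task = row["Task"]
        wasted[task] += int(row["Minutes wasted"])
        # Keep the most recent estimate of the permanent fix.
        fix_cost[task] = int(row["Minutes to fix permanently"])
    return {task: (wasted[task], fix_cost[task]) for task in wasted}


for task, (total, fix) in summarize_churn(RESPONSES_CSV).items():
    verdict = "worth fixing" if total > fix else "keep logging"
    print("%s: wasted %d min, fix costs %d min -> %s" % (task, total, fix, verdict))
```

The same comparison could be done with a spreadsheet formula; the point of the script is only that once the data is in a plain table, any analysis you can program is available.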

The beauty of this method is that it puts the power back in your hands, and has a gradual learning curve. If you or your friend has never written programs before, this gives an immediate and relevant application. You can start with simple spreadsheet tools (SUM and IF functions, charting and cross-sheet references), graduate to Apps Script for scheduled checks and alerts, and finally to a fully fledged programming language, if the need arises.

I can’t think of a better way to induct someone into the empowering world of Automating Tedious Crap and gaining insights from data. We as programmers (and generally tech-inclined people) can help newcomers get set up. And my favorite part: most useful analyses require learning just a little bit of math and statistics.

The solution is in a paper of Rodrigo Restrepo from the 1950s. In this post I’ll start detailing how I study this paper, and talk through my thought process for approaching a bag of theorems and proofs. If you want to follow along, I re-typeset the paper on Github.

The Introduction starts with a summary of the setting of game theory. I remember most of this so I will just summarize the basics of the field. Skip ahead if you already know what the minimax theorem is, and what I mean when I say the “value” of a game.

A two-player *game* consists of a set of actions for each player—which may be finite or infinite, and need not be the same for both players—and a *payoff* function for each possible choice of actions. The payoff function is interpreted as the “utility” that player 1 gains and player 2 loses. If the payoff is negative, you interpret it as player 1 losing utility to player 2. Utility is just a fancy way of picking a common set of units for what each player treasures in their heart of hearts. Often it’s stated as money and we assume both players value cash the same way. Games in which the utility is always “one player gains exactly the utility lost by the other player” are called *zero-sum*.

With a finite set of actions, the payoff function is a table. For rock-paper-scissors the table is:

Rock, paper: -1

Rock, scissors: 1

Rock, rock: 0

Paper, paper: 0

Paper, scissors: -1

Paper, rock: 1

Scissors, paper: 1

Scissors, scissors: 0

Scissors, rock: -1

You could arrange this in a matrix and analyze the structure of the matrix, but we won’t. It doesn’t apply to our forthcoming setting where the players have infinitely many strategies.

A *strategy* is a possibly-randomized algorithm (whose inputs are just the data of the game, not including any past history of play) that outputs an action. In some games, the optimal strategy is to choose a single action no matter what your opponent does. This is sometimes called a *pure, dominating* strategy, not because it dominates your opponent, but because it’s better than all of your other options no matter what your opponent does. The output action is deterministic.

However, as with rock-paper-scissors, the optimal strategy for most interesting games requires each player to act randomly according to a fixed distribution. Such strategies are called *mixed* or *randomized*. For rock-paper-scissors, the optimal strategy is to choose rock, paper, and scissors with equal probability. Computers are only better than humans at rock-paper-scissors because humans are bad at behaving consistently and uniformly random.

The famous minimax theorem says that every two-player zero-sum game has an optimal strategy for each player, which is possibly randomized. This strategy is optimal in the sense that it maximizes your expected winnings no matter what your opponent does. However, if your opponent is playing a particularly suboptimal strategy, the minimax solution might not be as good as a solution that takes advantage of the opponent’s dumb choices. A uniform random rock-paper-scissors strategy is not optimal if your opponent always plays “rock.” However, the optimal strategy doesn’t need special knowledge or space to store information about past play. If you played against God, you would blindly use the minimax strategy and God would have no upper hand. I wonder if the pope would have excommunicated me for saying that in the 1600’s.
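Both claims about rock-paper-scissors are easy to check by hand, and a few lines of Python make the check explicit. This is a small sketch using the payoff table above; the helper names (`payoff`, `expected_payoff`, `pure`) are my own for this example.

```python
from fractions import Fraction

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(x, y):
    """Payoff to player 1 when player 1 plays x and player 2 plays y."""
    if x == y:
        return 0
    return 1 if BEATS[x] == y else -1


def expected_payoff(p, q):
    """Expected payoff to player 1 for mixed strategies p and q (dicts)."""
    return sum(p[x] * q[y] * payoff(x, y) for x in ACTIONS for y in ACTIONS)


def pure(action):
    """The mixed strategy that always plays one action."""
    return {a: Fraction(int(a == action)) for a in ACTIONS}


uniform = {a: Fraction(1, 3) for a in ACTIONS}

# Uniform play earns exactly 0 against every pure reply.
for reply in ACTIONS:
    print(reply, expected_payoff(uniform, pure(reply)))

# Against an opponent who always plays rock, always playing paper does better.
print(expected_payoff(pure("paper"), pure("rock")))
```

Since any mixed reply is an average of pure replies, earning 0 against each pure reply means the uniform strategy earns 0 against everything, which is the sense in which it cannot be exploited.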

The expected winnings for player 1 when both players play a minimax-optimal strategy is called the *value* of the game, and this number is unique (even if there are possibly multiple optimal strategies). If a game is symmetric—meaning both players have the same actions and the payoff function is symmetric—then the value is guaranteed to be zero. The game is fair.

The version of the minimax theorem that most people use (in particular, the version that often comes up in theoretical computer science) shows that finding an optimal strategy is equivalent to solving a linear program. This is great because it means that any such (finite) game is easy to solve. You don’t need insight; just compile and run. The minimax theorem is also true for sufficiently well-behaved continuous action spaces. The silent duel is well-behaved, so our goal is to compute an explicit, easy-to-implement strategy that the minimax theorem guarantees exists. As a side note, here is an example of a poorly-behaved game with no minimax optimum.
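For reference, the linear program for player 1 in a finite game can be written as follows. This is the standard textbook formulation, not something taken from Restrepo’s paper. The symbols are chosen here for illustration: $X$ and $Y$ are the players’ action sets, $p$ is player 1’s mixed strategy, $\Psi$ is the payoff function, and $v$ is the payoff that $p$ guarantees.

```latex
\begin{aligned}
\text{maximize}\quad & v \\
\text{subject to}\quad & \sum_{x \in X} p(x)\,\Psi(x, y) \ge v \quad \text{for every } y \in Y, \\
& \sum_{x \in X} p(x) = 1, \qquad p(x) \ge 0 \quad \text{for every } x \in X.
\end{aligned}
```

The variables are $v$ and the probabilities $p(x)$, and every constraint is linear in them, which is why an off-the-shelf solver suffices for finite games.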

While the minimax theorem guarantees optimal strategies and a value, the concept of the “value” of the game has an independent definition:

Let $X, Y$ be finite sets of actions for players 1, 2 respectively, and $p, q$ be strategies, i.e., probability distributions over $X$ and $Y$ so that $p(x)$ is the probability that $x$ is chosen. Let $\Psi(x, y)$ be the payoff function for the game. The *value of the game* is a real number $v$ such that there exist two strategies $p, q$ with the two following properties. First, for every fixed $y \in Y$,

$$\sum_{x \in X} p(x) \Psi(x, y) \geq v$$

(no matter what player 2 does, player 1’s strategy guarantees at least $v$ payoff), and for every fixed $x \in X$,

$$\sum_{y \in Y} q(y) \Psi(x, y) \leq v$$

(no matter what player 1 does, player 2’s strategy prevents a loss of more than $v$).

Since silent duels are continuous, Restrepo opens the paper with the corresponding definition for continuous games. Here a probability distribution is the same thing as a “positive measure with total measure 1.” Restrepo uses $F$ and $G$ for the strategies, and the corresponding statement of expected payoff for player 1 is that, for all fixed actions $y \in Y$,

$$\int \Psi(x, y) \, dF(x) \geq v$$

And likewise, for all $x \in X$,

$$\int \Psi(x, y) \, dG(y) \leq v$$

All of this background gets us through the very first paragraph of the Restrepo paper. As I elaborate in my book, this is par for the course for math papers, because written math is optimized for experts already steeped in the context. Restrepo assumes the reader knows basic game theory so we can get on to the details of his construction, at which point he slows down considerably to focus on the details.

Starting in section 2, Restrepo describes the construction of the optimal strategy, but first he explains the formal details of the setting of the game. We already know the two players are taking $n$ and $m$ actions between $0 \leq t \leq 1$, but we also fix the probability of success. Player 1 knows a distribution on $[0,1]$ for which $P(t)$ is the probability of success when acting at time $t$. Likewise, player 2 has a possibly different distribution $Q(t)$, and (crucially) $P(t), Q(t)$ both *increase continuously* on $[0,1]$. (In section 3 he clarifies further that $P$ satisfies $P(0) = 0$, $P(1) = 1$, and $P'(t) > 0$, likewise for $Q(t)$.) Moreover, both players know *both* $P, Q$. One could say that each player has an estimate of their opponent’s firing accuracy, and wants to be optimal compared to that estimate.

The payoff function is defined informally as: 1 if Player 1 succeeds before Player 2, -1 if Player 2 succeeds first, and 0 if both players exhaust their actions before the end and none succeed. Though Restrepo does not state it, if the players act and succeed at the same time (say both players fire at time $t=1$), the payoff should also be zero. We’ll see how this is converted to a more formal (and cumbersome!) mathematical definition in a future post.
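The informal payoff is simple enough to simulate directly. Here is a minimal sketch of one play of the duel, given the times at which each player has decided to act; the function `play_duel` and its tie handling (simultaneous success counts as a draw) are my own reading of the informal description, not Restrepo’s formal definition.

```python
import random


def play_duel(times1, times2, P, Q, rng=random):
    """Simulate one silent duel and return the payoff to player 1.

    times1, times2: the times in [0, 1] at which each player acts.
    P, Q: success probability as a function of time for players 1 and 2.
    """
    # Merge both players' actions into one timeline.
    events = sorted([(t, 1) for t in times1] + [(t, 2) for t in times2])
    i = 0
    while i < len(events):
        t = events[i][0]
        hit1 = hit2 = False
        # Resolve every action taken at exactly time t together.
        while i < len(events) and events[i][0] == t:
            player = events[i][1]
            if player == 1:
                hit1 = hit1 or rng.random() < P(t)
            else:
                hit2 = hit2 or rng.random() < Q(t)
            i += 1
        if hit1 and not hit2:
            return 1
        if hit2 and not hit1:
            return -1
        if hit1 and hit2:
            return 0  # simultaneous success: treated as a draw
    return 0  # everyone missed


def linear(t):
    return t


# Average payoff over many plays when player 1 fires at 0.5, player 2 at 0.7.
rng = random.Random(0)
trials = 10000
average = sum(
    play_duel([0.5], [0.7], linear, linear, rng) for _ in range(trials)
) / trials
print(average)
```

With fixed firing times like these the expected payoff can be computed by hand ($0.5 - 0.5 \cdot 0.7 = 0.15$), which makes the simulation a handy sanity check before moving on to randomized strategies.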

Next we’ll describe the statement of the fully general optimal strategy (which will be essentially meaningless, but have some notable features we can infer information from), and get a sneak peek at how to build this strategy algorithmically. Then we’ll see a simplified example of the optimal strategy.

The optimal strategy presented depends only on the values $n, m$ (the number of actions each player gets) and their success probability distributions $P, Q$. For player 1, the strategy splits up $[0,1]$ into subintervals

$$[0,1] = [0, a_1] \cup [a_1, a_2] \cup \cdots \cup [a_n, a_{n+1}], \qquad 0 < a_1 < \cdots < a_n < a_{n+1} = 1$$

Crucially, this strategy *ignores* the initial interval $[0, a_1]$. In each other subinterval Player 1 attempts an action at a time chosen by a probability distribution specific to that interval, independently of previous attempts. But no matter what, there is some initial wait time during which no action will ever be taken. This makes sense: if player 1 fired at time 0, it is a guaranteed wasted shot. Likewise, firing at time 0.000001 is basically wasted (due to continuity, unless $P(t)$ is obnoxiously steep early on).

Likewise for player 2, the optimal strategy is determined by numbers $b_1, \dots, b_m$ resulting in intervals $[b_j, b_{j+1}]$ with $b_{m+1} = 1$.

The difficult part of the construction is describing the distributions dictating when a player should act during an interval. It’s difficult because an interval for player 1 and player 2 can overlap partially. Maybe $[a_1, a_2]$ and $[b_1, b_2]$ overlap with $b_1 < a_1 < b_2 < a_2$. Player 1 knows that Player 2 (using their corresponding minimax strategy) must act before time $b_2$, and gets another chance after that time. This suggests that the distribution determining when Player 1 should act within $[a_1, a_2]$ may have a discontinuous jump at $b_2$.

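Call $F_i$ the distribution for Player 1 to act in the interval $[a_i, a_{i+1}]$. Since it is a continuous distribution, Restrepo uses $F_i$ for the cumulative distribution function and $dF_i$ for the probability density function. Then these functions are defined by (note this should be mostly meaningless for the moment)

$$dF_i(x_i) = \begin{cases} h_i f^*(x_i) \, dx_i & \text{if } a_i < x_i < a_{i+1} \\ 0 & \text{if } x_i \notin [a_i, a_{i+1}] \end{cases}$$

where $f^*$ is defined as

$$f^*(t) = \prod_{b_j > t} \left[ 1 - Q(b_j) \right] \frac{Q'(t)}{Q^2(t) P(t)}$$

The constants $h_i$ and $h_{i+1}$ are related by the equation

$$h_i = [1 - D_i] \, h_{i+1}$$

where

$$D_i = \int_{a_i}^{a_{i+1}} P(t) \, dF_i(t)$$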

What can we glean from this mashup of symbols? The first is that (obviously) the distribution is zero outside the interval $[a_i, a_{i+1}]$. Within it, there is this mysterious $h_i$ that is related to the $h_{i+1}$ used to define the *next* interval’s probability. This suggests we will likely build up the strategy in reverse starting with $F_n$ as the “base case” (if $n=1$, then it is the only one).

Next, we notice the curious definition of $f^*$. It unsurprisingly requires knowledge of both $P$ and $Q$, but the coefficient is strangely chosen: it’s a product over all failure probabilities ($1 - Q(b_j)$) of all interval-starts happening later for the opponent.

[Side note: it’s very important that this is a constant; when I first read this, I thought that it was $\prod_{b_j > t} (1 - Q(t))$, which makes the eventual task of integrating $f^*$ *much* harder.]

Finally, the last interval (the one ending at $t=1$) may include the option to simply “wait for a guaranteed hit,” which Restrepo calls a “discrete mass of $\alpha$ at $t=1$.” That is, $F_n$ may have a different representation than the rest. Indeed, at the end of the paper we will find that Restrepo gives a base-case definition for $h_n$ that allows us to bootstrap the construction.

Player 2’s strategy is the same as Player 1’s, but replacing the roles of $P, Q, n, m, a_i, b_j$ in the obvious way.
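To see why the constant coefficient matters, here is a small sketch that evaluates $f^*$ for player 1, assuming it has the form $f^*(t) = \prod_{b_j > t} (1 - Q(b_j)) \cdot Q'(t) / (Q^2(t) P(t))$. The function name `f_star` and its argument layout are my own choices for illustration.

```python
def f_star(t, b, P, Q, dQ):
    """Evaluate f*(t) for player 1, assuming the formula

        f*(t) = prod_{b_j > t} (1 - Q(b_j)) * Q'(t) / (Q(t)^2 * P(t)).

    b: the interval starts b_1 < ... < b_m of the opponent.
    P, Q: success probability functions; dQ: the derivative of Q.
    """
    # The product depends on t only through which b_j lie to the right of t,
    # so it is constant between consecutive b_j.
    coefficient = 1.0
    for b_j in b:
        if b_j > t:
            coefficient *= 1 - Q(b_j)
    return coefficient * dQ(t) / (Q(t) ** 2 * P(t))


def identity(t):
    return t


def one(t):
    return 1.0


# Example with P(t) = Q(t) = t and a single opponent interval start at 1/3.
print(f_star(0.5, [1 / 3], identity, identity, one))   # right of b_1: 1 / t^3
print(f_star(0.25, [1 / 3], identity, identity, one))  # left of b_1: (2/3) / t^3
```

Between consecutive $b_j$ the coefficient does not change, so on each such piece $f^*$ is just a constant multiple of $Q'(t) / (Q^2(t) P(t))$, which is what keeps the integrals manageable.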

As with most math research, the best way to parse a complicated definition or construction is to simplify the different aspects of the problem until they become tractable. One way to do this is to have only a single action for both players, with $P = Q$. Restrepo provides a more general example to demonstrate, which results in the five most helpful lines in the paper. I’ll reproduce them here verbatim:

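EXAMPLE. Symmetric Game: $P(t) = Q(t)$ and $m = n$. In this case the two players have the same optimal strategies; $\alpha = \beta = 0$, and $a_k = b_k$, $k = 1, \dots, n$. Furthermore

$$\begin{aligned} P(a_{n-k}) &= \frac{1}{2k+3}, \qquad k = 0, 1, \dots, n-1, \\ dF_{n-k}(t) &= \begin{cases} \frac{1}{4(k+1)} \frac{P'(t)}{P^3(t)} \, dt & \text{if } a_{n-k} < t < a_{n-k+1}, \\ 0 & \text{if } t \notin [a_{n-k}, a_{n-k+1}]. \end{cases} \end{aligned}$$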

Saying $\alpha = \beta = 0$ means there is no “wait until $t=1$ to guarantee a hit”, which makes intuitive sense. You’d only want to do that if your opponent has exhausted all their actions before the end, which is only likely to happen if they have fewer actions than you do.

When Restrepo writes $P(a_{n-k}) = \frac{1}{2k+3}$, there are a few things happening. First, we confirm that we’re working backwards from $a_n$. Second, he’s implicitly saying “*choose* $a_{n-k}$ such that $P(a_{n-k})$ has the desired cumulative density.” After a bit of reflection, there’s no other way to specify the $a_i$ except implicitly: we don’t have a formula for $P$ to lean on.

Finally, the definition of the density function $dF$ helps us understand under what conditions the probability function would be increasing or decreasing from the start of the interval to the end. Looking at the expression $P'(t) / P^3(t)$, we can see that polynomials will result in an expression dominated by $1/t^k$ for some $k$, which is decreasing. By taking the derivative, an increasing density would have to be built from a $P$ satisfying $P''(t) / P'(t) - 3 P'(t) / P(t) > 0$. However, I wasn’t able to find any examples that satisfy this. Polynomials, square roots, logs and exponentials, all seem to result in decreasing density functions.
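The condition for an increasing density comes from differentiating $P'(t) / P^3(t)$ directly. As a worked step, here is the derivative followed by a check on the monomials $P(t) = t^k$, which I picked as the simplest test case:

```latex
\frac{d}{dt}\left[\frac{P'(t)}{P^3(t)}\right]
  = \frac{P''(t)}{P^3(t)} - \frac{3\,P'(t)^2}{P^4(t)}
  = \frac{P'(t)}{P^3(t)}\left[\frac{P''(t)}{P'(t)} - \frac{3\,P'(t)}{P(t)}\right].

% Example: P(t) = t^k with k \ge 1
\frac{P''(t)}{P'(t)} - \frac{3\,P'(t)}{P(t)}
  = \frac{k-1}{t} - \frac{3k}{t}
  = -\frac{2k+1}{t} < 0 \quad \text{for } t > 0.
```

Since $P'(t) / P^3(t)$ is positive wherever $P$ and $P'$ are, the sign of the derivative is the sign of the bracket, and for every monomial that bracket is negative.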

Finally, we’ll plot two examples. The first is the most reductive: $P(t) = Q(t) = t$, and $n = m = 1$. In this case $n=1$, and there is only one term $k=0$, for which $a_n = 1/3$. Then $dF_1(t) = 1/(4t^3)$. (For verification, note the integral of $dF_1$ on $[1/3, 1]$ is indeed 1).

Note that the reason $a_n$ is so nice is that $P(t)$ is so simple. If $P(t)$ were, say, $t^2$, then $a_n$ should shift to being $\sqrt{1/3}$. If $P(t)$ were more complicated, we’d have to invert it (or use an approximate search) to find the location $a_n$ for which $P(a_n) = 1/3$.
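The “approximate search” can be as simple as bisection, since $P$ is continuous and increasing. Here is a minimal sketch, assuming the symmetric-game condition $P(a_{n-k}) = 1/(2k+3)$; the names `invert_increasing` and `interval_starts` are made up for this example.

```python
import math


def invert_increasing(P, target, lo=0.0, hi=1.0, tolerance=1e-12):
    """Find t in [lo, hi] with P(t) = target, for continuous increasing P."""
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if P(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def interval_starts(P, n):
    """Return [a_1, ..., a_n] with P(a_{n-k}) = 1 / (2k + 3), k = 0..n-1."""
    starts = [invert_increasing(P, 1 / (2 * k + 3)) for k in range(n)]
    return starts[::-1]  # k = 0 gives a_n, so reverse into increasing order


def square(t):
    return t * t


# With P(t) = t^2 the single interval start should be close to sqrt(1/3).
print(interval_starts(square, 1)[0], math.sqrt(1 / 3))

# With P(t) = t and four actions the starts should be close to 1/9, 1/7, 1/5, 1/3.
print(interval_starts(lambda t: t, 4))
```

Bisection only needs to evaluate $P$, so it works even when there is no closed form for the inverse.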

Next, we loosen the example to let $n = m = 4$, still with $P(t) = Q(t) = t$. In this case, we have the same final interval $[1/3, 1]$. The new actions all occur in the time before $1/3$, in the intervals $[1/5, 1/3], [1/7, 1/5], [1/9, 1/7]$. If there were more actions, we’d get smaller inverse-of-odd-spaced intervals approaching zero. The probability densities are now steeper versions of the same $1/(4t^3)$, with the constant getting smaller to compensate for the fact that $1/t^3$ gets larger and maintain the normalized distribution. For example, the earliest interval results in $\int_{1/9}^{1/7} \frac{1}{16 t^3} \, dt = 1$. Closer to zero the densities are somewhat shallower compared to the size of the interval; for example in $[1/9, 1/7]$ the density toward the beginning of the interval is only about twice as large as the density toward the end.
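These numbers can be verified exactly with rational arithmetic. The sketch below assumes $P(t) = t$, so that the interval for index $k$ is $[1/(2k+3), 1/(2k+1)]$ and the density is $1/(4(k+1)t^3)$; the helper names are my own.

```python
from fractions import Fraction


def interval(k):
    """Interval [a_{n-k}, a_{n-k+1}] when P(t) = t: [1/(2k+3), 1/(2k+1)]."""
    return Fraction(1, 2 * k + 3), Fraction(1, 2 * k + 1)


def total_mass(k):
    """Integral of 1 / (4 (k+1) t^3) over interval(k), via the antiderivative
    -1 / (8 (k+1) t^2)."""
    left, right = interval(k)
    scale = Fraction(1, 8 * (k + 1))
    return scale * (1 / left ** 2 - 1 / right ** 2)


def density_ratio(k):
    """Density at the start of interval(k) divided by density at its end."""
    left, right = interval(k)
    return (right / left) ** 3


for k in range(4):
    left, right = interval(k)
    print(k, left, right, total_mass(k), float(density_ratio(k)))
```

Every interval carries total mass exactly 1, and the ratio between the density at the start and at the end of an interval drops from 27 on $[1/3, 1]$ to $729/343$, a little over 2, on $[1/9, 1/7]$.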

Since the early intervals are getting smaller and smaller as we add more actions, the optimal strategy will resemble a burst of action at the beginning, gradually tapering off as the accuracy increases and we work through our budget. This is an explicit tradeoff between the value of winning (lots of early, low probability attempts) and keeping some actions around for the end where you’re likely to succeed.
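To get a feel for this burst-then-taper behavior, one can sample action times from the symmetric strategy. The sketch below assumes $P(t) = t$, where the density $1/(4(k+1)t^3)$ on $[1/(2k+3), 1/(2k+1)]$ has a cumulative distribution that can be inverted by hand; the inverse-CDF formula is my own derivation, not something stated in the paper.

```python
import random


def sample_action_time(k, u):
    """Inverse-CDF sample for the interval [1/(2k+3), 1/(2k+1)].

    Assumes P(t) = t, so the density is 1 / (4 (k+1) t^3) and the CDF is
    F(t) = ((2k+3)^2 - 1/t^2) / (8 (k+1)). Solving F(t) = u for t gives the
    expression below. u should be uniform in [0, 1].
    """
    return ((2 * k + 3) ** 2 - 8 * (k + 1) * u) ** -0.5


def sample_strategy(n, rng=random):
    """Draw one action time per interval, returned in increasing order."""
    return [sample_action_time(k, rng.random()) for k in reversed(range(n))]


rng = random.Random(1)
print(sample_strategy(4, rng))
```

Each call draws the times independently, one per interval, matching the description of the strategy as a separate distribution on each subinterval.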

At this point, we’ve parsed the general statement of the theorem, and while much of it is still mysterious, we extracted some useful qualitative information from the statement, and tinkered with some simple examples.

I’m now confident that the simple symmetric example Restrepo provided is correct; it passed some basic unit tests, like that each $dF_i$ is normalized. My next task in fully understanding the paper is to be able to derive the symmetric example from the general construction. We’ll do this next time, and include a program that constructs the optimal solution for any input.

Until then!
