A Good Year for “A Programmer’s Introduction to Mathematics”

A year ago today I self-published “A Programmer’s Introduction to Mathematics” (PIM). In this short note I want to describe the success it’s had, summarize the complaints of some readers and the praise of others, and outline what’s next.

Since publication PIM has sold over 11,000 copies.


A rough chart showing the sales of paperback and ebook copies of PIM.

Here’s a table:

Month    Paperback  Ebook  Total
2018-12      4,323  1,866  6,189
2019-01      1,418    258  1,676
2019-02        852    128    980
2019-03        762    107    869
2019-04        357     63    420
2019-05        289     51    340
2019-06        200     58    258
2019-07        223     40    263
2019-08        158     43    201
2019-09        159     24    183
2019-10        155     44    199

As expected, the sales dropped significantly after the first month, but leveled out at around 200 copies per month. With self-publishing, royalties are generally high, and in the end I make a minimum of $15 on each sale. I’m thrilled. The book has been successful well beyond my expectations. I remember joking with friends that I’d call the project a failure if I didn’t sell a thousand copies, and my wife joking that if I sold twenty thousand I should quit my job. Reality found a happy middle ground.

Outside of my normal activity online, a pre-publication mailing list, and a $400/month cap on a Google Ads campaign, I’ve done little to promote the book. I don’t track conversion rates (pimbook.org is javascript-free). Though I can’t control what Amazon and Google collect on my behalf, I don’t look at it. I have an ad on the sidebar of my blog, which I suspect drives most of the sales via organic search results for various math topics.

Combining all the costs together, from Google Ads, Gumroad premium, tech review, and sending signed copies to Patreon patrons—but not including my time—a year of sales has cost about $4,500.

Not all readers enjoyed the book. As the Amazon reviewer “SCB” put it, “the book feels more like a refresher than an introduction.” User “zach” concurred, calling it “not a kind introduction,” and user “MysteryGuest” said its “difficulty ramps up quickly.” The other two emphatic criticisms I heard were that I did not include solutions for the exercises, nor did I provide a version of the ebook in standard ebook formats.

I won’t waste your time here rebutting critics, and I broadly agree with their claims. The difficulty does ramp, I didn’t provide solutions, and I didn’t make a Kindle version. I have no plans to change that as of now.

On the other hand, many readers enjoyed the book. Amazon reviewer “B.W.” remarked, “First time I’ve ever read about BigO and actually had it ‘click’.” User “Scuba Enthusiast” said the book “conveys the cultural assumptions, conventions, and notations in a way that someone familiar with programming is able to ‘grok’.”

Many readers have also told me via less public means they enjoyed the book. A colleague of mine at Google—who is an excellent engineer but not all that interested in math—had the need to dig into some mathematical notation describing an integer linear program his team was responsible for. He said reading the first few chapters of my book helped him decipher the symbols and ultimately understand the model. Later he told me he had started to design his own optimization models.

I was also delighted to read Tim Urian’s thoughts, which he recently posted in a public Google doc. He detailed his struggles with reading the book, and how he overcame those struggles.

Then there was a public form I put up for submitting errata. Users submitted over 200 errors, comments, and clarifications. I’ve addressed them for the second edition, and I’m glad to say most mathematical errors are not substantial. I recognize most of them as cases where I wrote one thing, later changed my mind about some detail (like indexing from zero or one), and did not adequately update what was already written.

Speaking of the second edition, I’ll be releasing it early next year. In addition to the errata, I’ve spruced up a number of the exercises, improved explanations of some things, added helpful reminders in spots readers often got confused, and I’m working on a new short chapter about proofs and logical foundations. Chances are good that it’s not enough to justify buying a second copy, but I think after the second edition I’ll feel comfortable calling the book “done” for the foreseeable future. In the meantime, I’ve made the ebook for the first edition “pay what you want,” so you don’t feel cheated if you buy the ebook the day before the second edition comes out. I will probably leave the first edition ebook free after the second edition comes out.

I’m also very slowly working on the converse book, tentatively titled “A Mathematician’s Introduction to Programming.”

On a related note, I’m sad to say I haven’t taken more time to write blog posts. As my responsibilities at work have grown, I’ve become more swamped and been less motivated. That’s also been coupled with some blog projects I have tried but failed to make meaningful progress on, and I feel bad abandoning them because I feel there is no principled reason I can’t finish them.

Postscript: I’ve been asked a few times about the rights to foreign translations and distributions of the book. I have no experience in that, and I’ve basically been ignoring all incoming requests. If anyone has advice about how to navigate this, and whether it’s worth the effort for a solo operation like my own, I’d be open to hear it.

Silent Duels—Constructing the Solution part 1

Previous posts in this series:

Silent Duels and an Old Paper of Restrepo
Silent Duels—Parsing the Construction

Last time we waded into Restrepo’s silent duel paper. You can see the original and my re-typeset version on Github along with all of the code in this series. We digested Section 2 and a bit further, plotting some simplified examples of the symmetric version of the game.

I admit, this paper is not an easy read. Sections 3-6 present the central definitions, lemmas, and proofs. They’re a slog of dense notation with no guiding intuition. The symbols don’t even get catchy names or “think of this as” explanations! I think this disparity in communication has something to do with the period in which the paper was written, and something to do with the infancy of computing at the time. The paper was published in 1957, which marked the year IBM switched from vacuum tubes to transistors, long before Moore’s Law was even a twinkle in Gordon Moore’s eye. We can’t blame Restrepo for not appealing to our modern sensibilities.

I spent an embarrassing amount of time struggling through Sections 3-6 when I still didn’t really understand the form of the optimal strategy. It’s not until the very end of the paper (Section 8, the proof of Theorem 1) that we get a construction. See the last post for a detailed description of the data that constitutes the optimal strategy. In brief, it’s a partition of [0,1] into subintervals [a_i, a_{i+1}] with a probability distribution F_i on each interval, and the time you take your i-th action is chosen randomly according to F_i. Optionally, the last distribution can include a point-mass at time t=1, i.e., “wait until the last moment for a guaranteed hit.”

Section 8 describes how to choose the a_i‘s and b_j‘s, with the distributions F_i and G_j built according to the formulas built up in the previous sections.

Since our goal is still to understand how to construct the solution—even if why it works is still a mystery—we’ll write a program that implements this algorithm in two posts. First, we’ll work with a simplified symmetric game, where the answer is provided for us as a test case. In a followup post, we’ll rework the code to construct the generic solution, and pick nits about code quality, and the finer points of the algorithm Restrepo leaves out.

Ultimately, if what the program outputs matches up with Restrepo’s examples (in lieu of understanding enough of the paper to construct our own), we will declare victory—we’ll have successfully sanity-checked Restrepo’s construction. Then we can move on to studying why this solution works and what caveats hide beneath the math.

Follow the Formulas

The input to the game is a choice of n actions for player 1, m actions for player 2, and probability distributions P, Q for the two players’ success probabilities (respectively). Here’s the algorithm as stated by Restrepo, with the referenced equations following. If you’re following along with my “work through the paper organically” shtick, I recommend you try parsing the text below before reading on. Recall \alpha is the “wait until the end” probability for player 1’s last action, and \beta is the analogous probability for player 2.


Restrepo’s description of the algorithm for computing the optimal strategy.


The equations referenced above

Let’s sort through this mess.

First, the broad picture. The whole construction depends on \alpha, \beta, these point masses for the players’ final actions. However, there’s this condition that \alpha \beta = 0, i.e., at most one can be nonzero. This makes some vaguely intuitive sense: a player with more actions will have extra “to spare,” and so it may make sense for them to wait until the very end to get a guaranteed hit. But only one player can have such an advantage over the other, so only one of the two parameters may be nonzero. That’s my informal justification for \alpha \beta = 0.

Even though we don’t know \alpha, \beta at the beginning, Restrepo’s construction (the choice of a_i‘s and b_j‘s) is a deterministic function of \alpha, \beta, and the other fixed inputs.

\displaystyle (\alpha, \beta) \mapsto (a_1, \dots, a_n, b_1, \dots, b_m)

The construction asserts that the optimal solution has a_1 = b_1 and we need to find an input (\alpha, \beta) \in [0,1) \times [0,1) such that \alpha \beta = 0 and they produce a_1 = b_1 as output. We’re doing a search for the “right” input parameters: we use knowledge about the chained relationship of the equations to look at the output, and use it to tweak the input so the output gets closer to what we want. It’s not gradient descent, but it could probably be rephrased that way.

In particular, consider the case when we get b_1 < a_1, and the other should be clear. Suppose that starting from \alpha = 0, \beta = 0 we construct all our a_i‘s and b_j‘s and get b_1 < a_1. Then we can try again with \alpha = 0, \beta = 1, but since \beta = 1 is illegal we’ll use \beta = 1 - \varepsilon for as small a \varepsilon as we need (to make the next sentence true). Restrepo claims that picking something close enough to 1 will reverse the output, i.e. will make a_1 < b_1. He then finishes with (my paraphrase), “obviously a_1, b_1 are continuous in terms of \beta, so a solution exists with a_1 = b_1 for some choice of \beta \neq 0; that’s the optimal solution.” Restrepo is relying on the intermediate value theorem from calculus, but to find that value the simplest option is binary search. We have the upper and lower bounds, \beta = 0 and \beta = 1 - \varepsilon, and we know when we found our target: when the output has a_1 = b_1.
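The search described above can be sketched as an ordinary bisection. This is only an illustration under stated assumptions: `construct` is a hypothetical stand-in for Restrepo's construction with \alpha fixed at 0 (it maps \beta to the pair (a_1, b_1)), and `toy_construct` is a made-up continuous function used to exercise the search, not the real formulas.

```python
def find_beta(construct, lo=0.0, hi=1.0 - 1e-9, tol=1e-12, max_iter=200):
    """Bisect on beta until a_1 and b_1 (nearly) agree.

    `construct` is hypothetical: it maps beta to (a_1, b_1) with alpha = 0.
    We assume, as the paper claims, that b_1 - a_1 is continuous in beta,
    negative at `lo`, and positive at `hi`.
    """
    def gap(beta):
        a_1, b_1 = construct(beta)
        return b_1 - a_1

    if not (gap(lo) < 0 < gap(hi)):
        raise ValueError("b_1 - a_1 must change sign between lo and hi")

    for _ in range(max_iter):
        mid = (lo + hi) / 2
        g = gap(mid)
        if abs(g) < tol:
            return mid
        if g < 0:
            lo = mid   # still b_1 < a_1, so beta must grow
        else:
            hi = mid   # overshot to a_1 < b_1, so beta must shrink
    return (lo + hi) / 2


def toy_construct(beta):
    """Toy stand-in, NOT Restrepo's formulas: a_1 is fixed, b_1 grows with beta."""
    return 0.5, 0.25 + beta


beta_star = find_beta(toy_construct)
print(beta_star)  # close to 0.25 for this toy function
```

The case a_1 < b_1 at the start would run the same search over \alpha with the roles swapped.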

This binary search will come back in full swing in the next post, since we already know that the symmetric silent duel starts with a_1 = b_1. No search for \alpha, \beta is needed, and we can fix them both to zero for now—or rather, assume the right values are known.

What remains is to determine how to compute the a_i‘s and b_j‘s from a starting \alpha, \beta. We’ll go through the algorithm step by step using the symmetric game where P=Q (same action success probability) and n=m (same action count) to ground our study. A followup post will revisit these formulas in full generality.

The symmetric game

The basic idea of the construction is that we start from a computation of the last action parameters a_n, b_m, and use those inductively to compute the parameters of earlier actions via a few integrals and substitutions. In other words, the construction is a recursion, and the interval in which the players take their last action [a_n, 1), [b_m, 1) is the base case. As I started writing the programs below, I wanted to give a name to these a_i, b_j values. Restrepo seems to refer to them as “parameters” in the paper. I call them transition times, since they mark the instants at which a player “transitions” from one action interval [a_2, a_3] to the next [a_3, a_4].

For a simple probability function P, the end of the algorithm results in equations similar to: choose a_{n-t} such that P(a_{n-t}) = 1/(2t + 3).

Recall, Player 1 has a special function used in each step to construct their optimal strategy, called f^* by Restrepo. It’s defined for the non-symmetric game as follows, where recall Q is the opponent’s action probability:

\displaystyle f^*(t) = \prod_{b_j > t} (1-Q(b_j))\frac{Q'(t)}{Q^2(t)P(t)}

[Note the Q^2(t) is a product Q(t)Q(t); not an iterated function application.]

Here the b_j > t asks us to look at all the transition times computed in previous recursive steps, and compute the product of an action failure at those instants. This is the product \prod (1-Q(b_j)). This is multiplied by a mysterious fraction involving P, Q, which in the symmetric P=Q case reduces to P'(t) / P^3(t). In Python code, computing f^* is given below—called simply “f_star” because I don’t yet understand how to interpret it in a meaningful way. I chose to use SymPy to compute symbolic integrals, derivatives, and solve equations, so in the function below, prob_fun and prob_fun_var are SymPy expressions and lambdas.

from sympy import diff

def f_star(prob_fun, prob_fun_var, larger_transition_times):
    '''Compute f* as in Restrepo '57.

    In this implementation, we're working in the simplified example
    where P = Q (both players' probabilities of success are the same).
    '''
    x = prob_fun_var
    P = prob_fun

    product = 1
    for a in larger_transition_times:
        product *= (1 - P(a))

    return product * diff(P(x), x) / P(x)**3

In this symmetric instance of the problem we already know that \alpha = \beta = 0 is the optimal solution (Restrepo states that in section 2), so we can fix \alpha=0, and compute a_n, which we do next.

In the paper, Restrepo says “compute without parameters in the definition of f^*” and this I take to mean, because there are no larger action instants, the product \prod_{b_j > t} (1-P(b_j)) is empty, i.e., we pass an empty list of larger_transition_times. Restrepo does violate this by occasionally referring to a_{n+1} = 1 and b_{m+1} = 1, but if we included either of these then 1 - P(a_{n+1}) = 1 - P(1) = 0, and this would make the definition of f^* zero, which would produce a non-distribution, so that can’t be right. This is one of those cases where, when reading a math paper, you have to infer the interpretation that is most sensible, and give the author the benefit of the doubt.

Following the rest of the equations is trivial, except in that we are solving for a_n which is a limit of integration. Since SymPy works symbolically, however, we can simply tell it to integrate without telling it a_n, and ask it to solve for a_n.

from sympy import Integral
from sympy import Symbol
from sympy.solvers import solve
from sympy.functions.elementary.miscellaneous import Max

def compute_a_n(prob_fun, alpha=0):
    P = prob_fun
    t = Symbol('t0', positive=True)
    a_n = Symbol('a_n', positive=True)

    a_n_integral = Integral(
        ((1 + alpha) - (1 - alpha) * P(t)) * f_star(P, t, []), (t, a_n, 1))
    a_n_integrated = a_n_integral.doit()   # yes now "do it"
    P_a_n_solutions = solve(a_n_integrated - 2 * (1 - alpha), P(a_n))
    P_a_n = Max(*P_a_n_solutions)
    print("P(a_n) = %s" % P_a_n)

    a_n_solutions = solve(P(a_n) - P_a_n, a_n)
    a_n_solutions_in_range = [soln for soln in a_n_solutions if 0 < soln <= 1]
    assert len(a_n_solutions_in_range) == 1
    a_n = a_n_solutions_in_range[0]
    print("a_n = %s" % a_n)

    h_n_integral = Integral(f_star(P, t, []), (t, a_n, 1))
    h_n_integrated = h_n_integral.doit()
    h_n = (1 - alpha) / h_n_integrated
    print("h_n = %s" % h_n)

    return (a_n, h_n)

There are three phases here. First, we integrate and solve for a_n (blindly according to equation 27 in the paper). If you work out this integral by hand (expanding f^*), you’ll notice it looks like P'(t)/P^2(t), which suggests a natural substitution, u=P(t). After computing the integral (entering phase 2), we can maintain that substitution to first solve for P(a_n), say the output of that is some known number z which in the code we call P_a_n, and then solve P(a_n) = z for a_n. Since that last equation can have multiple solutions, we pick the one between 0 and 1. Since P(t) must be increasing in [0,1], that guarantees uniqueness.
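For concreteness, here is the first phase worked by hand with that substitution. It is a sketch under two assumptions: \alpha = 0, so the right-hand side is the 2(1-\alpha) = 2 that the code subtracts, and P(1) = 1, which is the “guaranteed hit” at time 1. Writing u_n = P(a_n):

```latex
\int_{a_n}^{1} \bigl(1 - P(t)\bigr)\frac{P'(t)}{P^3(t)}\,dt
  = \int_{u_n}^{1} \frac{1-u}{u^3}\,du
  = \frac{1}{2u_n^2} - \frac{1}{u_n} + \frac{1}{2}

\frac{1}{2u_n^2} - \frac{1}{u_n} + \frac{1}{2} = 2
  \;\Longrightarrow\; 3u_n^2 + 2u_n - 1 = 0
  \;\Longrightarrow\; (3u_n - 1)(u_n + 1) = 0
  \;\Longrightarrow\; u_n = \tfrac{1}{3}
```

The negative root is discarded because u_n is a probability. This agrees with the P(a_n) = 1/3 that the program prints below.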

Note, we didn’t need to maintain the substitution in the integral, and perform a second solve. We could just tell SymPy to solve directly for a_n, and it would solve P(a_n) = z in addition to computing the integral. But as I was writing, it was helpful for me to check my work in terms of the substitution. In the next post we’ll clean that code up a bit.

Finally, in the third phase we compute the h_n, which is a normalizing factor that ultimately ensures the probability distribution for the action in this interval has total probability mass 1.

The steps to compute the iterated lower action parameters (a_i for 1 \leq i < n) are similar, but the formulas are slightly different:
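Read off from the code later in this section, each step solves the two equations below. This is a transcription of the implementation for the symmetric case, not a quotation of the paper, so the paper's own statement may differ in form. Here f^* uses the product over the already-computed transition times a_j with j > i.

```latex
\int_{a_i}^{a_{i+1}} \bigl(1 - P(t)\bigr) f^*(t)\,dt = \frac{1}{h_{i+1}}
\qquad\text{and}\qquad
h_i = \left( \int_{a_i}^{a_{i+1}} f^*(t)\,dt \right)^{-1}
```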

Note that the last action instant a_{i+1} and its normalizing constant h_{i+1} show up in the equation to compute a_i. In code, this is largely the same as the compute_a_n function, but in a loop. Along the way, we print some helpful diagnostics for demonstration purposes. These should end up as unit tests, but as I write the code for the first time I prefer to debug this way. I’m not even sure if I understand the construction well enough to do the math myself and write down unit tests that make sense; the first time I tried I misread the definition of f^* and I filled pages with confounding and confusing integrals!

from collections import deque

def compute_as_and_bs(prob_fun, n, alpha=0):
    '''Compute the a's and b's for the symmetric silent duel.'''
    P = prob_fun
    t = Symbol('t0', positive=True)

    a_n, h_n = compute_a_n(prob_fun, alpha=alpha)

    normalizing_constants = deque([h_n])
    transitions = deque([a_n])
    f_star_products = deque([1, 1 - P(a_n)])

    for step in range(n):
        # prepending new a's and h's to the front of the list
        last_a = transitions[0]
        last_h = normalizing_constants[0]
        next_a = Symbol('a', positive=True)

        next_a_integral = Integral(
            (1 - P(t)) * f_star(P, t, transitions), (t, next_a, last_a))
        next_a_integrated = next_a_integral.doit()
        # print("%s" % next_a_integrated)
        P_next_a_solutions = solve(next_a_integrated - 1 / last_h, P(next_a))
        print("P(a_{n-%d}) is one of %s" % (step + 1, P_next_a_solutions))
        P_next_a = Max(*P_next_a_solutions)

        next_a_solutions = solve(P(next_a) - P_next_a, next_a)
        next_a_solutions_in_range = [
            soln for soln in next_a_solutions if 0 < soln <= 1]
        assert len(next_a_solutions_in_range) == 1
        next_a_soln = next_a_solutions_in_range[0]
        print("a_{n-%d} = %s" % (step + 1, next_a_soln))

        next_h_integral = Integral(
            f_star(P, t, transitions), (t, next_a_soln, last_a))
        next_h = 1 / next_h_integral.doit()
        print("h_{n-%d} = %s" % (step + 1, next_h))

        print("dF_{n-%d} coeff = %s" % (step + 1, next_h * f_star_products[-1]))

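        f_star_products.append(f_star_products[-1] * (1 - P_next_a))
        # Prepend so the next iteration sees the newest a and h first.
        transitions.appendleft(next_a_soln)
        normalizing_constants.appendleft(next_h)

    return transitions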

Finally, we can run it with the simplest possible probability function: P(t) = t

from sympy import Lambda

x = Symbol('x')
compute_as_and_bs(Lambda((x,), x), 3)

The output is

P(a_n) = 1/3
a_n = 1/3
h_n = 1/4
P(a_{n-1}) is one of [1/5]
a_{n-1} = 1/5
h_{n-1} = 3/16
dF_{n-1} coeff = 1/8
P(a_{n-2}) is one of [1/7]
a_{n-2} = 1/7
h_{n-2} = 5/32
dF_{n-2} coeff = 1/12
P(a_{n-3}) is one of [1/9]
a_{n-3} = 1/9
h_{n-3} = 35/256
dF_{n-3} coeff = 1/16

This matches up so far with Restrepo’s example, since P(a_{n-k}) = a_{n-k} = 1/(2k+3) gives 1/3, 1/5, 1/7, 1/9, \dots. Since we have the normalizing constants h_i, we can also verify the probability distribution for each action aligns with Restrepo’s example. The constant in the point mass function is supposed to be h_i \prod_{j={i+1}}^n (1-P(a_j)). This is what I printed out as dF_{n-k}. In Restrepo’s example, this is expected to be \frac{1}{4(k+1)}, which is exactly what’s printed out.
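The comparison above is plain arithmetic, so it can be double-checked with exact fractions. The values below are copied from the printed output for P(t) = t; the helper name `density_coefficient` is mine, for illustration only.

```python
from fractions import Fraction

# Values copied from the printed output for P(t) = t, indexed by k = 0, 1, 2, 3
# (k = 0 is a_n itself, k = 1 is a_{n-1}, and so on).
a = [Fraction(1, 3), Fraction(1, 5), Fraction(1, 7), Fraction(1, 9)]
h = [Fraction(1, 4), Fraction(3, 16), Fraction(5, 32), Fraction(35, 256)]


def density_coefficient(k):
    """h_{n-k} times the product of (1 - P(a_j)) over the later transition times."""
    product = Fraction(1)
    for later in a[:k]:
        product *= 1 - later  # P(t) = t, so P(a_j) = a_j
    return h[k] * product


for k in range(4):
    assert a[k] == Fraction(1, 2 * k + 3)
    # k = 0 has an empty product, and h_n = 1/4 fits the same pattern.
    assert density_coefficient(k) == Fraction(1, 4 * (k + 1))
    print(k, a[k], density_coefficient(k))
```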

Another example using P(t) = t^3:

P(a_n) = 1/3
a_n = 3**(2/3)/3
h_n = 1/4
P(a_{n-1}) is one of [-1/3, 1/5]
a_{n-1} = 5**(2/3)/5
h_{n-1} = 3/16
dF_{n-1} coeff = 1/8
P(a_{n-2}) is one of [-1/5, 1/7]
a_{n-2} = 7**(2/3)/7
h_{n-2} = 5/32
dF_{n-2} coeff = 1/12
P(a_{n-3}) is one of [-1/7, 1/9]
a_{n-3} = 3**(1/3)/3
h_{n-3} = 35/256
dF_{n-3} coeff = 1/16

One thing to notice here is that the normalizing constants don’t appear to depend on the distribution. Is this a coincidence for this example, or a pattern? I’m not entirely sure.
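Here is a sketch of why it might be a pattern, at least for the symmetric game as implemented in this post. It reuses the substitution u = P(t) from earlier and assumes P is differentiable and increasing with P(1) = 1. Writing u_i = P(a_i), the two integrals in each step become:

```latex
\int_{a_i}^{a_{i+1}} \bigl(1 - P(t)\bigr) f^*(t)\,dt
  = \prod_{j > i} (1 - u_j) \int_{u_i}^{u_{i+1}} \frac{1-u}{u^3}\,du

\int_{a_i}^{a_{i+1}} f^*(t)\,dt
  = \prod_{j > i} (1 - u_j) \int_{u_i}^{u_{i+1}} \frac{du}{u^3}
```

Neither right-hand side mentions P except through the values u_j, so the equations that determine each u_i and h_i are the same for every such P. Only the final step, a_i = P^{-1}(u_i), depends on the distribution. That would explain why both examples print the same P(a_i) and h_i values. I have not checked whether anything similar survives in the non-symmetric game.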

Next time we’ll rewrite the code from this post so that it can be used to compute the generic (non-symmetric) solution, see what they can tell us, and from there we’ll start diving into the propositions and proofs.

Until next time!

Math Versus Dirty Data

At Google, our organization designs, owns, and maintains a number of optimization models that automate the planning of Google’s datacenter growth and health. As is pretty standard in supply chain optimization and planning, these models are often integer linear programs. It’s a core competency of operations research, after all.

One might think, “Large optimization problems? That sounds hard!” But it’s actually far from the hardest part of the job. In fact, it’s one of the few exciting parts of the job. The real hard part is getting data. Really, it’s that you get promised data that never materializes, and then you architect your system for features that rot before they ripen.

There’s a classic image of a human acting as if they’re throwing a ball for a dog, and the dog sprints off, only soon to realize the ball was never thrown. The ball is the promise of freshly maintained data, and recently I’ve been the dog.

When you don’t have good data, or you have data that’s bad in a known way, you can always try to design your model to accommodate the deficiencies. As long as it’s clearly defined, it’s not beyond our reach. The math is fun and challenging, and I don’t want to shy away from it. My mathematician’s instinct pulls me left.

My instincts as an engineer pull me right: data issues will ultimately cause unexpected edge cases at the worst moment, and it will fall on me to spend all day debugging for a deadline tomorrow. Data issues lead to more complicated modeling features which further interact with other parts of the model and the broader system in confounding ways. Worst of all, it’s nearly impossible to signal problems to customers who depend on your output. When technical debt is baked into an optimization model as features, it makes explanation much harder. Accepting bad data also requires you write the code in a way that is easy to audit, since you need to audit literally everything. Transparency is good, but it’s tedious to do it generically well, and the returns are not worth it if the end result is, “well we can’t fix the data for two years anyway.”

Though a lot of this technical debt was introduced by predecessors who left the team, I’ve fallen for the mathematical siren’s call a few times. Go on, just add that slick new constraint. Just mask that misbehavior with a heuristic. It’s bit back hard and caused months of drag.

These days I’m swinging hard right on the pendulum. Delete half-implemented features that don’t have data to support them. Delete features that don’t have a clear connection to business needs (even if they work). Push back on new feature requests until the data exists. Require a point of contact and an SLO for any data you don’t own. Make speculative features easy to turn on/off (or remove!) without having to redesign the architecture. If it can’t be made easy to remove, don’t add it until you’re sure it will survive.

If you can’t evade bad data, err on the side of strict initial validation, and doing nothing or gracefully degrading service when validation fails. Expose the failures as alerts to the people who own the data (not you), and give the data owners a tool that repeats the validation logic in your system verbatim, so there is no confusion on the criteria for success. When you have this view, almost all of the complexity in your system lies in enabling this generic auditing, alerting, and managing of intricate (but ultimately arbitrary) policy.
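As a rough illustration of that shape, here is a sketch in which the planner and the data owners' tool call the same validation function. Every name here (`validate`, `run_planner`, `owner_check`, the `capacity` field) is hypothetical and invented for this example; it is not a description of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    row: int
    reason: str


def validate(rows):
    """Single source of truth for validation; returns a list of Failure."""
    failures = []
    for i, row in enumerate(rows):
        if row.get("capacity") is None:
            failures.append(Failure(i, "missing capacity"))
        elif row["capacity"] < 0:
            failures.append(Failure(i, "negative capacity"))
    return failures


def run_planner(rows, previous_plan, alert):
    """Validate strictly up front; on failure, alert and keep the last good plan."""
    failures = validate(rows)
    if failures:
        alert(failures)        # routed to the data owners, not to us
        return previous_plan   # graceful degradation: change nothing
    return {"total_capacity": sum(r["capacity"] for r in rows)}


def owner_check(rows):
    """Tool for data owners: runs the exact same validate() as the planner."""
    return [f"row {f.row}: {f.reason}" for f in validate(rows)]
```

Because `owner_check` and `run_planner` share `validate`, the owners see the same criteria the system enforces.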

I like to joke that I don’t have data-intensive applications or problems of scale, but rather policy-intensive applications. I haven’t found much insight from other software engineers about how to design and maintain policy-intensive software. Let me know if you have some! The obvious first step is to turn policy code into data. To the extent that we’ve done this, I adore that aspect of our systems. Still, you can’t avoid it when policies need to be encoded in an optimization model.

I do get sad that so much of my time is spent poop-smithing, as I like to say, even though we’re gradually getting better. Our systems need maintenance and care, and strong principles to keep the thicket from overwhelming us. For one, I track net lines of code added, with the goal to have it be net negative month over month, new features and all. We’ve kept it up for six months so far. Even our fixit week this week seems unnecessary, given how well our team has internalized paying off technical debt.

Though I do wonder what it’s all for. So Google can funnel the money it saves on datacenter costs into informing people the Earth is flat (er, cat videos)? If I didn’t have two particular internal side projects to look forward to (they involve topics I’m very interested in), I’d be bored, and I might succumb to jaded feelings, and I’d need a change. Certain perks and particularly enjoyable colleagues help avoid that. But still, I rarely have time to work on the stimulating projects, and even the teammates I’ve been delegating to often defer it to other priorities.

We let dirty data interfere with our design and architecture, now we’re paying back all that technical debt, and as a consequence there’s no time for our human flourishing.

I should open a math cafe.

A Working Mathematician’s Guide to Parsing

Our hero, a mathematician, is writing notes in LaTeX and needs to convert it to a format that her blog platform accepts. She’s used to using dollar sign delimiters for math mode, but her blog requires \( \) and \[ \]. Find-and-replace fails because it doesn’t know which dollar sign is the start and which is the end. She knows there’s some computer stuff out there that could help, but she doesn’t have the damn time to sort through it all. Paper deadlines, argh!

If you want to skip ahead to a working solution, and you can run basic python scripts, see the Github repository for this post with details on how to run it and what the output looks like. It’s about 30 lines of code, and maybe 10 of those lines are not obvious. Alternatively, copy/paste the code posted inline in the section “First attempt: regular expressions.”

In this post I’ll guide our hero through the world of text manipulation, explain the options for solving a problem like this, and finally explain how to build the program in the repository from scratch. This article assumes you have access to a machine that has basic programming tools pre-installed on it, such as python and perl.

The problem with LaTeX

LaTeX is great, don’t get me wrong, but people who don’t have experience writing computer programs that operate on human input tend to write sloppy LaTeX. They don’t anticipate that they might need to programmatically modify the file because the need was never there before. The fact that many LaTeX compilers are relatively forgiving with syntax errors exacerbates the issue.

The most common way to enter math mode is with dollar signs, as in

Now let $\varepsilon > 0$ and set $\delta = 1/\varepsilon$.

For math equations that must be offset, one often first learns to use double-dollar-signs, as in

First we claim that $$0 \to a \to b \to c \to 0$$ is a short exact sequence

The specific details that make it hard to find and convert from this delimiter type to another are:

  1. Math mode can be broken across lines, but need not be.
  2. A simple search and replace for $ would conflict with $$.
  3. The fact that the start and end are symmetric means a simple search and replace for $$ fails: you can’t tell whether to replace it with \[ or \] without knowing the context of where it occurs in the document.
  4. You can insert a dollar sign in LaTeX using \$ and it will not enter math mode. (I do not solve this problem in my program, but leave it as an exercise to the reader to modify each solution to support this)
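Problems (2) and (3) in the list above are easy to see in a few lines of Python. This is just a demonstration of the failure, using the example sentence from earlier.

```python
text = r"First we claim that $$0 \to a \to b \to c \to 0$$ is a short exact sequence"

# Problem (3): a plain replace cannot tell opening from closing delimiters,
# so the closing "$$" also becomes an opening bracket.
naive_offset = text.replace("$$", r"\[")
print(naive_offset)

# Problem (2): replacing single dollar signs first tears each "$$" in half.
naive_inline = text.replace("$", r"\(")
print(naive_inline)
```

In the first result there is no \] anywhere, and in the second every $$ has turned into \(\(.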

First attempt: regular expressions

The first thing most programmers will tell you when you have a text manipulation problem is to use regular expressions (or regex). Regular expressions are text patterns that a program called a regular expression engine uses to find subsets of text in a document. This can often be with the goal of modifying the matched text somehow, but also just to find places where the text occurs to generate a report.

In their basic form, regular expressions are based on a very clean theory called regular languages, which is a kind of grammar equivalent to “structure that can be recognized by a finite state machine.”

[Aside: some folks prefer to distinguish between regular expressions as implemented by software systems (regex) and regular expressions as a representation of a regular language; as it turns out, features added to regex engines make them strictly stronger than what can be represented by the theory of regular languages. In this post I will use “regex” and “regular expressions” both for practical implementations, because programmers and software don’t talk about the theory, and use “regular languages” for the CS theory concept]

The problem is that practical regular expressions are difficult and nit-picky, especially when there are exceptional cases to consider. Even matching something like a date can require a finicky expression that’s hard for humans to read and debug when it is incorrect. A regular expression for a line in a file that contains a date by itself:

^\s*(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\s*$
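To see what that pattern accepts and rejects, here is a small Python check. The sample lines are made up for illustration. Python's `re` module happens to accept this pattern unchanged, but other engines may differ in details.

```python
import re

DATE_LINE = re.compile(
    r"^\s*(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\s*$"
)

samples = [
    "  2019-04-20 ",         # date alone on the line, padded with spaces
    "1999/12/31",            # a different separator
    "2019-13-01",            # month 13 is rejected
    "published 2019-04-20",  # extra text before the date is rejected
]

for line in samples:
    print(repr(line), bool(DATE_LINE.match(line)))
```

Note that the pattern checks only the shape of the date: a line like 2019-02-31 still matches.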

Worse, regular expressions are rife with “gotchas” having to do with escape characters. For example, parentheses are used for something called “capturing”, so you have to use \( to insert a literal parenthesis. If having to use “\$” in LaTeX bothers you, you’ll hate regular expressions.

Another issue comes from history. There was a time when computers only allowed you to edit a file one line at a time. Many programming tools and frameworks were invented during this time that continue to be used today (you may have heard of sed which is a popular regular expression find/replace program—one I use almost daily). These tools struggle to operate on problems that span many lines of a file, because they simply weren’t designed for that. Problem (1) above suggests this might be a problem.

Yet another issue is in slight discrepancies between regex engines. Perl, python, sed, etc., all have slight variations and “nonstandard” features. As all programmers know, every visible behavior of a system will eventually be depended on by some other system.

But the real core problem is that regular expressions weren’t really designed for knowing about the context where a match occurs. Regular expressions are designed to be character-at-a-time pattern matching. [edit: removed an incorrect example] Over the years, regular expression engines have added features to do such things (which makes them more powerful than the original, formal definition of a regular language, and even more powerful than what parsers can handle!), but the more complicated you make a regular expression, the more likely it is to misbehave on odd inputs, and the less likely others can use it without bugs or modification for their particular use case. Software engineers care very much about such things, though mathematicians needing a one-off solution may not.

One redeeming feature of regular expressions is that—by virtue of being so widely used in industry—there are many tools to work with them. Every major programming language has a regular expression engine built in. And many websites help explain how regexes work. regexr.com is one I like to use. Here is an example of using that website to replace offset mathmode delimiters. Note the “Explain” button, which traces the regex engine as it looks for matches.

[Screenshot: regexr.com replacing offset mathmode delimiters]

So applying this to our problem: we can use two regular expressions to solve the problem. I’m using the perl programming language because its regex engine supports multiline matches. All MacOS and linux systems come with perl pre-installed.

 perl -0777 -pe 's/\$\$(.*?)\$\$/\\[\1\\]/gs' < test.tex | perl -0777 -pe 's/\$(.*?)\$/\\(\1\\)/gs' > output.tex

Now let’s explain, starting with the core expressions being matched and replaced.

s/X/Y/ tells a regex engine to “substitute regex matches of X with Y”. In the first regex X is \$\$(.*?)\$\$, which breaks down as

  1. \$\$ match two literal dollar signs
  2. ( capture a group of characters signified by the regex between here and the matching closing parenthesis
    1. .* zero or more of any character
    2. ? looking at the “zero or more” in the previous step, try to match as few possible characters as you can while still making this pattern successfully match something
  3. ) stop the capture group, and save it as group 1
  4. \$\$ match two more literal dollar signs
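The question mark in step 2 matters more than it looks. Here is a small sketch in Python's re module (the sample sentence is made up) comparing the greedy .* with the lazy .*? on a line that has two offset equations.

```python
import re

text = "First $$a+b$$ and then $$c+d$$."

# Greedy: .* grabs as much as possible, so it runs to the LAST pair of dollars.
greedy = re.findall(r"\$\$(.*)\$\$", text)
# Lazy: .*? grabs as little as possible, so each equation is matched separately.
lazy = re.findall(r"\$\$(.*?)\$\$", text)

print(greedy)
print(lazy)
```

The greedy version swallows everything between the first opening delimiter and the final closing one as a single "equation", while the lazy version finds the two equations we actually wrote.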

Then Y is the chosen replacement. We’re processing offset mathmode, so we want \[ \]. Y is \\[\1\\], which means

  1. \\ a literal backslash
  2. [ a literal open bracket
  3. \1 the first capture group from the matched expression
  4. \\ a literal backslash
  5. ] a literal close bracket

All together we have s/\$\$(.*?)\$\$/\\[\1\\]/, but then we add the final s and g characters, which act as configuration. The “s” tells the regex engine to allow the dot . to match newlines (so a pattern can span multiple lines) and the “g” tells the regex to apply the substitution globally to every match it sees, as opposed to just the first.
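If you'd rather experiment in Python, here is a rough analogue of the two flags, as a sketch. In Python's re.sub the "s" behavior is the re.DOTALL flag, and substitution is global by default, so I imitate a missing "g" with count=1. The sample text is made up.

```python
import re

text = "$$a\n+b$$ then $$c$$"
pattern = r"\$\$(.*?)\$\$"
replacement = r"\\[\1\\]"

# Like /gs: dot matches newlines, and every match is replaced.
both_flags = re.sub(pattern, replacement, text, flags=re.DOTALL)
# Like /s alone: only the first match is replaced.
first_only = re.sub(pattern, replacement, text, count=1, flags=re.DOTALL)
# Like /g alone: dot cannot cross the newline, so the delimiters get mis-paired.
without_dotall = re.sub(pattern, replacement, text)

print(repr(both_flags))
print(repr(first_only))
print(repr(without_dotall))
```

The last case is the nasty one. Without the "s" behavior the first equation can't be matched, so the engine pairs its closing $$ with the opening $$ of the second equation and wraps the words in between.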

Finally, the full first command is

perl -0777 -pe 's/\$\$(.*?)\$\$/\\[\1\\]/gs' < test.tex 

This tells perl to read in the entire test.tex file and apply the regex to it. Broken down:

  1. perl run perl
  2. -0777 read the entire file into one string. If you omit it, perl will apply the regex to each line separately.
  3. -p will make perl automatically “read input and print output” without having to tell it to with a “print” statement
  4. -e tells perl to run the following command line argument as a program (here it is combined with -p as -pe).
  5. < test.tex tells the shell to feed the file test.tex to perl as standard input (as input to the regex engine, in this case).
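To see why reading the whole file at once matters, here is a sketch in Python that imitates both behaviors: applying the substitution to each line separately, and applying it to the entire document as one string. The sample document is made up, and this only mimics what the perl switch does.

```python
import re

PATTERN = re.compile(r"\$\$(.*?)\$\$", re.DOTALL)
REPLACEMENT = r"\\[\1\\]"

document = "Before\n$$\nx = 1\n$$\nAfter\n"

# Line at a time: no single line contains both delimiters, so nothing matches.
line_by_line = "".join(
    PATTERN.sub(REPLACEMENT, line) for line in document.splitlines(keepends=True)
)
# Whole document as one string: the match can span the three middle lines.
whole_file = PATTERN.sub(REPLACEMENT, document)

print(repr(line_by_line))
print(repr(whole_file))
```

The line-at-a-time version returns the document unchanged, which is the failure mode of the older tools mentioned earlier.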

Then we pipe the output of this first perl command to a second one that does a very similar replacement for inline math mode.

<first_perl_command> | perl -0777 -pe 's/\$(.*?)\$/\\(\1\\)/gs'

The vertical bar | tells the shell executing the commands to take the output of the first program and feed it as input to the second program, allowing us to chain together sequences of programs. The second command does the same thing as the first, but replacing $ with \( and \). Note, it was crucial that this second program run after the offset mathmode regex, since the inline pattern would otherwise match each $$ as an empty inline expression.
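Here is the same two-stage pipeline sketched in Python, mostly so you can see what goes wrong when the stages are swapped. The function names and the sample sentence are mine.

```python
import re

def convert_offset(text):
    """Replace $$...$$ with \\[...\\]."""
    return re.sub(r"\$\$(.*?)\$\$", r"\\[\1\\]", text, flags=re.DOTALL)

def convert_inline(text):
    """Replace $...$ with \\(...\\)."""
    return re.sub(r"\$(.*?)\$", r"\\(\1\\)", text, flags=re.DOTALL)

text = "Let $x=0$ and $$y=1$$."

right_order = convert_inline(convert_offset(text))
wrong_order = convert_offset(convert_inline(text))

print(right_order)
print(wrong_order)
```

In the wrong order, the inline regex treats each $$ as an inline equation with nothing inside it, so the offset equation ends up surrounded by two empty \(\) pairs and the offset stage has nothing left to match.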

Exercise: Adapt this solution to support Problem (4), support for literal \$ dollar signs. Hint: you can either try to upgrade the regular expression to not be tricked into thinking \$ is a delimiter, or you can add extra programs before that prevent \$ from being a problem. Warning: this exercise may cause a fit.

It can feel like a herculean accomplishment to successfully apply regular expressions to a problem. You can appreciate the now-classic programmer joke from the webcomic xkcd:

[Comic: xkcd, “Regular Expressions”]

However, as you can tell, getting regular expressions right is hard and takes practice. It’s great when someone else solves your problem exactly, and you can copy/paste for a one-time fix. But debugging regular expressions that don’t quite work can be excruciating. There is another way!

Second attempt: using a parser generator

While regular expressions have a clean theory of “character by character stateless processing”, they are limited. It’s not possible to express the concept of “memory” in a regular expression, and the simplest example of this is the problem of counting. Suppose you want to find strings that constitute valid, balanced parentheticals. E.g., this is balanced:

(hello (()there)() wat)

But this is not

(hello ((there )() wat)

This is impossible for regexes to handle because counting the opening parentheses is required to match the closing parens, and regexes can’t count arbitrarily high. If you want to parse and manipulate structures like this, that have balance and nesting, regex will only bring you heartache.
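By contrast, the check is a few lines in any language where you can keep a counter. A minimal sketch in Python (the function name is mine):

```python
def is_balanced(text):
    """Return True if every ( has a matching ) and nothing closes too early."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared with nothing open.
                return False
    return depth == 0

print(is_balanced("(hello (()there)() wat)"))
print(is_balanced("(hello ((there )() wat)"))
```

The variable depth is exactly the piece of memory a classical regular expression lacks: it can grow as large as the input demands.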

The next level up from regular expressions and regular languages are the two equivalent theories of context-free grammars and pushdown automata. A pushdown automaton is literally a regular expression (a finite state machine) equipped with a simple kind of memory called a stack. Rather than dwell on the mechanics, we’ll see how context-free grammars work, since if you can express your document as a context-free grammar, a tool called a parser generator will give you a parsing program for free. Then a few simple lines of code allow you to manipulate the parsed representation, and produce the output document.

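The standard (abstract) notation of a context-free grammar is called Extended Backus-Naur Form (EBNF). It’s a “metasyntax”, i.e., a syntax for describing syntax. In EBNF, you describe rules and terminals. Terminals are sequences of constant patterns, like

OFFSET_DOLLAR_DELIMITER = $$
OFFSET_LEFT_DELIMITER = \[
OFFSET_RIGHT_DELIMITER = \]

A rule is an “or” of sequences of other rules or terminals. It’s much easier to show an example:

char = "a" | "b" | "c"

offset = OFFSET_DOLLAR_DELIMITER char OFFSET_DOLLAR_DELIMITER
       | OFFSET_LEFT_DELIMITER char OFFSET_RIGHT_DELIMITER
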
The above describes the structure of any string that looks like offset math mode, but with a single “a” or a single “b” or a single “c” inside, e.g., “\[b\]”. You can see some more complete examples on Wikipedia, though they use a slightly different notation.
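A grammar this small can be recognized by hand, which may help make the rule-and-terminal idea concrete. The sketch below assumes the offset delimiters are either a pair of $$ or the pair \[ and \], and the names CHARS, DELIMITER_PAIRS, and matches_offset are mine.

```python
# char = "a" | "b" | "c"
CHARS = {"a", "b", "c"}

# Each pair is (opening terminal, closing terminal) for offset math mode.
DELIMITER_PAIRS = [("$$", "$$"), ("\\[", "\\]")]

def matches_offset(s):
    """Return True if s is an opening delimiter, one char, and its closing delimiter."""
    for opener, closer in DELIMITER_PAIRS:
        if s.startswith(opener) and s.endswith(closer):
            inner = s[len(opener):len(s) - len(closer)]
            if inner in CHARS:
                return True
    return False

for s in ["\\[b\\]", "$$a$$", "\\[ab\\]", "\\[b$$", "$$"]:
    print(s, matches_offset(s))
```

A parser generator writes this kind of code for you, and it keeps working when the grammar grows beyond what you would want to maintain by hand.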

With some help from a practical library’s built-in identifiers for things like “arbitrary text” we can build a grammar that covers all of the ways to do latex math mode.

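latex = content
content = content mathmode content | TEXT | EMPTY

mathmode = OFFSET_DOLLAR_DELIMITER TEXT OFFSET_DOLLAR_DELIMITER
         | OFFSET_LEFT_DELIMITER TEXT OFFSET_RIGHT_DELIMITER
         | INLINE_DOLLAR_DELIMITER TEXT INLINE_DOLLAR_DELIMITER
         | INLINE_LEFT_DELIMITER TEXT INLINE_RIGHT_DELIMITER
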
Here we’re taking advantage of the fact that we can’t nest mathmode inside of mathmode in LaTeX (you probably can, but I’ve never seen it), by defining the mathmode rule to contain only text, and not other instances of the “content” rule. This rules out some ambiguities, such as whether “$x$ y $z$” is a nested mathmode or not.

We may not need the counting powers of context-free grammars, yet EBNF is easier to manage than regular expressions. You can apply context-sensitive rules to matches, whereas with regexes that would require coordination between separate passes. The order of operations is less sensitive; because the parser generator knows about all patterns you want to match in advance, it will match longer terminals before shorter—more ambiguous—terminals. And if we wanted to do operations on all four kinds of math mode, this allows us to do so without complicated chains of regular expressions.
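To illustrate the point about longer terminals winning, here is a sketch of a tiny hand-written tokenizer in Python. This is not how a parser generator works internally. A regex alternation tries its options left to right, so I have to imitate "longest first" by listing $$ before $ myself. The token names OFFSETDOLLAR and INLINE mirror the delimiter types used in this post, and TEXT is a name I made up.

```python
import re

# Longer terminal listed first, so "$$" is read as one OFFSETDOLLAR token.
RIGHT_ORDER = re.compile(r"(?P<OFFSETDOLLAR>\$\$)|(?P<INLINE>\$)|(?P<TEXT>[^$]+)")
# Shorter terminal listed first, so "$$" is misread as two INLINE tokens.
WRONG_ORDER = re.compile(r"(?P<INLINE>\$)|(?P<OFFSETDOLLAR>\$\$)|(?P<TEXT>[^$]+)")

def tokenize(text, token_regex=RIGHT_ORDER):
    """Split text into (token type, token text) pairs."""
    return [(m.lastgroup, m.group()) for m in token_regex.finditer(text)]

print(tokenize("$$y$$ and $x$"))
print(tokenize("$$y$$", WRONG_ORDER))
```

With a parser generator you list the terminals in whatever order reads best and let the tool sort out which one applies.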

The history of parsers is long and storied, and the theory of generating parsing programs from specifications like EBNF is basically considered a solved problem. However, there are a lot of parser generators out there. And, like regular expression engines, they each have their own flavor of EBNF (or, as is more popular nowadays, they have you write your EBNF using the features of the language the parser generator is written in). And finally, a downside of using a parser generator is that you then have to write a program to operate on the parsed representation (which also differs by implementation).

We’ll demonstrate this process by using a Python library that, in my opinion, stays pretty faithful to the EBNF heritage. It’s called lark and you can pip-install it as

pip install lark-parser

Note: the hard-core industry standard parser generators are antlr, lex, and yacc. I would not recommend them for small parsing jobs, but if you’re going to do this as part of a company, they are weighty, weathered, well-documented—unlike lark.

Lark is used entirely inside python, and you specify the EBNF-like grammar as a string. For example, ours is

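tex: content+

?content: mathmode | text+

mathmode: OFFSETDOLLAR text+ OFFSETDOLLAR
        | OFFSETOPEN text+ OFFSETCLOSE
        | INLINEOPEN text+ INLINECLOSE
        | INLINE text+ INLINE

INLINE: "$"
INLINEOPEN: "\\("
INLINECLOSE: "\\)"
OFFSETDOLLAR: "$$"
OFFSETOPEN: "\\["
OFFSETCLOSE: "\\]"

?text: /./s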

You can see the similarities with our “raw” EBNF. The main differences here are the use of + for matching “one or more” of a rule, and the use of a regular expression to define the “text” rule as any character (here again the trailing “s” means: allow the dot character to match newline characters). The backslashes are needed because backslash is an escape character in Python. And finally, the question mark tells lark to try to compress the tree if it only matches one item (you can see what the difference is by playing with our display-parsed-tree.py script, which shows the parsed representation of the input document). You can read more in lark’s documentation about what the structure of the parsed tree is as python objects (Tree for rule/terminal matches and Token for individual characters).

For the input “Let $x=0$”, the parsed tree is as follows (note that the ? makes lark collapse the many “text” matches):

Tree(tex,
  [Tree(content,
    [Token(__ANON_0, 'L'),
     Token(__ANON_0, 'e'),
     Token(__ANON_0, 't'),
     Token(__ANON_0, ' ')]),
   Tree(mathmode,
    [Token(INLINE, '$'),
     Token(__ANON_0, 'x'),
     Token(__ANON_0, '='),
     Token(__ANON_0, '0'),
     Token(INLINE, '$')]),
   Token(__ANON_0, '\n')])

So now we can write a simple python program that traverses this tree and converts the delimiters. The entire program is on GitHub, but the core is

def join_tokens(tokens):
    return ''.join(x.value for x in tokens)

def handle_mathmode(tree_node):
    '''Switch on the different types of math mode, and convert
       the delimiters to the desired output, and then concatenate the
       text between.'''
    starting_delimiter = tree_node.children[0].type

    if starting_delimiter in ['INLINE', 'INLINEOPEN']:
        return '\\(' + join_tokens(tree_node.children[1:-1]) + '\\)'
    elif starting_delimiter in ['OFFSETDOLLAR', 'OFFSETOPEN']:
        return '\\[' + join_tokens(tree_node.children[1:-1]) + '\\]'
    else:
        raise Exception("Unsupported mathmode type %s" % starting_delimiter)

def handle_content(tree_node):
    '''Each child is a Token node whose text we'd like to concatenate.'''
    return join_tokens(tree_node.children)
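If you want to poke at these handlers without installing anything, here is a sketch that runs them on a hand-built tree for the input “Let $x=0$”. The Token and Tree classes below are hypothetical stand-ins that carry only the attributes the handlers read (lark's real classes have more to them), and the convert and text_tokens helpers are mine, not part of the program on GitHub.

```python
from collections import namedtuple

# Hypothetical stand-ins for lark's Token and Tree (assumption: only the
# .type/.value and .data/.children attributes are needed here).
Token = namedtuple("Token", ["type", "value"])
Tree = namedtuple("Tree", ["data", "children"])

def join_tokens(tokens):
    return ''.join(x.value for x in tokens)

def handle_mathmode(tree_node):
    # The first child is the opening delimiter; its type picks the output style.
    starting_delimiter = tree_node.children[0].type
    if starting_delimiter in ['INLINE', 'INLINEOPEN']:
        return '\\(' + join_tokens(tree_node.children[1:-1]) + '\\)'
    elif starting_delimiter in ['OFFSETDOLLAR', 'OFFSETOPEN']:
        return '\\[' + join_tokens(tree_node.children[1:-1]) + '\\]'
    else:
        raise Exception("Unsupported mathmode type %s" % starting_delimiter)

def handle_content(tree_node):
    return join_tokens(tree_node.children)

def convert(tree):
    """Hypothetical driver: dispatch on each top-level child of the tex node."""
    pieces = []
    for child in tree.children:
        if isinstance(child, Token):
            pieces.append(child.value)       # a lone collapsed character
        elif child.data == 'mathmode':
            pieces.append(handle_mathmode(child))
        else:
            pieces.append(handle_content(child))
    return ''.join(pieces)

def text_tokens(s):
    """Hypothetical helper: one anonymous token per character."""
    return [Token('__ANON_0', ch) for ch in s]

tree = Tree('tex', [
    Tree('content', text_tokens('Let ')),
    Tree('mathmode',
         [Token('INLINE', '$')] + text_tokens('x=0') + [Token('INLINE', '$')]),
    Token('__ANON_0', '\n'),
])
print(convert(tree))
```

The point of the design is that each handler only needs to know about one kind of node, so supporting a new kind of math mode means adding a branch rather than rethinking a chain of regexes.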

The rest of the program uses lark to create the parser, reads the file from standard input, processes the parsed representation, and outputs the converted document to standard output. You can use the program like this:

python convert-delimiters.py < input.tex > output.tex

Exercise: extend this grammar to support literal dollar signs using \$, and pass them through to the output document unchanged.

What’s better?

I personally prefer regular expressions when the job is quick. If my text manipulation rule fits on one line, or can be expressed without requiring “look ahead” or “look behind” rules, regex is a winner. It’s also a winner when I only expect it to fail in a few exceptional cases that can easily be detected and fixed by hand. It’s faster to write a scrappy regex, and then open the output in a text editor and manually fix one or two mishaps, than it is to write a parser.

However, the longer I spend on a regular expression problem, and the more frustrated I get wrestling with it, the more I start to think I should have used a parser all along. This is especially true when dealing with massive jobs, such as converting delimiters in hundreds of blog articles, each thousands of words long, or making changes across all chapter files of a book.

When I need something in between rigid structure and quick-and-dirty, I actually turn to vim. Vim has this fantastic philosophy of “act, repeat, rewind” wherein you find an edit that applies to the thing you want to change, then you search for the next occurrence of the start, try to apply the change again, visually confirm it does the right thing, and if not go back and correct it manually. Learning vim is a major endeavor (for me it feels lifelong, as I’m always learning new things), but since I spend most of my working hours editing structured text, the investment and philosophy have paid off.

Until next time!