The Communicative Value of Using Git Well

Recently my employer (Google) forced me to switch to Mercurial instead of my usual version control system, git. The process of switching sparked a few discussions between me and my colleagues about the value of various version control systems. A question like “what benefit does git provide over Mercurial” yielded no clear answers, suggesting many developers don’t know. An informal Twitter survey didn’t refute this claim.

A distinctive value git provides me is the ability to sculpt code changes into stories. It does this by:

  1. Allowing changes to be as granular as possible.
  2. Providing good tools for manipulating changes.
  3. Treating the change history as a first-class object that can be manipulated.
  4. Making branching cheap, simple, and transparent.

This might be summed up by some as “git lets you rewrite history,” but to me it’s much more. Working with code is nonlinear by nature, which makes changes hard to communicate well. Wielding git well lets me draft, edit, decompose, mix, and recombine changes with ease. Thus, I can narrate a large change in a way that’s easy to review, reduces review cycle time, and makes hindsight clear. Each pull request is a play, each commit a scene, each hunk a line of dialogue or stage direction, and git is a director’s tool.

The rest of this article will expand on these ideas. For those interested in learning more about git from a technical perspective, I enjoyed John Wiegley’s Git from the Bottom Up. That short book makes rigorous many of the terms I use more loosely here, but basic git familiarity is all that’s required to understand the gist of this article. My philosophy of using git is surely unoriginal. It is no doubt influenced by the engineers I’ve worked with and the things I’ve read. At best, my thoughts here refine and incrementally expand upon what I’ve picked up from others.

If Lisp’s great insight was that code is data, and that programmers can take advantage of this with metaprogramming, then git’s great insight is that code changes are data, and programmers can take advantage of that with metachanges. Changes are the data you produce while working on features and bug fixes. Metachanges are the changes you make to your changes to ready them for review. Embracing metachanges enables better cleanliness and clearer communication. Git supports metachanges with few limits, and without sacrificing flexibility.

For instance, if you want, you can treat git like a replacement for Dropbox. You keep a single default branch, you do git pull, edit code, and run git add --all && git commit -m "do stuff" && git push. This saves all your work and pushes it to the server. You could even alias this to git save. I admit that I basically do this for projects of no real importance.

Such sloppy usage violates my personal philosophy of git. That is, you should tell a clear story with your commits. It’s not just that a commit message like “do stuff” is useless, it’s that the entire unit of work is smashed into one change and can’t easily be teased apart or understood incrementally.

This is problematic for code review, which is a crucial part of software development. The cost and cognitive burden of a unit of code review scales superlinearly with the amount of code to review (citation needed, but this is my personal experience). However, sometimes large code reviews are necessary. Large refactors, extensive testing scenarios, and complex features often cannot be split into distinct changesets or pull requests. In addition, most continuous integration frameworks require that after every merge of a changeset or pull request, all tests pass and the product is deployable. That means you can’t submit an isolated changeset that causes tests to fail or performs partial refactors without doing more work and introducing more opportunities to make mistakes.

In light of this, I want to reduce the review burden for my reviewers, and encourage people I’m reviewing for to reduce the burden for me. This is a human toll. The best way to help someone understand a complex change is to break it into the smallest possible reasonable, meaningful units of change, and then compose the pieces together in a logical way. That is, to tell a story.

Git enables this by distinguishing between units of change. The most atomic unit of change is a hunk, which is a diff of a subset of lines of a file. A set of hunks (possibly across many files) is assembled into a commit, which includes an arbitrary message that describes the commit. Commits are assembled linearly into a branch, which can then be merged with the commits in another branch, often the master/default/golden branch. (Technically commits are nodes in a directed acyclic graph, since a merge commit has more than one parent, and a branch is just a reference to a commit, but the spirit of what I said is correct.) On GitHub, the concept of a pull request wraps a list of commits on a branch with a review and approval process, where it shows you the list of commits in order.

Take advantage of these three levels of specificity: use hunks to arrange your thoughts, commits to voice a command, and pull requests to direct the ensemble.

In particular, as a feature implementer, you can reduce review burden by separating the various concerns composing a feature into different commits. The pull request as a whole might consist of the core feature work, tests, some incidental cleanup-as-you-go, and some opportunistic refactoring. These can each go in different commits, and the core feature work usually comprises much less than half the total code to review.

Splitting even the core feature work into smaller commits makes reviewing much easier. For example, your commits for a feature might suggestively look like the following (where the top is the first commit and the bottom is the last commit):

9d7a191 - read the user's full name from the database
7d5c212 - unit tests for user name reading
cdb37c5 - include user's name in the user/ API response
7c5c62c - unit tests for user/ API name field
7b4ca44 - display the user's name on the profile page
8e72535 - integration test to verify name displayed
9bdf5b8 - sanitize the name field on submission
e11201b - unit tests for name submission
341abdc - refactor name -> full_name
331bcb2 - style fixes

Each unit is small and the reviewer reads the commits one at a time, in the order you present them. Then the reviewer approves them all as a whole or asks for revisions. This style results in faster reviews than if all the code is included in one commit because the reviewer need not reconstruct your story from scratch. Code is inherently laid out nonlinearly, so it’s hard to control what the reviewer sees and in what order. By crafting your pull request well, you draw attention to certain aspects of the change before showing its otherwise confusing implications. A story commit style is a natural way to achieve this.

This is the key benefit of the story approach. You get better control over how your code reviewer reads your code. This reduces cognitive burden for the reviewer and increases the likelihood that bugs are found during review.

There are also less obvious benefits that can have an outsized impact. Explaining your work as a story prompts you to think critically about your changes, suggesting redesigns and helping you catch errors using the same principle behind rubber duck debugging. Moreover, by revealing your thought process, the reviewer understands you better and can suggest better improvements. If all your code is lumped into one change, it’s easy for a reviewer to second guess the rationale behind a particular line of code. If that line is intentional and included in a small commit with a message that makes it clear it was intended (and for what reason), the question is preemptively answered and a bit of trust is built. Finally, the more you practice organizing your work as a clean story, the easier it is for your work to actually become that clean story. You learn to quickly assemble more efficient plans to get work done. You end up revising your work less often, or at least less often for stupid reasons.

By its design and tooling, git makes crafting narrative code changes easy. The tools that enable this are the staging area (a.k.a. the index), branches, cherry-picking, and interactive rebasing.

The staging area (or index) is a feature that allows you to mark parts of the changes in your workspace as “to be included in the next commit.” In other words, git has three possible states of a change: committed, staged for the next commit, and not yet staged. By default git diff shows only what has not been staged, and git diff --staged allows you to see only what’s staged, but not committed.

I find the staging area to be incredibly powerful for sorting out my partial work messes. I make messes because I don’t always know in advance if some tentative change is what I really want. I often have to make some additional changes to see how it plays out, and if I run into any roadblocks, how I would resolve them.

The staging area helps me be flexible: rather than commit the changes and undo them later (see the next paragraph), I can experiment with the mess, get it to a good place to start committing, and then repeatedly stage the first subset of changes I want to group into a commit, run git commit, and keep staging. This is seamless with editor support. I use vim-gitgutter, which allows me to simply navigate to the block I want to stage, type ,ha (“leader hunk add”), continue until I’m ready to commit, then drop to the command line and run git commit. Recall, the “hunk” is the smallest unit of change git supports (a minimal subset of lines changing a single file), and the three-layer “hunk-commit-pull” hierarchy of changes provides three layers of commitment support: hunks are what I organize into a commit (what I am ready to “commit” to being included in the feature I’m working on); commits are minimal semantic units of change that a reviewer can comprehend in the context of the larger feature; and a pull request is the smallest semantic unit of “approvable work,” feature work that maintains repository-wide continuous integration invariants.

Of course, I can’t always be right. Sometimes I will make some commits and realize I want to go back and do it differently. This is where branching, rebasing, and cherry-picking come in. The simplest case is when I made a mistake in something I committed. Then I can do an interactive rebase to basically go back to the point in time when I made the mistake, correct it, and go back to the present. Alternatively, I can fix the change now, commit it, and use interactive rebasing to combine the two commits post hoc. Provided you don’t run into any merge conflicts between these commits and other commits in your branch, this is seamless. I can also leave unstaged things like extra logging, notes, and debug code, or commit them with a magic string, and run a script before pushing that removes all commits from my branch whose message contains the magic string.

Another kind of failure is when I realize—after finishing half the work—that I can split the feature work into two smaller approvable units. In that case, I can extract the commits from my current branch to a new branch in a variety of ways (branch+rebase or cherry-pick), and then prepare the two separate (or dependent) pull requests.

This is impossible without keeping a fine-grained commit history while you’re developing. Otherwise you have to go back and manually split commits, which is more time consuming. Incidentally, this is a pain point of mine when using Perforce and Mercurial. In those systems the “commit” is the smallest approvable unit, which you can amend by including all local changes or none. While they provide some support for splitting bigger changes into smaller changes post hoc, I’ve yet to do this confidently and easily, because you have to go down from the entire change back to hunks. Git commits group hunks into semantically meaningful (and named) units that go together when reorganized. In my view, when others say a benefit of git is that “branches are cheap,” simple and fast reorganization is the true benefit. “Cheap branching” is a means to that end.

A third kind of mistake is one missed even by review. Such errors make it to the master branch and start causing bugs in production. Having a clean commit history with small commits is helpful here because it allows one to easily roll back the minimal bad change without rolling back the entire pull request it was contained in (though you can if you want).

Finally, the benefits of easy review are redoubled when looking at project history outside the context of a review. You can point new engineers who want to implement a feature to a similar previous feature, and rather than have them see all your code lumped into one blob, they can see the evolution of the feature in its cleanest and clearest form. Best practices, like testing and refactoring and when to make abstractions, are included in the story.

There is a famous quote by Hal Abelson, that “Programs must be written for people to read, and only incidentally for machines to execute.” The same view guides my philosophy for working with revisions: code changes must be written for people to read, and only incidentally to change codebases. Now that you’ve had a nice groan, let me ask you to reflect on this hyperbole the next time you encounter a confusing code review.

A Good Year for “A Programmer’s Introduction to Mathematics”

A year ago today I self-published “A Programmer’s Introduction to Mathematics” (PIM). In this short note I want to describe the success it’s had, summarize the complaints of some readers and the praise of others, and outline what’s next.

Since publication PIM has sold over 11,000 copies.

[Figure: A rough chart showing the sales of paperback and ebook copies of PIM.]

Here’s a table:

Month    Paperback  Ebook  Total
2018-12      4,323  1,866  6,189
2019-01      1,418    258  1,676
2019-02        852    128    980
2019-03        762    107    869
2019-04        357     63    420
2019-05        289     51    340
2019-06        200     58    258
2019-07        223     40    263
2019-08        158     43    201
2019-09        159     24    183
2019-10        155     44    199

As expected, the sales dropped significantly after the first month, but leveled out at around 200 copies per month. With self-publishing, royalties are generally high, and in the end I make a minimum of $15 on each sale. I’m thrilled. The book has been successful well beyond my expectations. I remember joking with friends that I’d call the project a failure if I didn’t sell a thousand copies, and my wife joking that if I sold twenty thousand I should quit my job. Reality found a happy middle ground.

Outside of my normal activity online, a pre-publication mailing list, and a $400/month cap on a Google Ads campaign, I’ve done little to promote the book. I don’t track conversion rates (pimbook.org is javascript-free). Though I can’t control what Amazon and Google collect on my behalf, I don’t look at it. I have an ad on the sidebar of my blog, which I suspect drives most of the sales via organic search results for various math topics.

Combining all the costs together, from Google Ads, Gumroad premium, tech review, and sending signed copies to Patreon patrons—but not including my time—a year of sales has cost about $4,500.

Not all readers enjoyed the book. As the Amazon reviewer “SCB” put it, “the book feels more like a refresher than an introduction.” User “zach” concurred, calling it “not a kind introduction,” and user “MysteryGuest” said its “difficulty ramps up quickly.” The other two emphatic criticisms I heard were that I did not include solutions for the exercises, nor did I provide a version of the ebook in standard ebook formats.

I won’t waste your time here rebutting critics; I broadly agree with their claims. The difficulty does ramp up quickly, I didn’t provide solutions, and I didn’t make a Kindle version. I have no plans to change that as of now.

On the other hand, many readers enjoyed the book. Amazon reviewer “B.W.” remarked, “First time I’ve ever read about BigO and actually had it ‘click’.” User “Scuba Enthusiast” said the book “conveys the cultural assumptions, conventions, and notations in a way that someone familiar with programming is able to ‘grok’.”

Many readers have also told me via less public means they enjoyed the book. A colleague of mine at Google—who is an excellent engineer but not all that interested in math—had the need to dig into some mathematical notation describing an integer linear program his team was responsible for. He said reading the first few chapters of my book helped him decipher the symbols and ultimately understand the model. Later he told me he had started to design his own optimization models.

I was also delighted to read Tim Urian’s thoughts, which he recently posted in a public Google doc. He detailed his struggles with reading the book, and how he overcame them.

Then there was a public form I put up for submitting errata. Users submitted over 200 errors, comments, and clarifications, and I’ve addressed them for the second edition. I’m glad to say most of the mathematical errors were not substantial: they were largely cases where I wrote one thing, later changed my mind about some detail (like indexing from zero or one), and did not adequately update what was already written.

Speaking of the second edition, I’ll be releasing it early next year. In addition to the errata, I’ve spruced up a number of the exercises, improved explanations of some things, added helpful reminders in spots where readers often got confused, and I’m working on a new short chapter about proofs and logical foundations. Chances are good that it’s not enough to justify buying a second copy, but I think after the second edition I’ll feel comfortable calling the book “done” for the foreseeable future. In the meantime, I’ve made the ebook for the first edition “pay what you want,” so you don’t feel cheated if you buy the ebook the day before the second edition comes out. I will probably leave the first edition ebook free after the second edition comes out.

I’m also very slowly working on the converse book, tentatively titled “A Mathematician’s Introduction to Programming.”

On a related note, I’m sad to say I haven’t taken more time to write blog posts. As my responsibilities at work have grown, I’ve become more swamped and less motivated. That’s also been coupled with some blog projects I have tried but failed to make meaningful progress on, and I feel bad abandoning them because there is no principled reason I can’t finish them.

Postscript: I’ve been asked a few times about the rights to foreign translations and distributions of the book. I have no experience in that area, and I’ve basically been ignoring all incoming requests. If anyone has advice about how to navigate this, and whether it’s worth the effort for a solo operation like my own, I’d be open to hearing it.

Silent Duels—Constructing the Solution part 1

Previous posts in this series:

Silent Duels and an Old Paper of Restrepo
Silent Duels—Parsing the Construction

Last time we waded into Restrepo’s silent duel paper. You can see the original and my re-typeset version on Github along with all of the code in this series. We digested Section 2 and a bit further, plotting some simplified examples of the symmetric version of the game.

I admit, this paper is not an easy read. Sections 3-6 present the central definitions, lemmas, and proofs. They’re a slog of dense notation with no guiding intuition. The symbols don’t even get catchy names or “think of this as” explanations! I think this disparity in communication has something to do with the period in which the paper was written, and something to do with the infancy of computing at the time. The paper was published in 1957, which marked the year IBM switched from vacuum tubes to transistors, long before Moore’s Law was even a twinkle in Gordon Moore’s eye. We can’t blame Restrepo for not appealing to our modern sensibilities.

I spent an embarrassing amount of time struggling through Sections 3-6 when I still didn’t really understand the form of the optimal strategy. It’s not until the very end of the paper (Section 8, the proof of Theorem 1) that we get a construction. See the last post for a detailed description of the data that constitutes the optimal strategy. In brief, it’s a partition of [0,1] into subintervals [a_i, a_{i+1}] with a probability distribution F_i on each interval, and the time you take your i-th action is chosen randomly according to F_i. Optionally, the last distribution can include a point-mass at time t=1, i.e., “wait until the last moment for a guaranteed hit.”

Section 8 describes how to choose the a_i‘s and b_j‘s, with the distributions F_i and G_j built according to the formulas built up in the previous sections.

Since our goal is still to understand how to construct the solution—even if why it works is still a mystery—we’ll write a program implementing this algorithm over the course of two posts. First, we’ll work with a simplified symmetric game, where the answer is provided for us as a test case. In a followup post, we’ll rework the code to construct the generic solution and pick nits about code quality and the finer points of the algorithm Restrepo leaves out.

Ultimately, if what the program outputs matches up with Restrepo’s examples (in lieu of understanding enough of the paper to construct our own), we will declare victory—we’ll have successfully sanity-checked Restrepo’s construction. Then we can move on to studying why this solution works and what caveats hide beneath the math.

Follow the Formulas

The input to the game is a choice of n actions for player 1, m actions for player 2, and probability distributions P, Q for the two players’ success probabilities (respectively). Here’s the algorithm as stated by Restrepo, with the referenced equations following. If you’re following along with my “work through the paper organically” shtick, I recommend you try parsing the text below before reading on. Recall \alpha is the “wait until the end” probability for player 1’s last action, and \beta is the analogous probability for player 2.

[Image: Restrepo’s description of the algorithm for computing the optimal strategy.]

[Image: The equations referenced above.]

Let’s sort through this mess.

First, the broad picture. The whole construction depends on \alpha, \beta, these point masses for the players’ final actions. However, there’s this condition that \alpha \beta = 0, i.e., at most one can be nonzero. This makes some vaguely intuitive sense: a player with more actions will have extra “to spare,” and so it may make sense for them to wait until the very end to get a guaranteed hit. But only one player can have such an advantage over the other, so only one of the two parameters may be nonzero. That’s my informal justification for \alpha \beta = 0.

Though we don’t know \alpha, \beta at the beginning, Restrepo’s construction (the choice of a_i‘s and b_j‘s) is a deterministic function of \alpha, \beta, and the other fixed inputs:

\displaystyle (\alpha, \beta) \mapsto (a_1, \dots, a_n, b_1, \dots, b_m)

The construction asserts that the optimal solution has a_1 = b_1, so we need to find an input (\alpha, \beta) \in [0,1) \times [0,1) such that \alpha \beta = 0 and the construction produces a_1 = b_1 as output. We’re searching for the “right” input parameters, using knowledge about the chained relationship of equations to look at the output and tweak the input to bring the output closer to what we want. It’s not gradient descent, but it could probably be rephrased that way.

In particular, consider the case when we get b_1 < a_1 (the other case is symmetric). Suppose that starting from \alpha = 0, \beta = 0 we construct all our a_i‘s and b_j‘s and get b_1 < a_1. Then we can try again with \alpha = 0, \beta = 1, but since \beta = 1 is illegal we’ll use \beta = 1 - \varepsilon for as small a \varepsilon as we need (to make the next sentence true). Restrepo claims that picking something close enough to 1 will reverse the output, i.e. will make a_1 < b_1. He then finishes with (my paraphrase), “obviously a_1, b_1 are continuous in terms of \beta, so a solution exists with a_1 = b_1 for some choice of \beta \neq 0; that’s the optimal solution.” Restrepo is relying on the intermediate value theorem from calculus, but to find that value the simplest option is binary search. We have the upper and lower bounds, \beta = 0 and \beta = 1 - \varepsilon, and we know when we’ve found our target: when the output has a_1 = b_1.
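To make that concrete, here is a minimal sketch of what the outer search might look like. It assumes a hypothetical function construct_transition_times(alpha, beta) that runs Restrepo’s construction and returns the two lists of transition times (nothing in this post defines it; it stands in for the construction we’re about to implement), and it assumes, as in the case above, that the \beta = 0 output has b_1 < a_1.

def find_beta(construct_transition_times, tolerance=1e-6):
    '''Binary search for the beta in [0, 1) that makes a_1 = b_1.

    Assumes construct_transition_times(alpha, beta) returns the pair of
    lists (a_1, ..., a_n), (b_1, ..., b_m), and that the beta = 0 output
    has b_1 < a_1 (otherwise run the same search over alpha instead).
    '''
    low, high = 0, 1 - tolerance  # beta = 1 itself is not allowed
    while high - low > tolerance:
        beta = (low + high) / 2
        a, b = construct_transition_times(0, beta)
        if b[0] < a[0]:
            low = beta   # beta still too small
        else:
            high = beta  # beta too large (or exactly right)
    return (low + high) / 2

The intermediate value theorem guarantees the crossing exists; the tolerance just controls how closely we pin it down.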

This binary search will come back in full swing in the next post, since we already know that the symmetric silent duel starts with a_1 = b_1. No search for \alpha, \beta is needed, and we can fix them both to zero for now—or rather, assume the right values are known.

What remains is to determine how to compute the a_i‘s and b_j‘s from a starting \alpha, \beta. We’ll go through the algorithm step by step using the symmetric game where P=Q (same action success probability) and n=m (same action count) to ground our study. A followup post will revisit these formulas in full generality.

The symmetric game

The basic idea of the construction is that we start from a computation of the last action parameters a_n, b_m, and use those inductively to compute the parameters of earlier actions via a few integrals and substitutions. In other words, the construction is a recursion, and the intervals in which the players take their last actions, [a_n, 1) and [b_m, 1), are the base case. As I started writing the programs below, I wanted to give a name to these a_i, b_j values. Restrepo seems to refer to them as “parameters” in the paper. I call them transition times, since they mark the instants at which a player “transitions” from one action interval, say [a_2, a_3], to the next, [a_3, a_4].

For a simple probability function P, the end of the algorithm results in equations similar to: choose a_{n-t} such that P(a_{n-t}) = 1/(2t + 3).

Recall, Player 1 has a special function used in each step to construct their optimal strategy, called f^* by Restrepo. It’s defined for the non-symmetric game as follows, where recall Q is the opponent’s action success probability:

\displaystyle f^*(t) = \prod_{b_j > t} (1-Q(b_j))\frac{Q'(t)}{Q^2(t)P(t)}

[Note the Q^2(t) is a product Q(t)Q(t); not an iterated function application.]

Here the b_j > t asks us to look at all the transition times computed in previous recursive steps, and compute the product of an action failure at those instants. This is the product \prod (1-Q(b_j)). This is multiplied by a mysterious fraction involving P, Q, which in the symmetric P=Q case reduces to P'(t) / P^3(t). In Python code, computing f^* is given below—called simply “f_star” because I don’t yet understand how to interpret it in a meaningful way. I chose to use SymPy to compute symbolic integrals and derivatives and to solve equations, so in the function below, prob_fun is a SymPy Lambda and prob_fun_var is a SymPy Symbol.

from sympy import diff

def f_star(prob_fun, prob_fun_var, larger_transition_times):
    '''Compute f* as in Restrepo '57.

    In this implementation, we're working in the simplified example
    where P = Q (both players' probabilities of success are the same).
    '''
    x = prob_fun_var
    P = prob_fun

    product = 1
    for a in larger_transition_times:
        product *= (1 - P(a))

    return product * diff(P(x), x) / P(x)**3

In this symmetric instance of the problem we already know that \alpha = \beta = 0 is the optimal solution (Restrepo states that in section 2), so we can fix \alpha=0, and compute a_n, which we do next.

In the paper, Restrepo says “compute without parameters in the definition of f^*” and this I take to mean that, because there are no larger action instants, the product \prod_{b_j > t} (1-P(b_j)) is empty, i.e., we pass an empty list of larger_transition_times. Restrepo does violate this by occasionally referring to a_{n+1} = 1 and b_{m+1} = 1, but if we included either of these, then since P(1) = 1 the factor 1 - P(a_{n+1}) = 0 would make f^* identically zero, which would produce a non-distribution, so that can’t be right. This is one of those cases where, when reading a math paper, you have to infer the interpretation that is most sensible, and give the author the benefit of the doubt.

Following the rest of the equations is trivial, except that we are solving for a_n, which is a limit of integration. Since SymPy works symbolically, however, we can simply tell it to integrate without knowing a_n, and then ask it to solve for a_n.

from sympy import Integral
from sympy import Symbol
from sympy.solvers import solve
from sympy.functions.elementary.miscellaneous import Max

def compute_a_n(prob_fun, alpha=0):
    P = prob_fun
    t = Symbol('t0', positive=True)
    a_n = Symbol('a_n', positive=True)

    a_n_integral = Integral(
        ((1 + alpha) - (1 - alpha) * P(t)) * f_star(P, t, []), (t, a_n, 1))
    a_n_integrated = a_n_integral.doit()   # yes now "do it"
    P_a_n_solutions = solve(a_n_integrated - 2 * (1 - alpha), P(a_n))
    P_a_n = Max(*P_a_n_solutions)
    print("P(a_n) = %s" % P_a_n)

    a_n_solutions = solve(P(a_n) - P_a_n, a_n)
    a_n_solutions_in_range = [soln for soln in a_n_solutions if 0 < soln <= 1]
    assert len(a_n_solutions_in_range) == 1
    a_n = a_n_solutions_in_range[0]
    print("a_n = %s" % a_n)

    h_n_integral = Integral(f_star(P, t, []), (t, a_n, 1))
    h_n_integrated = h_n_integral.doit()
    h_n = (1 - alpha) / h_n_integrated
    print("h_n = %s" % h_n)

    return (a_n, h_n)

There are three phases here. First, we integrate and solve for a_n (blindly, according to equation 27 in the paper). If you work out this integral by hand (expanding f^*), you’ll notice it looks like P'(t)/P^2(t), which suggests a natural substitution, u=P(t). After computing the integral (entering phase 2), we can maintain that substitution to first solve for P(a_n), say the output of that is some known number z, which in the code we call P_a_n, and then solve P(a_n) = z for a_n. Since that last equation can have multiple solutions, we pick the one between 0 and 1. Since P(t) must be increasing in [0,1], that guarantees uniqueness.
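As a quick sanity check of that substitution, take \alpha = 0, use the empty product in f^*, assume P(1) = 1 (the guaranteed hit at time 1), and write z = P(a_n). The equation solved in the code above becomes

\displaystyle \int_{a_n}^1 (1 - P(t)) \frac{P'(t)}{P^3(t)} \, dt = \int_{z}^{1} \frac{1-u}{u^3} \, du = \frac{1}{2} - \frac{1}{z} + \frac{1}{2z^2} = 2(1 - \alpha) = 2.

Clearing denominators gives 3z^2 + 2z - 1 = 0, whose only root in [0,1] is z = P(a_n) = 1/3. Notably, this value doesn’t depend on P at all (beyond the assumption P(1) = 1), which matches the output of both examples below.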

Note, we didn’t need to maintain the substitution in the integral and perform a second solve. We could just tell SymPy to solve directly for a_n, and it would solve P(a_n) = z in addition to computing the integral. But as I was writing, it was helpful for me to check my work in terms of the substitution. In the next post we’ll clean that code up a bit.

Finally, in the third phase we compute h_n, which is a normalizing factor that ultimately ensures the probability distribution for the action in this interval has total probability mass 1.
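The same substitution pins down h_n, again under the assumption that P(1) = 1:

\displaystyle \int_{a_n}^1 \frac{P'(t)}{P^3(t)} \, dt = \int_{1/3}^{1} \frac{du}{u^3} = \frac{1}{2(1/3)^2} - \frac{1}{2} = 4, \qquad h_n = \frac{1 - \alpha}{4} = \frac{1}{4}.

So at least for h_n, the normalizing constant doesn’t depend on the particular choice of P.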

The steps to compute the iterated lower action parameters (a_i for 1 \leq i < n) are similar, but the formulas are slightly different.

Note that the last action instant a_{i+1} and its normalizing constant h_{i+1} show up in the equation to compute a_i. In code, this is largely the same as the compute_a_n function, but in a loop. Along the way, we print some helpful diagnostics for demonstration purposes. These should end up as unit tests, but as I write the code for the first time I prefer to debug this way. I’m not even sure if I understand the construction well enough to do the math myself and write down unit tests that make sense; the first time I tried I misread the definition of f^* and I filled pages with confounding and confusing integrals!

from collections import deque

def compute_as_and_bs(prob_fun, n, alpha=0):
    '''Compute the a's and b's for the symmetric silent duel.'''
    P = prob_fun
    t = Symbol('t0', positive=True)

    a_n, h_n = compute_a_n(prob_fun, alpha=alpha)

    normalizing_constants = deque([h_n])
    transitions = deque([a_n])
    f_star_products = deque([1, 1 - P(a_n)])

    for step in range(n):
        # prepending new a's and h's to the front of the list
        last_a = transitions[0]
        last_h = normalizing_constants[0]
        next_a = Symbol('a', positive=True)

        next_a_integral = Integral(
            (1 - P(t)) * f_star(P, t, transitions), (t, next_a, last_a))
        next_a_integrated = next_a_integral.doit()
        # print("%s" % next_a_integrated)
        P_next_a_solutions = solve(next_a_integrated - 1 / last_h, P(next_a))
        print("P(a_{n-%d}) is one of %s" % (step + 1, P_next_a_solutions))
        P_next_a = Max(*P_next_a_solutions)

        next_a_solutions = solve(P(next_a) - P_next_a, next_a)
        next_a_solutions_in_range = [
            soln for soln in next_a_solutions if 0 < soln <= 1]
        assert len(next_a_solutions_in_range) == 1
        next_a_soln = next_a_solutions_in_range[0]
        print("a_{n-%d} = %s" % (step + 1, next_a_soln))

        next_h_integral = Integral(
            f_star(P, t, transitions), (t, next_a_soln, last_a))
        next_h = 1 / next_h_integral.doit()
        print("h_{n-%d} = %s" % (step + 1, next_h))

        print("dF_{n-%d} coeff = %s" % (step + 1, next_h * f_star_products[-1]))

        f_star_products.append(f_star_products[-1] * (1 - P_next_a))
        transitions.appendleft(next_a_soln)
        normalizing_constants.appendleft(next_h)

    return transitions

Finally, we can run it with the simplest possible probability function: P(t) = t

from sympy import Lambda

x = Symbol('x')
compute_as_and_bs(Lambda((x,), x), 3)

The output is

P(a_n) = 1/3
a_n = 1/3
h_n = 1/4
P(a_{n-1}) is one of [1/5]
a_{n-1} = 1/5
h_{n-1} = 3/16
dF_{n-1} coeff = 1/8
P(a_{n-2}) is one of [1/7]
a_{n-2} = 1/7
h_{n-2} = 5/32
dF_{n-2} coeff = 1/12
P(a_{n-3}) is one of [1/9]
a_{n-3} = 1/9
h_{n-3} = 35/256
dF_{n-3} coeff = 1/16

This matches up so far with Restrepo’s example, since P(a_{n-k}) = a_{n-k} = 1/(2k+3) gives 1/3, 1/5, 1/7, 1/9, \dots. Since we have the normalizing constants h_i, we can also verify the probability distribution for each action aligns with Restrepo’s example. The constant in the point mass function is supposed to be h_i \prod_{j=i+1}^n (1-P(a_j)). This is what I printed out as dF_{n-k}. In Restrepo’s example, this is expected to be \frac{1}{4(k+1)}, which is exactly what’s printed out.
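For instance, checking the k = 1 case by hand against the constants computed above:

\displaystyle h_{n-1} (1 - P(a_n)) = \frac{3}{16} \cdot \left(1 - \frac{1}{3}\right) = \frac{1}{8} = \frac{1}{4(1+1)}.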

Another example using P(t) = t^3:

P(a_n) = 1/3
a_n = 3**(2/3)/3
h_n = 1/4
P(a_{n-1}) is one of [-1/3, 1/5]
a_{n-1} = 5**(2/3)/5
h_{n-1} = 3/16
dF_{n-1} coeff = 1/8
P(a_{n-2}) is one of [-1/5, 1/7]
a_{n-2} = 7**(2/3)/7
h_{n-2} = 5/32
dF_{n-2} coeff = 1/12
P(a_{n-3}) is one of [-1/7, 1/9]
a_{n-3} = 3**(1/3)/3
h_{n-3} = 35/256
dF_{n-3} coeff = 1/16

One thing to notice here is that the normalizing constants don’t appear to depend on the distribution. Is this a coincidence for this example, or a pattern? I’m not entirely sure.

Next time we’ll rewrite the code from this post so that it can be used to compute the generic (non-symmetric) solution, see what those solutions can tell us, and from there we’ll start diving into the propositions and proofs.

Until next time!

Math Versus Dirty Data

At Google, our organization designs, owns, and maintains a number of optimization models that automate the planning of Google’s datacenter growth and health. As is pretty standard in supply chain optimization and planning, these models are often integer linear programs. It’s a core competency of operations research, after all.

One might think, “Large optimization problems? That sounds hard!” But it’s actually far from the hardest part of the job. In fact, it’s one of the few exciting parts of the job. The real hard part is getting data. Really, it’s that you get promised data that never materializes, and then you architect your system for features that rot before they ripen.

There’s a classic image of a human acting as if they’re throwing a ball for a dog, and the dog sprints off, only soon to realize the ball was never thrown. The ball is the promise of freshly maintained data, and recently I’ve been the dog.

When you don’t have good data, or you have data that’s bad in a known way, you can always try to design your model to accommodate the deficiencies. As long as the badness is clearly defined, it’s not beyond our reach. The math is fun and challenging, and I don’t want to shy away from it. My mathematician’s instinct pulls me left.

My instincts as an engineer pull me right: data issues will ultimately cause unexpected edge cases at the worst moment, and it will fall on me to spend all day debugging before a deadline tomorrow. Data issues lead to more complicated modeling features, which further interact with other parts of the model and the broader system in confounding ways. Worst of all, it’s nearly impossible to signal problems to customers who depend on your output. When technical debt is baked into an optimization model as features, it makes explanation much harder. Accepting bad data also requires you to write the code in a way that is easy to audit, since you need to audit literally everything. Transparency is good, but it’s tedious to do generically well, and the returns are not worth it if the end result is, “well, we can’t fix the data for two years anyway.”

Though a lot of this technical debt was introduced by predecessors who left the team, I’ve fallen for the mathematical siren’s call a few times. Go on, just add that slick new constraint. Just mask that misbehavior with a heuristic. It has bitten back hard and caused months of drag.

These days I’m swinging hard right on the pendulum. Delete half-implemented features that don’t have data to support them. Delete features that don’t have a clear connection to business needs (even if they work). Push back on new feature requests until the data exists. Require a point of contact and an SLO for any data you don’t own. Make speculative features easy to turn on/off (or remove!) without having to redesign the architecture. If it can’t be made easy to remove, don’t add it until you’re sure it will survive.

If you can’t evade bad data, err on the side of strict initial validation, and do nothing or gracefully degrade service when validation fails. Expose the failures as alerts to the people who own the data (not you), and give the data owners a tool that repeats the validation logic in your system verbatim, so there is no confusion about the criteria for success. When you take this view, almost all of the complexity in your system lies in enabling this generic auditing, alerting, and management of intricate (but ultimately arbitrary) policy.

I like to joke that I don’t have data-intensive applications or problems of scale, but rather policy-intensive applications. I haven’t found much insight from other software engineers about how to design and maintain policy-intensive software. Let me know if you have some! The obvious first step is to turn policy code into data. To the extent that we’ve done this, I adore that aspect of our systems. Still, you can’t avoid policy-heavy code entirely when the policies need to be encoded in an optimization model.

I do get sad that so much of my time is spent poop-smithing, as I like to say, even though we’re gradually getting better. Our systems need maintenance and care, and strong principles to keep the thicket from overwhelming us. For one, I track net lines of code added, with the goal to have it be net negative month over month, new features and all. We’ve kept it up for six months so far. Even our fixit week this week seems unnecessary, given how well our team has internalized paying off technical debt.

Though I do wonder what it’s all for. So Google can funnel the money it saves on datacenter costs into informing people the Earth is flat, or into cat videos? If I didn’t have two particular internal side projects to look forward to—they involve topics I’m very interested in—I’d be bored, and I might succumb to jaded feelings, and I’d need a change. Certain perks and particularly enjoyable colleagues help avoid that. But still, I rarely have time to work on the stimulating projects, and even the teammates I’ve been delegating to often defer them in favor of other priorities.

We let dirty data interfere with our design and architecture; now we’re paying back all that technical debt, and as a consequence there’s no time for our human flourishing.

I should open a math cafe.