Recently my employer (Google) forced me to switch to Mercurial instead of my usual version control system, git. The process of switching sparked a few discussions between me and my colleagues about the value of various version control systems. A question like “what benefit does git provide over Mercurial” yielded no clear answers, suggesting many developers don’t know. An informal Twitter survey didn’t refute this claim.
A distinguished value git provides me is the ability to sculpt code changes into stories. It does this by
- Allowing changes to be as granular as possible.
- Providing good tools for manipulating changes.
- Treating the change history as a first-class object that can be manipulated.
- Making branching cheap, simple, and transparent.
This might be summed up by some as “git lets you rewrite history,” but to me it’s much more. Working with code is nonlinear by nature, which makes changes hard to communicate well. Wielding git well lets me easily draft, edit, decompose, mix, and recombine changes with ease. Thus, I can narrate a large change in a way that’s easy to review, reduces review cycle time, and makes hindsight clear. Each pull request is a play, each commit a scene, each hunk a line of dialogue or stage direction, and git is a director’s tool.
The rest of this article will expand on these ideas. For those interested in learning more about git from a technical perspective, I enjoyed John Wiegley’s Git from the Ground Up. That short book will make rigorous many of the terms I will use more loosely in this article, but basic git familiarity is all that’s required to understand the gist of this article. My philosophy of using git is surely unoriginal. It is no doubt influenced by the engineers I’ve worked with and the things I’ve read. At best, my thoughts here refine and incrementally expand upon what I’ve picked up from others.
If Lisp’s great insight was that code is data that programmers can take advantage of that with metaprogramming, then git’s great insight is that code changes are data and programmers can take advantage of that with metachanges. Changes are the data you produce while working on features and bug fixes. Metachanges are the changes you make to your changes to ready them for review. Embracing metachanges enables better cleanliness and clearer communication. Git supports metachanges with few limits, and without sacrificing flexibility.
For instance, if you want, you can treat git like a replacement for Dropbox. You keep a single default branch, you do
git pull, edit code, and run
git add --all && git commit -m "do stuff" && git push. This saves all your work and pushes it to the server. You could even alias this to
git save. I admit that I basically do this for projects of no real importance.
Such sloppy usage violates my personal philosophy of git. That is, you should tell a clear story with your commits. It’s not just that a commit message like “do stuff” is useless, it’s that the entire unit of work is smashed into one change and can’t easily be teased apart or understood incrementally.
This is problematic for code review, which is a crucial part of software development. The cost and cognitive burden of a unit of code review scales superlinearly with the amount of code to review (citation needed, but this is my personal experience). However, sometimes large code reviews are necessary. Large refactors, extensive testing scenarios, and complex features often cannot be split into distinct changesets or pull requests. In addition, most continuous integration frameworks require that after every merge of a changeset or pull request, all tests pass and the product is deployable. That means you can’t submit an isolated changeset that causes tests to fail or performs partial refactors without doing more work and introducing more opportunities to make mistakes.
In light of this, I want to reduce the review burden for my reviewers, and encourage people I’m reviewing for to reduce the burden for me. This is a human toll. The best way to help someone understand a complex change is to break it into the smallest possible reasonable, meaningful units of change, and then compose the pieces together in a logical way. That is, to tell a story.
Git enables this by distinguishing between units of change. The most atomic unit of change is a hunk, which is a diff of a subset of lines of a file. A set of hunks (possibly across many files) are assembled into a commit, which includes an arbitrary message that describes the commit. Commits are assembled linearly into a branch which can then be merged with the commits in another branch, often the master/default/golden branch. (Technically commits are nodes in a tree, each commit having a unique parent, and a branch is just a reference to a commit, but the spirit of what I said is correct.) On GitHub, the concept of a pull request wraps a list of commits on a branch with a review and approval process, where it shows you the list of commits in order.
Take advantage of these three levels of specificity: use hunks to arrange your thoughts, commits to voice a command, and pull requests to direct the ensemble.
In particular, as a feature implementer, you can reduce review burden by separating the various concerns composing a feature into different commits. The pull request as a whole might consist of the core feature work, tests, some incidental cleanup-as-you-go, and some opportunistic refactoring. These can each go in different commits, and the core feature work usually comprises much less than half the total code to review.
Splitting even the core feature work into smaller commits makes reviewing much easier. For example, your commits for a feature might suggestively look like the following (where the top is the first commit and the bottom is the last commit):
9d7a191 - read the user's full name from the database
7d5c212 - unit tests for user name reading
cdb37c5 - include user's name in the user/ API response
7c5c62c - unit tests for user/ API name field
7b4ca44 - display the user's name on the profile page
8e72535 - integration test to verify name displayed
9bdf5b8 - sanitize the name field on submission
e11201b - unit tests for name submission
341abdc - refactor name -> full_name
331bcb2 - style fixes
Each unit is small and the reviewer reads the commits one at a time, in the order you present them. Then the reviewer approves them all as a whole or asks for revisions. This style results in faster reviews than if all the code is included in one commit because the reviewer need not reconstruct your story from scratch. Code is inherently laid out nonlinearly, so it’s hard to control what the reviewer sees and in what order. By crafting your pull request well, you draw attention to certain aspects of the change before showing its otherwise confusing implications. A story commit style is a natural way to achieve this.
This is the key benefit of the story approach. You get better control over how your code reviewer reads your code. This reduces cognitive burden for the reviewer and increases the likelihood that bugs are found during review.
There are also less obvious benefits that can have an outsized impact. Explaining your work as a story prompts you to think critically about your changes, suggesting redesigns and helping you catch errors using the same principle behind rubber duck debugging. Moreover, by revealing your thought process, the reviewer understands you better and can suggest better improvements. If all your code is lumped into one change, it’s easy for a reviewer to second guess the rationale behind a particular line of code. If that line is intentional and included in a small commit with a message that makes it clear it was intended (and for what reason), the question is preemptively answered and a bit of trust is built. Finally, the more you practice organizing your work as a clean story, the easier it is for your work to actually become that clean story. You learn to quickly assemble more efficient plans to get work done. You end up revising your work less often, or at least less often for stupid reasons.
By its design and tooling, git makes crafting narrative code changes easy. The tools that enable this are the staging area (a.k.a. the index), branches, cherry-picking, and interactive rebasing.
The staging area (or index) is a feature that allows you to mark parts of the changes in your workspace as “to be included in the next commit.” In other words, git has three possible states of a change: committed, staged for the next commit, and not yet staged. By default
git diff shows only what has not been staged, and
git diff --staged allows you to see only what’s staged, but not committed.
I find the staging area to be incredibly powerful for sorting out my partial work messes. I make messes because I don’t always know in advance if some tentative change is what I really want. I often have to make some additional changes to see how it plays out, and if I run into any roadblocks, how I would resolve them.
The staging area helps me be flexible: rather than commit the changes and undo them later (see next paragraph), I can experiment with the mess, get it to a good place to start committing, and then repeatedly stage the first subset of changes I want to group into a commit,
git commit, and keep staging. This is seamless with editor support. I use vim-gitgutter, which allows me to simply navigate to the block I want to stage, type
,ha (“leader hunk add”), continue until I’m ready to commit, then drop to the command line and run
git commit. Recall, the “hunk” is the smallest unit of change git supports (a minimal subset of lines changing a single file), and the three layer “hunk-commit-pull” hierarchy of changes provides three layers of commitment support: hunks are what I organize into a commit (what I am ready to “commit” to being included in the feature I’m working on), commits are minimal semantic units of change that can be comprehended by a reviewer in the context of the larger feature, and a pull request is the smallest semantic unit of “approvable work” (feature work that maintains repository-wide continuous integration invariants).
Of course, I can’t always be right. Sometimes I will make some commits and realize I want to go back and do it differently. This is where branching, rebasing, and cherry-picking come in. The simplest case is when I made a mistake in something I committed. Then I can do an interactive rebase to basically go back to the point in time when I made the mistake, correct it, and go back to the present. Alternatively, I can fix the change now, commit it, and use interactive rebasing to combine the two commits post hoc. Provided you don’t run into any merge conflicts between these commits and other commits in your branch, this is seamless. I can also leave unstaged things like extra logging, notes, and debug code, or commit them with a magic string, and run a script before pushing that removes all commits from my branch whose message contains the magic string.
Another kind of failure is when I realize—after finishing half the work—that I can split the feature work into two smaller approvable units. In that case, I can extract the commits from my current branch to a new branch in a variety of ways (branch+rebase or cherry-pick), and then prepare the two separate (or dependent) pull requests.
This is impossible without keeping a fine-grained commit history while you’re developing. Otherwise you have to go back and manually split commits, which is more time consuming. Incidentally, this is a pain point of mine when using Perforce and Mercurial. In those systems the “commit” is the smallest approvable unit, which you can amend by including all local changes or none. While they provide some support for splitting bigger changes into smaller changes post hoc, I’ve yet to do this confidently and easily, because you have to go down from the entire change back to hunks. Git commits group hunks into semantically meaningful (and named) units that go together when reorganized. In my view, when others say a benefit of git is that “branches are cheap,” simple and fast reorganization is the true benefit. “Cheap branching” is a means to that end.
A third kind of mistake is one missed even by review. Such errors make it to the master branch, and start causing bugs in production. Having a clean commit history with small commits is helpful here because it allows one to easily rolled back the minimal bad change without rolling back the entire pull request it was contained in (though you can if you want).
Finally, the benefits of easy review are redoubled when looking at project history outside the context of a review. You can point new engineers who want to implement a feature to a similar previous feature, and rather than have them see all your code lumped into one blob, they can see the evolution of the feature in its cleanest and clearest form. Best practices, like testing and refactoring and when to make abstractions, are included in the story.
There is a famous quote by Hal Abelson, that “Programs must be written for people to read, and only incidentally for machines to execute.” The same view guides my philosophy for working with revisions, that code changes must be written for people to read and incidentally to change codebases. Now that you’ve had a nice groan, let me ask you to reflect on this hyperbole the next time you encounter a confusing code review.