Recently my employer (Google) forced me to switch to Mercurial instead of my usual version control system, git. The process of switching sparked a few discussions between me and my colleagues about the value of various version control systems. A question like “what benefit does git provide over Mercurial” yielded no clear answers, suggesting many developers don’t know. An informal Twitter survey didn’t refute this claim.
A distinguishing value git provides me is the ability to sculpt code changes into stories. It does this by:
- Allowing changes to be as granular as possible.
- Providing good tools for manipulating changes.
- Treating the change history as a first-class object that can be manipulated.
- Making branching cheap, simple, and transparent.
This might be summed up by some as “git lets you rewrite history,” but to me it’s much more. Working with code is nonlinear by nature, which makes changes hard to communicate well. Wielding git well lets me draft, edit, decompose, mix, and recombine changes with ease. Thus, I can narrate a large change in a way that’s easy to review, reduces review cycle time, and makes hindsight clear. Each pull request is a play, each commit a scene, each hunk a line of dialogue or stage direction, and git is a director’s tool.
The rest of this article will expand on these ideas. For those interested in learning more about git from a technical perspective, I enjoyed John Wiegley’s Git from the Bottom Up. That short book will make rigorous many of the terms I will use more loosely in this article, but basic git familiarity is all that’s required to understand the gist of this article. My philosophy of using git is surely unoriginal. It is no doubt influenced by the engineers I’ve worked with and the things I’ve read. At best, my thoughts here refine and incrementally expand upon what I’ve picked up from others.
If Lisp’s great insight was that code is data and programmers can take advantage of that with metaprogramming, then git’s great insight is that code changes are data and programmers can take advantage of that with metachanges. Changes are the data you produce while working on features and bug fixes. Metachanges are the changes you make to your changes to ready them for review. Embracing metachanges enables better cleanliness and clearer communication. Git supports metachanges with few limits, and without sacrificing flexibility.
For instance, if you want, you can treat git like a replacement for Dropbox. You keep a single default branch, you do `git pull`, edit code, and run `git add --all && git commit -m "do stuff" && git push`. This saves all your work and pushes it to the server. You could even alias this to `git save`. I admit that I basically do this for projects of no real importance.
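As a minimal sketch of that alias, the script below builds a throwaway project with a local bare repository standing in for the server, then defines `save` as a repository-local alias. The directory names, author identity, and file name are made up for illustration, and the alias uses `git push origin HEAD` rather than a bare `git push` so that it also works before an upstream branch has been configured.

```shell
set -eu

# Throwaway sandbox: a bare repository plays the role of the server.
work=$(mktemp -d)
git init -q --bare "$work/server.git"
git init -q "$work/project"
cd "$work/project"
git config user.name "Example Author"
git config user.email "author@example.com"
git remote add origin "$work/server.git"

# The leading "!" makes git run the alias body through the shell.
git config alias.save '!git add --all && git commit -q -m "do stuff" && git push -q origin HEAD'

# Do some "work", then save it in one step.
echo "some work" > notes.txt
git save

# The single catch-all commit is now on the server.
git log -1 --format='%h %s'
```

Passing `--global` to `git config` instead would make the alias available in every repository.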
Such sloppy usage violates my personal philosophy of git. That is, you should tell a clear story with your commits. It’s not just that a commit message like “do stuff” is useless, it’s that the entire unit of work is smashed into one change and can’t easily be teased apart or understood incrementally.
This is problematic for code review, which is a crucial part of software development. The cost and cognitive burden of a unit of code review scales superlinearly with the amount of code to review (citation needed, but this is my personal experience). However, sometimes large code reviews are necessary. Large refactors, extensive testing scenarios, and complex features often cannot be split into distinct changesets or pull requests. In addition, most continuous integration frameworks require that after every merge of a changeset or pull request, all tests pass and the product is deployable. That means you can’t submit an isolated changeset that causes tests to fail or performs partial refactors without doing more work and introducing more opportunities to make mistakes.
In light of this, I want to reduce the review burden for my reviewers, and encourage people I’m reviewing for to reduce the burden for me. This is a human toll. The best way to help someone understand a complex change is to break it into the smallest possible reasonable, meaningful units of change, and then compose the pieces together in a logical way. That is, to tell a story.
Git enables this by distinguishing between units of change. The most atomic unit of change is a hunk, which is a diff of a subset of lines of a file. A set of hunks (possibly across many files) is assembled into a commit, which includes an arbitrary message that describes the commit. Commits are assembled linearly into a branch which can then be merged with the commits in another branch, often the master/default/golden branch. (Technically commits are nodes in a directed acyclic graph: an ordinary commit has a single parent, a merge commit has several, and a branch is just a reference to a commit. The spirit of what I said is still correct.) On GitHub, the concept of a pull request wraps a list of commits on a branch with a review and approval process, where it shows you the list of commits in order.
Take advantage of these three levels of specificity: use hunks to arrange your thoughts, commits to voice a command, and pull requests to direct the ensemble.
In particular, as a feature implementer, you can reduce review burden by separating the various concerns composing a feature into different commits. The pull request as a whole might consist of the core feature work, tests, some incidental cleanup-as-you-go, and some opportunistic refactoring. These can each go in different commits, and the core feature work usually comprises much less than half the total code to review.
Splitting even the core feature work into smaller commits makes reviewing much easier. For example, your commits for a feature might suggestively look like the following (where the top is the first commit and the bottom is the last commit):
9d7a191 - read the user's full name from the database
7d5c212 - unit tests for user name reading
cdb37c5 - include user's name in the user/ API response
7c5c62c - unit tests for user/ API name field
7b4ca44 - display the user's name on the profile page
8e72535 - integration test to verify name displayed
9bdf5b8 - sanitize the name field on submission
e11201b - unit tests for name submission
341abdc - refactor name -> full_name
331bcb2 - style fixes
Each unit is small and the reviewer reads the commits one at a time, in the order you present them. Then the reviewer approves them all as a whole or asks for revisions. This style results in faster reviews than if all the code is included in one commit because the reviewer need not reconstruct your story from scratch. Code is inherently laid out nonlinearly, so it’s hard to control what the reviewer sees and in what order. By crafting your pull request well, you draw attention to certain aspects of the change before showing its otherwise confusing implications. A story commit style is a natural way to achieve this.
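One way to preview the story in the order a reviewer will read it is `git log --reverse`. The sketch below is only an illustration: it builds a throwaway repository with empty commits whose subjects are borrowed from the list above, and the `base` variable is a stand-in for wherever your branch forked from.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

git commit -q --allow-empty -m "initial commit"
base=$(git rev-parse HEAD)   # stand-in for the branch point

# Empty commits are enough to illustrate the ordering.
for subject in \
    "read the user's full name from the database" \
    "unit tests for user name reading" \
    "include user's name in the user/ API response"
do
    git commit -q --allow-empty -m "$subject"
done

# Oldest first: the order in which the story is meant to be read.
story=$(git log --reverse --format=%s "$base"..HEAD)
printf '%s\n' "$story"
```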
This is the key benefit of the story approach. You get better control over how your code reviewer reads your code. This reduces cognitive burden for the reviewer and increases the likelihood that bugs are found during review.
There are also less obvious benefits that can have an outsized impact. Explaining your work as a story prompts you to think critically about your changes, suggesting redesigns and helping you catch errors using the same principle behind rubber duck debugging. Moreover, by revealing your thought process, the reviewer understands you better and can suggest better improvements. If all your code is lumped into one change, it’s easy for a reviewer to second guess the rationale behind a particular line of code. If that line is intentional and included in a small commit with a message that makes it clear it was intended (and for what reason), the question is preemptively answered and a bit of trust is built. Finally, the more you practice organizing your work as a clean story, the easier it is for your work to actually become that clean story. You learn to quickly assemble more efficient plans to get work done. You end up revising your work less often, or at least less often for stupid reasons.
By its design and tooling, git makes crafting narrative code changes easy. The tools that enable this are the staging area (a.k.a. the index), branches, cherry-picking, and interactive rebasing.
The staging area (or index) is a feature that allows you to mark parts of the changes in your workspace as “to be included in the next commit.” In other words, git has three possible states of a change: committed, staged for the next commit, and not yet staged. By default `git diff` shows only what has not been staged, and `git diff --staged` allows you to see only what’s staged, but not committed.
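The three states can be seen in a small experiment. This is a sketch in a throwaway repository with made-up file names: both files are edited, only one is staged, and the two forms of `git diff` report different files.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# State 1: committed.
printf 'alpha\n' > a.txt
printf 'beta\n' > b.txt
git add a.txt b.txt
git commit -q -m "add a and b"

# Edit both files, but stage only a.txt.
printf 'alpha, edited\n' > a.txt
printf 'beta, edited\n' > b.txt
git add a.txt

# State 2: staged for the next commit.
staged=$(git diff --staged --name-only)
# State 3: changed but not yet staged.
unstaged=$(git diff --name-only)

echo "staged:     $staged"
echo "not staged: $unstaged"
```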
I find the staging area to be incredibly powerful for sorting out my partial work messes. I make messes because I don’t always know in advance if some tentative change is what I really want. I often have to make some additional changes to see how it plays out, and if I run into any roadblocks, how I would resolve them.
The staging area helps me be flexible: rather than commit the changes and undo them later (see next paragraph), I can experiment with the mess, get it to a good place to start committing, and then repeatedly stage the first subset of changes I want to group into a commit, `git commit`, and keep staging. This is seamless with editor support. I use vim-gitgutter, which allows me to simply navigate to the block I want to stage, type `,ha` (“leader hunk add”), continue until I’m ready to commit, then drop to the command line and run `git commit`.
Recall, the “hunk” is the smallest unit of change git supports (a minimal subset of lines changing a single file), and the three-layer “hunk-commit-pull” hierarchy of changes provides three layers of commitment support: hunks are what I organize into a commit (what I am ready to “commit” to being included in the feature I’m working on), commits are minimal semantic units of change that can be comprehended by a reviewer in the context of the larger feature, and a pull request is the smallest semantic unit of “approvable work” (feature work that maintains repository-wide continuous integration invariants).
Of course, I can’t always be right. Sometimes I will make some commits and realize I want to go back and do it differently. This is where branching, rebasing, and cherry-picking come in. The simplest case is when I made a mistake in something I committed. Then I can do an interactive rebase to basically go back to the point in time when I made the mistake, correct it, and go back to the present. Alternatively, I can fix the change now, commit it, and use interactive rebasing to combine the two commits post hoc. Provided you don’t run into any merge conflicts between these commits and other commits in your branch, this is seamless. I can also leave unstaged things like extra logging, notes, and debug code, or commit them with a magic string, and run a script before pushing that removes all commits from my branch whose message contains the magic string.
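Both ideas can be sketched with stock git commands. The example below is one possible approach, not necessarily the script described above: it records a correction with `git commit --fixup`, folds it into the original commit with `git rebase -i --autosquash`, and then drops every commit whose subject contains a marker. The marker `TEMP-DO-NOT-PUSH`, the file names, and the commit subjects are all hypothetical, and `GIT_SEQUENCE_EDITOR` is set only so the rebases run without opening an editor.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

git commit -q --allow-empty -m "initial commit"
base=$(git rev-parse HEAD)

# A commit that contains a mistake (a typo in the file contents).
echo "naem" > name.txt
git add name.txt
git commit -q -m "read the user's full name"
mistake=$(git rev-parse HEAD)

# A temporary commit, tagged with the hypothetical magic string.
echo "debug output" > debug.log
git add debug.log
git commit -q -m "TEMP-DO-NOT-PUSH extra logging"

echo "tests" > test.txt
git add test.txt
git commit -q -m "unit tests for name reading"

# Fix the mistake now and mark the fix as belonging to the earlier commit.
echo "name" > name.txt
git add name.txt
git commit -q --fixup "$mistake"

# Fold the fixup into its target; "true" accepts the generated todo list as is.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash "$base"

# Drop temporary commits by deleting their lines from the todo list.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '/TEMP-DO-NOT-PUSH/d'" git rebase -q -i "$base"

git log --format='%h %s' "$base"..HEAD
```

If a dropped commit overlaps with later commits, the rebase stops with a conflict, which is the same caveat about merge conflicts mentioned above.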
Another kind of failure is when I realize—after finishing half the work—that I can split the feature work into two smaller approvable units. In that case, I can extract the commits from my current branch to a new branch in a variety of ways (branch+rebase or cherry-pick), and then prepare the two separate (or dependent) pull requests.
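Here is a sketch of the cherry-pick route, assuming the two halves of the work touch independent files. The branch names (`main`, `big-feature`, `api-only`, `page-only`) and file names are invented for the example.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the default branch explicitly
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "initial commit"

# The original branch mixes two independent pieces of work.
git checkout -q -b big-feature
echo "api" > api.txt
git add api.txt
git commit -q -m "include user's name in the API response"
api_commit=$(git rev-parse HEAD)

echo "page" > page.txt
git add page.txt
git commit -q -m "display the user's name on the profile page"
page_commit=$(git rev-parse HEAD)

# Extract each piece onto its own branch, starting from main.
git checkout -q -b api-only main
git cherry-pick "$api_commit" > /dev/null

git checkout -q -b page-only main
git cherry-pick "$page_commit" > /dev/null

git log --format='%h %s' main..api-only
git log --format='%h %s' main..page-only
```

For dependent pull requests, the second branch would start from the first branch instead of from `main`.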
This is impossible without keeping a fine-grained commit history while you’re developing. Otherwise you have to go back and manually split commits, which is more time consuming. Incidentally, this is a pain point of mine when using Perforce and Mercurial. In those systems the “commit” is the smallest approvable unit, which you can amend by including all local changes or none. While they provide some support for splitting bigger changes into smaller changes post hoc, I’ve yet to do this confidently and easily, because you have to go down from the entire change back to hunks. Git commits group hunks into semantically meaningful (and named) units that go together when reorganized. In my view, when others say a benefit of git is that “branches are cheap,” simple and fast reorganization is the true benefit. “Cheap branching” is a means to that end.
A third kind of mistake is one missed even by review. Such errors make it to the master branch, and start causing bugs in production. Having a clean commit history with small commits is helpful here because it allows one to easily roll back the minimal bad change without rolling back the entire pull request it was contained in (though you can if you want).
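A sketch of such a targeted rollback with `git revert`, in a throwaway repository with invented file names and commit subjects: only the one bad commit is undone, while the commits before and after it stay in place.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "good" > good.txt
git add good.txt
git commit -q -m "good change"

echo "bad" > bad.txt
git add bad.txt
git commit -q -m "bad change"
bad=$(git rev-parse HEAD)

echo "more" > more.txt
git add more.txt
git commit -q -m "later good change"

# Add a new commit that undoes only the bad one; history is not rewritten.
git revert --no-edit "$bad" > /dev/null

git log --format='%h %s'
```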
Finally, the benefits of easy review are redoubled when looking at project history outside the context of a review. You can point new engineers who want to implement a feature to a similar previous feature, and rather than have them see all your code lumped into one blob, they can see the evolution of the feature in its cleanest and clearest form. Best practices, like testing and refactoring and when to make abstractions, are included in the story.
There is a famous quote by Hal Abelson, that “Programs must be written for people to read, and only incidentally for machines to execute.” The same view guides my philosophy for working with revisions, that code changes must be written for people to read and incidentally to change codebases. Now that you’ve had a nice groan, let me ask you to reflect on this hyperbole the next time you encounter a confusing code review.
So, what benefit does git provide over Mercurial? I have barely worked with Mercurial, but from some googling it seems to be equivalent in power to git.
For one, Mercurial has no staging area. That removes one level of the three-level hierarchy from my toolset. It’s hard to identify exactly when in my workflow this causes issues, but I’ve started to notice it. For example, it’s not possible to commit a hunk from my editor like I can with git and vim-gitgutter.
Mercurial also collapses all changes within a pull request (changeset) into a single commit. That removes the meaningful difference between the top level (pull request) and the mid level (commit) that I find helpful to narrate. There is some ability when working locally to create a bunch of commits like I would in git, and then later squash them all using hg histedit. But my reviewers can’t see the individual commits, nor can they be seen or reverted individually in the long term project history.
There are some other issues I have, but I can’t yet tell if they’re because I’m not well practiced enough at Mercurial, or whether it’s a strict limitation of the tool.
> Mercurial also collapses all changes within a pull request (changeset) into a single commit.
This can’t possibly be true since there’s no concept of a pull request in either Git or Mercurial. How you merge commits is up to you as the reviewer, and if anything, Git makes it easier to squash commits than Mercurial does.
Also, you seem to misunderstand what a changeset is in Mercurial. A changeset is a commit, it’s the same thing but a different name.
You’re right, this is an artifact of how Google does it. Git branches are a proxy for a changeset in practice.
Another thing. You mention the lack of the staging area, but its absence matters to you because you come from the Git world and are accustomed to Git-based workflows. Users of Mercurial use slightly different workflows (e.g. using hg commit -i), so the lack of the staging area is not only not a big deal but also feels like a more natural thing.
Of course, it’s a tautology that whatever you are familiar with “feels like a more natural thing.”
hg commit -i is an alternative, but I have yet to find a vim plugin that supports it (vim-signify has a closed issue on this topic). Moreover, the default curses interface on hg commit -i is pretty clunky. It doesn’t, e.g., let me see the context outside of the diff snippet (it can be hard to see what a hunk corresponds to semantically with only +/- 3 lines), or edit the diffs as I go.
I’ve been using Mercurial daily for 6 months now and my opinion hasn’t changed much. Mercurial (and Google’s version of it, for additional reasons not listed here) makes it harder for me to submit small changes than git. I’m certainly not “confused” 🙂
Thanks for writing about this. I’ve also spent a bit of time thinking about how git can help us present our code in a way that’s easier for a human to read. I agree with you that giving structure to a big change by breaking it down into pieces helps hugely, and that git gives us some good tools to achieve this.
I’ve written up my ideas, under the name ‘Literate Git’, at https://github.com/bennorth/literate-git if you’re interested. The tool I wrote turns a structured git history into an interactive web page. There’s an example there of how the ideas might work in a tutorial setting. After I gave a talk on this work, one of the people in the audience tried it with the Haskell LLVM tutorial: https://lukelau.me/kaleidoscope/
I like your point that the act of structuring and explaining their work helps the developer to be more confident of its correctness and clarity. A similar idea to “you only properly understand something once you can explain it to someone else”. I also like the idea of temporary commits, identified by a magic string, which are then stripped.
A friend of mine has done a good bit of work in the area you identify of splitting a piece of work into smaller, independent pieces, looking particularly at how much of this can be automated. His work is at https://github.com/aspiers/git-deps
There is one difference between your workflow and mine. Where you use hunk / commit / branch to give a hierarchy, I use nested (sub-)branches to allow the equivalent of section / subsection / paragraph in a written paper.
(And at the risk of starting a religious war, I work in Emacs and use the excellent Magit interface to git within Emacs. It makes working with hunks, commits, and interactive rebases very easy. It sounds like your vim-based setup gives good productivity also.)
Git history to webpage!! That idea is intriguing for so many reasons! I’ll definitely be checking that out. Thank you for telling me about it.
> For one, Mercurial has no staging area. That removes one level of the three-level
> hierarchy from my toolset. It’s hard to identify exactly when in my workflow this
> causes issues, but I’ve started to notice it. For example, it’s not possible to
> commit a hunk from my editor like I can with git and vim-gitgutter.
Ah, thanks, I see.
> Mercurial also collapses all changes within a pull request (changeset) into a
> single commit. That removes the meaningful difference between the top level (pull
> request) and the mid level (commit) that I find helpful to narrate. There is some
> ability when working locally to create a bunch of commits like I would in git, and
> then later squash them all using hg histedit. But my reviewers can’t see the
> individual commits, nor can they be seen or reverted individually in the long term
> project history.
Oh, that sucks for sure. However, I can’t help but wonder: the “merge/pull requests” functionality that you see on gitlab/github is purely a feature of the site. For bare git the concept of a pull/merge request doesn’t really exist. There is `git request-pull`, but from what I’ve read, it simply generates some text info.
If Mercurial has some poor analog to merge/pull requests implemented in its core, then I think it’s fair to say that they have more than git does. I.e. because git in its core doesn’t have that implemented.
>> For example, it’s not possible to commit a hunk
>> from my editor like I can with git and vim-gitgutter.
> Ah, thanks, I see.
In fact, there’s nothing preventing this; and also Mercurial comes with its own tool for interactive commit, hg commit -i, previously known as hg crecord.
> If Mercurial has some poor analog to merge/pull requests implemented
> in its core, then I think it’s fair to say that they have more than
> git does. I.e. because git in its core doesn’t have that implemented.
No, it does not. In fact, Jeremy has apparently been confused by the terminology mismatch between Git and Mercurial because a changeset is a Mercurial term which roughly means a commit. Or rather it is like this: a commit is a state of the repository at a given time, while a changeset is a transformation between one such state and the next one.
Omg. I’m sorry this reply is not in thread, there’s a bug on this site: I couldn’t post a comment, because the site was giving me “you need to fill out name/email” (despite me being logged in). I managed to circumvent it by logging out and in, but after that pressing “post comment” sent it to the top thread instead of the one I was replying to. Sorry.
Thanks, very well written!
I have pretty much the same git workflow, so it was almost like reading my own thoughts.
Another benefit of interactive add/rebase for me is being able to merge changes into master far more frequently. E.g., if I did some unrelated small fix or refactoring, it’s just a matter of cherry-picking it onto a separate branch and doing an interactive rebase afterwards.
I don’t remember the last time I had to resolve nontrivial merge conflicts with other people, despite working on some of my feature branches for months.
Very well put! I’ve been thinking about it recently, as I’m striving to do something similar (using commits to make PRs easier to review) but couldn’t sort out my thoughts and articulate the exact process of ‘telling a story using git history’ in quite so clear a way. The mapping of the three layers “hunk-commit-pull” to the three layers of commitment is enlightening! Thank you for writing this.
If I understand correctly, in order to tell a story using commits, you do have some commits that leave the project in a non-working state (an incomplete refactoring, for example). How do you deal with commands such as ‘git bisect’, which require all the commits to leave the project in a working, functioning state?
That’s a good question! In that case I would personally do a squash commit, but I know that conflicts with some organizations’ goal to ensure all commits are preserved from each branch. Those two desires seem to be in direct conflict.
A bit of quick reading on git-bisect shows the git team has thought of this case and has a means to work around it: https://git-scm.com/docs/git-bisect#_avoiding_testing_a_commit
So then one could jump forward to the working commit at the end of the pull request.
Have you tried looking at patch queues? I find them more powerful than rebasing (but I haven’t worked in Mercurial for a good while now).