At Google, our organization designs, owns, and maintains a number of optimization models that automate the planning of Google’s datacenter growth and health. As is pretty standard in supply chain optimization and planning, these models are often integer linear programs. It’s a core competency of operations research, after all.
One might think, “Large optimization problems? That sounds hard!” But it’s actually far from the hardest part of the job. In fact, it’s one of the few exciting parts of the job. The real hard part is getting data. Really, it’s that you get promised data that never materializes, and then you architect your system for features that rot before they ripen.
There’s a classic image of a human acting as if they’re throwing a ball for a dog, and the dog sprints off, only soon to realize the ball was never thrown. The ball is the promise of freshly maintained data, and recently I’ve been the dog.
When you don’t have good data, or you have data that’s bad in a known way, you can always try to design your model to accommodate for the deficiencies. As long as it’s clearly defined, it’s not beyond our reach. The math is fun and challenging, and I don’t want to shy away from it. My mathematician’s instinct pulls me left.
My instincts as an engineer pull me right: data issues will ultimately cause unexpected edge cases at the worst moment, and it will fall on me to spend all day debugging for a deadline tomorrow. Data issues lead to more complicated modeling features which further interact with other parts of the model and the broader system in confounding ways. Worst of all, it’s nearly impossible to signal problems to customers who depend on your output. When technical debt is baked into an optimization model as features, it makes explanation much harder. Accepting bad data also requires you write the code in a way that is easy to audit, since you need to audit literally everything. Transparency is good, but it’s tedious to do it generically well, and the returns are not worth it if the end result is, “well we can’t fix the data for two years anyway.”
Though a lot of this technical debt was introduced by predecessors who left the team, I’ve fallen for the mathematical siren’s call a few times. Go on, just add that slick new constraint. Just mask that misbehavior with a heuristic. It’s bit back hard and caused months of drag.
These days I’m swinging hard right on the pendulum. Delete half-implemented features that don’t have data to support them. Delete features that don’t have a clear connection to business needs (even if they work). Push back on new feature requests until the data exists. Require a point of contact and an SLO for any data you don’t own. Make speculative features easy to turn on/off (or remove!) without having to redesign the architecture. If it can’t be made easy to remove, don’t add it until you’re sure it will survive.
If you can’t evade bad data, err on the side of strict initial validation, and doing nothing or gracefully degrading service when validation fails. Expose the failures as alerts to the people who own the data (not you), and give the data owners a tool that repeats the validation logic in your system verbatim, so there is no confusion on the criteria for success. When you have this view, almost all of the complexity in your system lies in enabling this generic auditing, alerting, and managing of intricate (but ultimately arbitrary) policy.
I like to joke that I don’t have data-intensive applications or problems of scale, but rather policy-intensive applications. I haven’t found much insight from other software engineers about how to design and maintain policy-intensive software. Let me know if you have some! The obvious first step is to turn policy code into data. To the extent that we’ve done this, I adore that aspect of our systems. Still, you can’t avoid it when policies need to be encoded in an optimization model.
I do get sad that so much of my time is spent poop-smithing, as I like to say, even though we’re gradually getting better. Our systems need maintenance and care, and strong principles to keep the thicket from overwhelming us. For one, I track net lines of code added, with the goal to have it be net negative month over month, new features and all. We’ve kept it up for six months so far. Even our fixit week this week seems unnecessary, given how well our team has internalized paying off technical debt.
Though I do wonder what it’s all for. So Google can funnel the money it saves on datacenter costs into
informing people the Earth is flat cat videos? If I didn’t have two particular internal side projects to look forward to—they involve topics I’m very interested in—I’d be bored, and I might succumb to jaded feelings, and I’d need a change. Certain perks and particularly enjoyable colleagues help avoid that. But still, I rarely have time to work on the stimulating projects, and even the teammates I’ve been delegating to often defer it to other priorities.
We let dirty data interfere with our design and architecture, now we’re paying back all that technical debt, and as a consequence there’s no time for our human flourishing.
I should open a math cafe.
Well-written, and thanks for this. I do wonder, some issues of “dirty data” are likely unavoidable, and come with the territory when working with real-life systems. Is it true that other issues could be avoided with better foresight or project policies at Google? I hoped the work environment at Google would be the gold standard, partly because of the influence of the work there.
The talk of policy intensive data makes me wonder if something like a dependant type system might be useful, to have a sort of policies-as-types sort of setup. I work so far outside of the big data space though so maybe I’m missing something basic here.
Having first-class language support for policy decisions is an interesting idea, but somehow I don’t see it happening in any major languages. I wonder if a system like TLA+ could help, but I wouldn’t know how to model the outcomes that are being checked. To me the policies often feel arbitrary and based on human context that isn’t encoded in any system.
Well, yeah, it’s not a feature of major languages yet, but I hope it will be in the next few years.
The fact that you can’t explicitly write down the policies like that seems to be a bigger problem though. I get that getting them perfect right off would be hard, but shouldn’t getting it somewhat close and then iterating towards a correct description be possible?
I do wonder is that dirty data can hide some important information that your team is not able to appreciate. Raw data can contain useful information that at first seem unvaluable.