Production Pressures

It’s been a hot minute.

One of my favourite concepts from Normal Accidents is “production pressures.” The gist of it is that safety devices are sometimes counterbalanced by socio-economic pressures to do more or go faster. Charles Perrow describes a literal example of going faster with the introduction of shipboard radar: intended to improve maritime safety, it instead was used to “enable the navigators to prosecute their voyage with greater economical efficiency,” with no apparent impact on per-ship risk. Captains who would have previously dropped anchor in a dense fog could now use radar to push onward – not necessarily out of any intrinsic desire to do so, but because of the ship owner’s demands (or pressures).

Does any of this sound familiar?

Things in the world of high availability sure have changed, and I wonder if the counterbalance has also been hard at work. Ten years ago, a server (figuratively) exploding in the middle of the night was cause enough to page a sysadmin to assess the damage lest a severe outage take place. These days, cloud-based operators are unlikely to even notice a solitary VM failing because things like autoscaling groups and replica sets replace crashed servers without the need for human intervention. A new capability like this should probably have made our systems more reliable, and absent any big paradigm shifts, maybe it even did.

“Absent any big paradigm shifts” is a pretty big caveat because, as paradigm shifts go, we’ve had several. In particular: the ability to quickly recover from a hardware failure is only one of many capabilities expressed by the underlying ability to treat VMs and containers as fungible, replaceable, or ephemeral resources. We can do more than just replace a failed server quickly – we can do lots of different things quickly now! And as for the impacts on safety? The authors of Behind Human Error put it thusly:

…when change is undertaken to improve systems under pressure, the benefits of change may be consumed in the form of increased productivity and efficiency and not in the form of a more resilient, robust and therefore safer system.

Does any of that sound familiar?

The ability to ship more and ship faster may seem notionally good. The number of hours in a day hasn’t changed though, so per-widget attention must be further subdivided. That’s fine! You’re more productive than ever! Or at least, you are until something explodes. You’d better hope nothing else is on fire at the same time, which is a funny thing to hope for, seeing as those productivity gains have enabled you to operate twice as many widgets as before. That’s another reason why I like the term “production pressures”: while it canonically describes a stakeholder class demanding ever-increasing productivity, “pressure” is also a good word to describe the ever-increasing set of responsibilities and cognitive load that a software team is expected to compress down and fit into a regular (finite) workday.

I suppose this blag should offer some advice.

The first and most important piece is to recognize the existence of production pressures. That may seem underwhelming as advice goes, but consider how the future is essentially fiated into existence by people talking to one another. Naming production pressures gives you a framework for talking about things unambiguously. Saying “we always seem to be busy” can be a symptom of so many different things, and you don’t want a misdiagnosis like “just use Jira,” whereas articulating production pressures means you can have a conversation about an actual solution.

The second bit of advice will be familiar to some. Production pressures can encourage more automation and removal of humans from the system in the name of doing more (sometimes under the guise of increased reliability, too). These automated systems inevitably tend towards being complex, and bespoke versions are often tightly coupled to an organization’s particular workflow. Eschew complexity. Instead, aggressively pursue simple, decoupled systems, and design in operator intervention. This is perhaps a good tendency in general, but it serves especially well when facing production pressures. While this does not necessarily reduce the number of widgets a team has, it can provide more opportunities to absorb and respond to unintended system behaviours without those behaviours pulling production down with them.

Or don’t, I’m not the boss of you. If nothing else: learn to recognize production pressures and be intentional about how you respond to them.