On March 15th, 2023 we held the first iteration of an event that I've decided to call Systems In Motion, for need of a better but less-descriptive title than "Experienced Backend, Infra, SRE-ish, DevOps, and Engineering Managers Discussion Group." At some point I want to write more on how and why it came together, but for now you can see my draft manifesto here.
Our subject for the night was "When and How To Fail", though we also wandered into what failure is, and why we do it. I've done my best to increase the contrast on the photo, and I've listed some of the points I personally found most resonant below.
Points that stuck with me:
- Failure is many things. It is a process, not an instant. It's often helpful to think backwards from failure (an outcome) to the inputs that caused the failure.
- Failure is subjective. The same failure for two different organizations can yield vastly different responses, which tend to be rooted in user expectations. I found it useful to think about failures in terms of these two points, inputs and expectations.
- "Blamelessness" has been drained of its original meaning, and has roughly become table-stakes in sophisticated organizations. We should strive for a compassionate failure culture.
- "Safety Nets" as an alternative to "Guard Rails"
- In many of our contexts, some failures (immediate, terminal) are more desirable than others (slow data corruption). Just like we can build stable systems out of unstable components, we can build good UX on top of fast-failing products. These fast, small failures are also a great way to create an anti-fragile experience, which can eventually become a product moat. Manual "Escape Hatches" are a useful concept to think about.
- We try to measure and predict the "impact" of failure, but that is pretty vague. I came away from the discussion with a more decomposed model of "impact", which roughly reduced to reversibility, time-sensitivity, and blast radius.
- Engineers are always thinking about failure, to the point where we often assume failure. This may create a passive over-emphasis, but it also means that we have developed some of the most sophisticated organizational cultures around investigating failure. Engineering post-mortem review culture is something that other departments could benefit from.
- Organizations have operational functions (a bank both borrows and lends), and each function has potential failures that exist on both a predictability gradient and a risk gradient. As orgs grow, increasing both the number of functions and the users per-function, the types of risk that we must think about become less predictable (predictable failures get smoothed over), but the impact of failure for individual transactions moves lower on the risk gradient. One failure is less critical to the survival of a business, but in aggregate the failures are more expensive. This is for a variety of reasons, but two factors are that escape hatches become more expensive, and that expectations are higher. This might be a useful avenue to think about when picking apart why an organization is ossifying, which is a subject that was running through my mind as our discussion wrapped up.
For the millionth time, a sincere thank you to everyone who came out, I was deeply touched that so many folks were willing to take a chance on the evening. See you soon!