You can’t judge risk in hindsight

A while back, the good folks at Google SRE posted an article titled Lessons Learned from Twenty Years of Site Reliability Engineering. There’s some great stuff in here, but I wanted to pick on the first lesson: The riskiness of a mitigation should scale with the severity of the outage. Here are some excerpts from the article (emphasis mine):

Let’s start back in 2016, when YouTube was offering your favorite videos such as “Carpool Karaoke with Adele” and the ever-catchy “Pen-Pineapple-Apple-Pen.” YouTube experienced a fifteen-minute global outage, due to a bug in YouTube’s distributed memory caching system, disrupting YouTube’s ability to serve videos.
…
We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it’s meant to resolve. During the aforementioned YouTube outage, a risky load-shedding process didn’t fix the outage… it instead created a cascading failure.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

The question I had reading this was: how did the authors make the judgment that the load-shedding mitigation was risky? In particular, how was the risk of the mitigation perceived in the moment? Note: this question is still relevant, even if the authors/contributors were the actual responders!

When a bad outcome happens, it’s easy to say with hindsight that the action was risky. But we can really only judge the riskiness based on what was understood by the operators at the time they had to make the call. As the good Dr. Cook noted in the endlessly quotable How Complex Systems Fail, all practitioner actions are gambles:

After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I have no firsthand knowledge of this particular incident. But, just as nobody ever wakes up and says “I’m going to do a bad job today”, nobody wakes up and says “I’m going to take unnecessary risks today.” Doing operations work means making risk trade-offs under uncertainty. We generally don’t know in advance how risky a particular mitigation will be. I think the real lesson is to recognize the inherent challenge that operators face in these scenarios.

The problem with a root cause is that it explains too much

The recent performance of the stock market brings to mind the comment of a noted economist who was once asked whether the market is a good leading indicator of general economic activity. Wonderful, he replied sarcastically, it has predicted nine of the last four recessions. – Alfred L. Malabre Jr., 1968 March 4, The Wall Street Journal

In response to my previous post, Peter Ludemann made the following observation on Mastodon:

This post makes the case for why I would still call these contributors rather than root causes, even though they certainly sound root-cause-y. (They’re also fantastic examples of risks that are very common in the types of systems we work in, but that’s not the topic of this particular post).

Let’s take the first one, “a configuration system that makes mistakes easy.” I’d ask the question, “does an incident occur every single time somebody uses the configuration system?” I don’t know the details of the particular incident(s) that Peter is alluding to, but I’m willing to bet that this isn’t true. Rather, I assume what he is saying is that the configuration system is fundamentally unsafe in some way (e.g., it’s too easy to unintentionally take a dangerous action), and every once in a while a dangerous mistake would happen and an incident would occur.

What this means is that the unsafe configuration system by itself isn’t sufficient for the incident to occur! The config system enables incidents to occur, but it doesn’t, by itself, create the incident. Rather, it’s a combination of the configuration system and some other factors that triggers incidents. Maybe incidents only manifest when there is a particular action a user is trying to take, or maybe some people know how to work around the sharp edges and others don’t, or other things.

This may sound like sophistry. After all, the configuration system is an unsafe operator interface. The lesson from an incident is that we should fix it! However, here’s the problem with that line of thinking. The truth is that there are many types of these sorts of problems in a system. I like to call these problems vulnerabilities, even though people usually reserve that term for a security context. Peter gives three examples, but our systems are really shot through with these sorts of vulnerabilities. There are all sorts of unsafe operator interfaces, assumptions that have become invalidated with change, dangerous potential interactions between components, and so on. These vulnerabilities are the sorts of issues that the safety researcher James Reason referred to as latent pathogens. Reason is the one who proposed the Swiss cheese model, with the latent pathogens being the holes in the cheese.

My problem with labeling these vulnerabilities as root causes is that this obscures how our systems actually spend most of their time up, even though these vulnerabilities are always present. Let’s say you were able to identify every vulnerability you had in a system. If you label each one as a root cause of an outage, then your system should be down all of the time, because these vulnerabilities are all present in your system!

But your system isn’t down all of the time: in fact, it’s up more often than it’s down, even though these vulnerabilities are omnipresent. And the reason your system is up more than it’s down is that these vulnerabilities are not, by themselves, sufficient to take down a system. If you label these vulnerabilities as root causes, you make it impossible to understand how your system actually succeeds. And if you don’t know how it succeeds, you can’t understand how it fails. You’re like the economist predicting recessions that don’t happen.

Now, whether we label these vulnerabilities as root causes or not, they clearly represent a risk to your system. But we have an additional problem: we live in the adaptive universe. That means we don’t actually have the resources (in particular, the time) to identify and patch all of these vulnerabilities. And, even if we could stop the world, find them all, fix them all, and start the world again, our system keeps changing over time, and new vulnerabilities would set in. And that doesn’t even take into account how patching these vulnerabilities can create new ones. The adaptive universe also teaches us that our work will inevitably introduce new vulnerabilities because we only have a finite amount of time to actually do that work. Mistaking the general problem of finite resources for a set of problems with individual components is the component substitution fallacy.

In short, labeling vulnerabilities as root causes is dangerous because it blinds us to the nature of how complex systems manage to stay up and running most of the time, even though vulnerabilities within the system are always with us. Now, these vulnerabilities are still risks! However, they may or may not manifest as incidents. In addition, we can’t predict which ones will bite us, and we don’t have the resources to root all of them out. We use “this just bit us so we should address it because otherwise it will bite us again” as a heuristic, but it’s an implicit one. What we should be asking is “given that we have limited resources, is spending the time addressing this particular vulnerability worth the opportunity cost of delaying other work?”

The error term isn’t Pareto distributed

You’re probably familiar with the 80-20 rule: when 80% of the X stems from only 20% of the Y. For example, 80% of your revenue comes from only 20% of your customers, or 80% of the logs that you’re storing are generated from only 20% of the services. Talk to anybody who is looking to reduce cloud costs in their organization, and chances are they’re attacking the problem by looking for the 20% of services that are generating 80% of the costs, rather than trying to reduce cloud usage uniformly across all services.

Not all phenomena follow the 80-20 rule, but it’s common enough in the systems we encounter that it’s a good rule of thumb. The technical term for it is the Pareto principle, and distributions that exhibit this 80-20 phenomenon are examples of Pareto distributions, which are a kind of power law distribution.

A common implicit assumption is that availability problems are Pareto-distributed. If you look at incidents and keep track of their causes, you should be able to identify a small number of causes that lead to the majority of the incidents. Because of this, we should attribute a cause to each incident, and then look to see which causes most often contribute to incidents in order to identify the interventions that will have the largest impact: the 20% of improvement work that should yield 80% of the benefit. If you believe in the RCA (root cause analysis) model of incidents, that’s a reasonable assumption to make: identify the root cause of each incident, track these across incidents, and then invest in projects that attack the most expensive root causes.

[Figure: If we can identify the problematic red dots, we can achieve significant improvements]
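
To make the RCA-style bookkeeping described above concrete, here is a minimal sketch in Python. It tallies incidents by attributed root cause and reports how much of the total the top 20% of causes account for. The function name top_share and every number in the tally are my own inventions for illustration; they are not data from any real organization.

```python
from collections import Counter


def top_share(tally: Counter, fraction: float = 0.2) -> float:
    """Fraction of all incidents attributed to the top `fraction` of causes."""
    ranked = sorted(tally.values(), reverse=True)
    k = max(1, round(fraction * len(ranked)))  # always look at at least one cause
    return sum(ranked[:k]) / sum(ranked)


# HYPOTHETICAL tally of incidents by attributed root cause (invented data).
incidents_by_cause = Counter({
    "config change": 45,
    "bad deploy": 35,
    "expired certificate": 4,
    "DNS": 3,
    "capacity": 3,
    "dependency outage": 3,
    "clock skew": 2,
    "disk full": 2,
    "network partition": 2,
    "unknown": 1,
})

share = top_share(incidents_by_cause)
print(f"top 20% of causes account for {share:.0%} of incidents")
```

This invented tally was constructed to look Pareto-like, with two of the ten causes covering 80 of the 100 incidents. That is the shape the RCA model expects to find, not something the code demonstrates about real incidents.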

But if you’re a frequent reader of this blog, you know there’s an alternative model of how incidents come to be. I’m fond of referring to the alternative as the LFI (learning from incidents) model. However, in the safety science research community this alternative model is more commonly associated with terms such as the New View, the New Look, or Safety-II.

The contrast of the LFI model with the RCA model is captured well in Richard Cook’s famous monograph, How Complex Systems Fail:

Post-accident attribution to a ‘root cause’ is fundamentally wrong.

Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident.

In this model, incidents don’t happen because of a single cause: rather, it’s through the interaction of multiple contributors.

[Figure: If incidents stem from problematic interactions rather than problematic components, then focusing on components will lead us astray]

Under an incident model where incidents are a result of interactions, we wouldn’t expect there to be a Pareto distribution of causes. This means that if we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents. By looking at the components rather than the interactions, we’re unlikely to see much of any pattern at all.
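
Here is a toy illustration of that point in Python. The incident log is entirely hypothetical, and I deliberately contrived it so that every contributor appears equally often. It shows that this can happen, not that real incident data must look this way. The same log is tallied twice: once by individual contributor, and once by pair of contributors (the interaction).

```python
from collections import Counter
from itertools import combinations

# HYPOTHETICAL incident log: each incident lists the contributors involved.
# The letters are placeholder names, and the data is contrived for illustration.
incidents = [
    {"A", "B"}, {"A", "B"}, {"A", "B"},
    {"C", "D"}, {"C", "E"}, {"C", "F"},
    {"D", "E"}, {"D", "F"}, {"E", "F"},
]

# Tally 1: how many incidents did each individual contributor show up in?
by_component = Counter(c for incident in incidents for c in incident)

# Tally 2: how many incidents did each pair of contributors show up in together?
by_interaction = Counter(
    pair
    for incident in incidents
    for pair in combinations(sorted(incident), 2)
)

print("by component:  ", sorted(by_component.items()))
print("by interaction:", by_interaction.most_common(3))
```

Counted by component, all six contributors are tied at three incidents each, so nothing stands out. Counted by interaction, the A-B pairing accounts for three of the nine incidents while every other pairing appears once.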

Turning back to How Complex Systems Fail again:

Complex systems are heavily and successfully defended against failure.

The high consequences of failure lead over time to the construction of multiple layers of defense against failure. These defenses include obvious technical components (e.g. backup systems, ‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents.

All of these explicit and implicit components of the system work together to keep things up and running. You can think of this as a process that most of the time generates safety (or availability), but sometimes doesn’t. You can think of these failure cases as a sort of error or residual term: they’re the leftovers, the weird cases at the edges of our system. I think treating incidents as an error term is a useful metaphor because we don’t fall into the trap of thinking about error terms as looking like Pareto distributions.

This doesn’t mean that there aren’t patterns of failure in our incidents: there absolutely are. But it means that the patterns we need to look for aren’t going to be visible if we don’t ask the right questions. It’s the difference between asking “which services were involved?” and “what were the goal conflicts that the engineers were facing?”

Green is the color of complacency

Here are a few anecdotes about safety from the past few years.

In 2020, the world was struck by the COVID-19 pandemic. The U.S. response was… not great. In 2019, before the pandemic struck, the Johns Hopkins Center for Health Security released a pandemic preparedness assessment that ranked 195 countries on how well prepared they were to deal with a pandemic. The U.S. was ranked number one: it was identified as the most well-prepared country on earth.

With its pandemic playbook, “The U.S. was very well prepared,” said Eric Toner, senior scholar at the Johns Hopkins Center for Health Security. “What happened is that we didn’t do what we said we’d do. That’s where everything fell apart. We ended up being the best prepared and having one of the worst outcomes.”

On October 29, 2018, Lion Air Flight 610 crashed 13 minutes after takeoff, killing everyone on board. This plane was a Boeing 737 MAX, and a second 737 MAX had a fatal crash a few months later. Seven days prior to the Lion Air crash, the National Safety Council presented the Boeing Company with the Robert W. Campbell Award for leadership in safety:

“The Boeing Company is a leader in one of the most safety-centric industries in the world,” said Deborah A.P. Hersman, president and CEO of the National Safety Council. “Its innovative approaches to EHS excellence make it an ideal recipient of our most prestigious safety award. We are proud to honor them, and we appreciate their commitment to making our world safer.”

On April 20th, 2010, an explosion on the Deepwater Horizon offshore drilling rig killed eleven workers and led to the largest marine oil spill in the history of the industry. The year before, the U.S. Minerals Management Service issued its SAFE award to Deepwater Horizon:

MMS issued its SAFE award to Transocean for its performance in 2008, crediting the company’s “outstanding drilling operations” and a “perfect performance period.” Transocean spokesman Guy Cantwell told ABC News the awards recognized a spotless record during repeated MMS inspections, and should be taken as evidence of the company’s longstanding commitment to safety.

When things are going badly, everybody in the org knows it. If you go into an organization where high-severity incidents are happening on a regular basis, where everyone is constantly in firefighting mode, then you don’t need metrics to tell you how bad things are: it’s obvious to everyone, up and down the chain. The problems are all-too-visible. Everybody can feel them viscerally.

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk. As the examples above show, an absence of incidents does not indicate an absence of risk. In fact, these quiet times can lull us into a sense of complacency, leading us to think that we’re in a good spot, when the truth is that there’s a significant risk that’s hidden beneath the surface.

Personally, I don’t believe it’s even possible to say with confidence that “everything is ok right now”. As the cases above demonstrate, when things are quiet, there’s a limit to how well we can actually assess the risk based on the kinds of data we traditionally collect.

So, should you be worried about your system? If you find yourself constantly in firefighting mode, then, yes, you should be worried. And if things are running smoothly, and the availability metrics are all green? Then, also yes, you should be worried. You should always be worried. The next major incident is always just around the corner, no matter how high your ranking is, or how many awards you get.

The perils of outcome-based analysis

Imagine you wanted to understand how to get better at playing the lottery. You strike upon a research approach: study previous lottery winners! You collect a list of winners, look them up, interview them about how they go about choosing their numbers, collate this data, identify patterns, and use these to define strategies for picking numbers.

The problem with this approach is that it doesn’t tell you anything about how effective these strategies actually are. To really know how well these strategies work, you’d have to look at the entire population of people who employed them. For example, say that you find that most lottery winners use their birthdays to generate winning numbers. It may turn out that for every winning ticket that has the ticket holder’s birthday, there are 20 million losing tickets that also have the ticket holder’s birthday. To understand a strategy’s effectiveness, you can’t just look at the winning outcomes: you have to look at the losing outcomes as well. The technical term for this type of analytic error is selecting on the dependent variable.
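
A small Python sketch of the birthday example may help. The ticket and winner counts are hypothetical; I made them up so that the win rate matches the one-in-20-million figure above. The Strategy class and the “quick pick” label are likewise just names I chose for illustration. The code compares what you learn from interviewing winners with what you learn from looking at every ticket.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    tickets: int  # all tickets picked this way, winners and losers alike
    winners: int  # how many of those tickets won


# HYPOTHETICAL numbers, invented for illustration only.
strategies = [
    Strategy("birthdays", tickets=60_000_000, winners=3),
    Strategy("quick pick", tickets=20_000_000, winners=1),
]

total_winners = sum(s.winners for s in strategies)

for s in strategies:
    share_of_winners = s.winners / total_winners  # what interviewing winners shows
    win_rate = s.winners / s.tickets              # what you actually want to know
    print(f"{s.name}: {share_of_winners:.0%} of winners, win rate {win_rate:.1e}")
```

With these made-up numbers, three quarters of the winners used birthdays, yet both strategies win at exactly the same rate. Birthdays dominate the winners list only because more people play them.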

Here’s another example of this error in reasoning: according to the NHTSA, 32% of all traffic crash fatalities in the United States involve drunk drivers. That means that 68% of all traffic crash fatalities involve sober drivers. If you only look at scenarios that involve crash fatalities, it looks like being sober is twice as dangerous as being drunk! It’s a case of only looking at the dependent variable: crash fatalities. If we were to look at all driving scenarios, we’d see that there are a lot more sober drivers than drunk drivers, and that any given sober driver is less likely to be involved in a fatal crash than a given drunk driver. Being sober is safer, even though sober drivers appear more often in fatal accidents than drunk drivers.
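
The arithmetic can be sketched in a few lines of Python. The 32% figure is the one quoted above. The share of all driving that is done drunk is not something I have a source for, so the 1% used here is a made-up placeholder, chosen only to show how the base rate flips the conclusion.

```python
# Share of crash fatalities involving drunk drivers (the NHTSA figure quoted above).
DRUNK_SHARE_OF_FATALITIES = 0.32

# ASSUMPTION: hypothetical share of all driving that is done drunk.
# This number is invented for illustration and is not an NHTSA statistic.
DRUNK_SHARE_OF_DRIVING = 0.01


def relative_risk(share_of_fatalities: float, share_of_driving: float) -> float:
    """Fatality rate per unit of drunk driving divided by the rate for sober driving."""
    drunk_rate = share_of_fatalities / share_of_driving
    sober_rate = (1 - share_of_fatalities) / (1 - share_of_driving)
    return drunk_rate / sober_rate


# Looking only at fatalities: how much more often sober drivers show up.
naive_ratio = (1 - DRUNK_SHARE_OF_FATALITIES) / DRUNK_SHARE_OF_FATALITIES

# Looking at all driving: how much riskier drunk driving is per unit of driving.
risk = relative_risk(DRUNK_SHARE_OF_FATALITIES, DRUNK_SHARE_OF_DRIVING)

print(f"sober/drunk ratio among fatalities: {naive_ratio:.1f}")
print(f"drunk/sober relative risk per unit of driving: {risk:.1f}")
```

Among fatalities alone, sober drivers appear roughly twice as often as drunk ones. Under the assumed 1% base rate, though, drunk driving comes out dozens of times riskier per unit of driving. The exact multiple depends entirely on the base rate you plug in, which is precisely the information you throw away by looking only at crashes.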

Now, imagine an organization that holds a weekly lottery. But it’s a bizarro-world type of lottery: if someone wins, then they receive a bad outcome instead of a good one. And the bad outcome doesn’t just impact the “winner” (although they are impacted the most), it has negative consequences for the entire organization. Nobody would willingly participate in such a lottery, but everyone in the organization is required to: you can’t opt out. Every week, you have to buy a ticket, and hope the numbers you picked don’t come up.

The organization wants to avoid these negative outcomes, and so they try to identify patterns in how previous lottery “winners” picked their numbers, so that they can reduce the likelihood of future lottery wins by warning people against using these dangerous number-picking strategies.

At this point, the comparison to how we treat incidents should be obvious. If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

The real-world case is even worse than the lottery case: lotteries really are random, but the way that people do their work isn’t; rather, it’s adaptive. People do work in specific ways because they have found that it’s an effective way to get stuff done given the constraints that they are under. The only way to really understand why people work the way they do is to understand how those adaptations usually succeed. Unless you deliberately go looking for it, you aren’t going to learn how people develop successful adaptations, because you will only ever examine the adaptations when they fail. And then you’re just doing the moral equivalent of asking what lottery winners have in common.