You’re probably familiar with the 80-20 rule: when 80% of the X stems from only 20% of the Y. For example, 80% of your revenue comes from only 20% of your customers, or 80% of the logs that you’re storing are generated by only 20% of the services. Talk to anybody who is looking to reduce cloud costs in their organization, and chances are they’re attacking the problem by looking for the 20% of services that are generating 80% of the costs, rather than trying to reduce cloud usage uniformly across all services.

Not all phenomena follow the 80-20 rule, but it’s common enough in the systems we encounter that it’s a good rule of thumb. The technical term for it is the Pareto principle, and distributions that exhibit this 80-20 phenomenon are examples of Pareto distributions, which are a type of power law distribution.

A common implicit assumption is that availability problems are Pareto-distributed. If you look at incidents and keep track of their causes, you should be able to identify a small number of causes that lead to the majority of the incidents. Because of this, we should attribute a cause to each incident, and then look to see which causes most often contribute to incidents in order to identify the interventions that will have the largest impact: those 20% of improvements that should yield 80% of the benefit. If you believe in the RCA (root cause analysis) model of incidents, that’s a reasonable assumption to make: identify the root cause of each incident, track these across incidents, and then invest in projects that attack the most expensive root causes.

If we can identify the problematic red dots, we can achieve significant improvements
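
To make that RCA-style tally concrete, here is a minimal sketch of what the analysis might look like. The function name, the cause labels, and the incident counts are all hypothetical, made up purely for illustration; the sketch also assumes that every incident has been assigned exactly one cause, which is the RCA assumption itself.

```python
import math
from collections import Counter

def pareto_share(causes, top_fraction=0.2):
    """Return the fraction of incidents attributed to the most frequent causes.

    `causes` is a list with one attributed cause label per incident.
    `top_fraction` is the fraction of distinct causes to keep (rounded up).
    """
    counts = Counter(causes)
    if not counts:
        return 0.0
    # Number of distinct causes that make up the "top" group (at least one).
    k = max(1, math.ceil(top_fraction * len(counts)))
    top = counts.most_common(k)
    return sum(n for _, n in top) / len(causes)

# Hypothetical data: 20 incidents spread over 5 distinct causes.
incidents = (
    ["config push"] * 16
    + ["cert expiry", "dns", "disk full", "bad deploy"]
)

share = pareto_share(incidents)
print(f"top 20% of causes account for {share:.0%} of incidents")
```

With data shaped like this, one cause out of five accounts for 16 of the 20 incidents, which is exactly the kind of result the Pareto assumption leads us to expect.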

But if you’re a frequent reader of this blog, you know there’s an alternative model of how incidents come to be. I’m fond of referring to the alternative as the LFI (learning from incidents) model. However, in the safety science research community this alternative model is more commonly associated with terms such as the New View, the New Look, or Safety-II.

The contrast of the LFI model with the RCA model is captured well in Richard Cook’s famous monograph, How Complex Systems Fail:

Post-accident attribution to a ‘root cause’ is fundamentally wrong.

Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident.

In this model, incidents don’t happen because of a single cause: rather, they happen through the interaction of multiple contributors.

If incidents stem from problematic interactions rather than problematic components, then focusing on components will lead us astray

Under an incident model where incidents are a result of interactions, we wouldn’t expect there to be a Pareto distribution of causes. This means that if we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents. Because we’d be looking at the components rather than the interactions, we’re unlikely to see much of any pattern at all.
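
A toy simulation can illustrate the contrast, with a big caveat: it bakes in its own assumptions rather than demonstrating anything about real systems. In this sketch, the single-cause model draws one cause per incident with weights proportional to 1/rank (an arbitrary choice that makes a few causes dominate), while the interaction model draws several distinct contributors per incident uniformly at random. The component count, incident count, and number of contributors per incident are all made-up parameters.

```python
import random
from collections import Counter

def top_share(counts, n_labels, top_fraction=0.2):
    """Share of all tallies held by the most frequent `top_fraction` of labels."""
    k = max(1, round(top_fraction * n_labels))
    top = counts.most_common(k)
    return sum(n for _, n in top) / sum(counts.values())

def simulate(n_components=50, n_incidents=500, contributors_per_incident=3, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    components = list(range(n_components))

    # Single-cause model (assumption): one cause per incident, drawn with
    # weights proportional to 1/rank so that a few causes dominate.
    weights = [1 / (rank + 1) for rank in components]
    single = Counter(rng.choices(components, weights=weights, k=n_incidents))

    # Interaction model (assumption): each incident involves several distinct
    # contributors, drawn uniformly at random.
    interacting = Counter()
    for _ in range(n_incidents):
        interacting.update(rng.sample(components, contributors_per_incident))

    return (
        top_share(single, n_components),
        top_share(interacting, n_components),
    )

single_share, interaction_share = simulate()
print(f"single-cause model: top 20% of causes -> {single_share:.0%} of incidents")
print(f"interaction model: top 20% of contributors -> {interaction_share:.0%} of involvements")
```

Under these assumptions, the top fifth of causes in the single-cause model accounts for well over half of the incidents, while the top fifth of contributors in the interaction model ends up only modestly above its proportional 20% share. Tallying by component finds a signal in the first case and close to nothing in the second.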

Turning back to How Complex Systems Fail:

Complex systems are heavily and successfully defended against failure.

The high consequences of failure lead over time to the construction of multiple layers of defense against failure. These defenses include obvious technical components (e.g. backup systems, ‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents.

All of these explicit and implicit components of the system work together to keep things up and running. You can think of this as a process that most of the time generates safety (or availability), but sometimes doesn’t. You can think of these failure cases as a sort of error or residual term: they’re the leftovers, the weird cases at the edges of our system. I think treating incidents as an error term is a useful metaphor because we don’t fall into the trap of thinking about error terms as looking like Pareto distributions.

This doesn’t mean that there aren’t patterns of failure in our incidents: there absolutely are. But it means that the patterns we need to look for aren’t going to be visible if we don’t ask the right questions. It’s the difference between asking “which services were involved?” and “what were the goal conflicts that the engineers were facing?”

