Action item template

We’re thrilled that you want to contribute to improving the system in the wake of an incident! For each post-incident action that you are proposing, we would appreciate it if you would fill out the following template.

Please estimate the expected benefits associated with implementing the action item. For example, if this reduces risk, by how much? Please document your risk model. How will you validate this estimate?

Please estimate the costs associated with implementing the proposed action items. In particular:

  • What are the costs in engineering effort (person-days of work) to do the initial implementation?
  • What are the ongoing maintenance costs in terms of engineering effort?
  • What are the additional infrastructure costs?

In addition, please estimate the opportunity costs associated with this action item: if this action item is prioritized, what other important work will be deprioritized as a result? What were the expected benefits of the deprioritized work? How do these unrealized benefits translate into additional costs or risks?
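To make the trade-off concrete, here’s a minimal sketch of the kind of back-of-the-envelope comparison these questions are asking for. Every number, rate, and name in it is invented for illustration; your own risk model will look different, and the point is to write your assumptions down, not to trust the arithmetic.

```python
# Hypothetical back-of-the-envelope comparison of an action item's expected
# benefit against its costs. All figures below are made up for illustration.

ENGINEER_DAY_COST = 1_000  # assumed fully loaded cost of one person-day, in dollars

def expected_annual_benefit(incidents_per_year, avg_incident_cost, risk_reduction):
    """Expected annual savings if the action item reduces incident risk.

    risk_reduction is the estimated fractional reduction in incident
    probability (e.g. 0.3 means "we believe this cuts the risk by 30%").
    """
    return incidents_per_year * avg_incident_cost * risk_reduction

def first_year_cost(initial_person_days, maintenance_person_days_per_year,
                    infra_cost_per_year, opportunity_cost_per_year):
    """Rough first-year cost: the initial build plus one year of maintenance,
    infrastructure, and deprioritized-work (opportunity) cost."""
    return (initial_person_days * ENGINEER_DAY_COST
            + maintenance_person_days_per_year * ENGINEER_DAY_COST
            + infra_cost_per_year
            + opportunity_cost_per_year)

# Invented example: an action item we believe cuts the risk of a $50k incident
# (expected about twice a year) by 30%.
benefit = expected_annual_benefit(incidents_per_year=2,
                                  avg_incident_cost=50_000,
                                  risk_reduction=0.3)
cost = first_year_cost(initial_person_days=15,
                       maintenance_person_days_per_year=5,
                       infra_cost_per_year=3_000,
                       opportunity_cost_per_year=10_000)
print(f"expected annual benefit: ${benefit:,.0f}, first-year cost: ${cost:,.0f}")
```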

Given that we know we can never implement things perfectly (otherwise the incident wouldn’t have happened, right?), what are the risks associated with a bug or other error when implementing the proposed action item?

Even if the action item is implemented flawlessly, the resulting change in behavior can lead to unforeseen interactions with other parts of the system. Please generate a list of potential harmful interactions that could arise when this action item is implemented. Please be sure to track these and refer back to them if a future incident occurs that involves this action item, to check how well we are able to reason about such interactions.

More generally: will the proposed action item increase or decrease the overall complexity of the system? If it will increase complexity, estimate the costs and/or risks of the resulting increase in complexity, and compare these to the proposed benefits of the implemented action item.

Will the proposed action item increase or decrease the overall cognitive load on people? If it will increase cognitive load, please estimate the expected magnitude of this increase, and document a plan for evaluating the actual increase after the action item has been implemented.

Beyond cognitive load, is this action going to prevent or otherwise make more difficult any work that goes on today? How will you identify whether this is the case or not? Please document your plan for measuring the resultant increase in difficulty due to the action item.

More generally: will the implementation of this action item lead to people changing the way they do their work? What sort of workarounds or other adaptations may occur as a result, and what are the associated risks of these? What are all of the different types of work that go on in the organization that will be impacted by this change? How will you verify that your list is complete? Please document your plan for studying how the work has actually changed in response to this action item, and how you will contrast the findings with your expectations.

Once the action item has been completed, how are you going to track whether it unexpectedly contributes to an incident in the future? Please outline your plan for how we will maintain accountability for the impact of completed action items.

You can’t judge risk in hindsight

A while back, the good folks at Google SRE posted an article titled Lessons Learned from Twenty Years of Site Reliability Engineering. There’s some great stuff in there, but I wanted to pick on the first lesson: The riskiness of a mitigation should scale with the severity of the outage. Here are some excerpts from the article (emphasis mine):

Let’s start back in 2016, when YouTube was offering your favorite videos such as “Carpool Karaoke with Adele” and the ever-catchy “Pen-Pineapple-Apple-Pen.” YouTube experienced a fifteen-minute global outage, due to a bug in YouTube’s distributed memory caching system, disrupting YouTube’s ability to serve videos.

We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it’s meant to resolve. During the aforementioned YouTube outage, a risky load-shedding process didn’t fix the outage… it instead created a cascading failure.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

The question I had reading this was: how did the authors make the judgment that the load-shedding mitigation was risky? In particular, how was the risk of the mitigation perceived in the moment? Note: this question is still relevant, even if the authors/contributors were the actual responders!

When a bad outcome happens, it’s easy to say with hindsight that the action was risky. But we can really only judge the riskiness based on what was understood by the operators at the time they had to make the call. As the good Dr. Cook noted in the endlessly quotable How Complex Systems Fail, all practitioner actions are gambles:

After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I have no firsthand knowledge of this particular incident. But, just as nobody ever wakes up and says “I’m going to do a bad job today”, nobody wakes up and says “I’m going to take unnecessary risks today.” Doing operations work means making risk trade-offs under uncertainty. We generally don’t know in advance how risky a particular mitigation will be. I think the real lesson is to recognize the inherent challenge that operators face in these scenarios.

The problem with a root cause is that it explains too much

The recent performance of the stock market brings to mind the comment of a noted economist who was once asked whether the market is a good leading indicator of general economic activity. Wonderful, he replied sarcastically, it has predicted nine of the last four recessions. – Alfred L. Malabre Jr., 1968 March 4, The Wall Street Journal

In response to my previous post, Peter Ludemann made the following observation on Mastodon:

This post makes the case for why I would still call these contributors rather than root causes, even though they certainly sound root-cause-y. (They’re also fantastic examples of risks that are very common in the types of systems we work in, but that’s not the topic of this particular post).

Let’s take the first one, “a configuration system that makes mistakes easy.” I’d ask the question, “does an incident occur every single time somebody uses the configuration system?” I don’t know the details of the particular incident(s) that Peter is alluding to, but I’m willing to bet that this isn’t true. Rather, I assume what he is saying is that the configuration system is fundamentally unsafe in some way (e.g., it’s too easy to unintentionally take a dangerous action), and every once in a while a dangerous mistake would happen and an incident would occur.

What this means is that the unsafe configuration system by itself isn’t sufficient for the incident to occur! The config system enables incidents to occur, but it doesn’t, by itself, create the incident. Rather, it’s a combination of the configuration system and some other factors that triggers incidents. Maybe incidents only manifest when there is a particular action a user is trying to take, or maybe some people know how to work around the sharp edges and others don’t, or other things.

This may sound like sophistry. After all, the configuration system is an unsafe operator interface. The lesson from an incident is that we should fix it! However, here’s the problem with that line of thinking. The truth is that there are many types of these sorts of problems in a system. I like to call these problems vulnerabilities, even though that term is usually reserved for security contexts. Peter gives three examples, but our systems are really shot through with these sorts of vulnerabilities. There are all sorts of unsafe operator interfaces, assumptions that have been invalidated by change, dangerous potential interactions between components, and so on. These vulnerabilities are the sorts of issues that the safety researcher James Reason referred to as latent pathogens. Reason is the one who proposed the Swiss cheese model, with the latent pathogens being the holes in the cheese.

My problem with labeling these vulnerabilities as root causes is that this obscures how our systems actually spend most of their time up, even though these vulnerabilities are always present. Let’s say you were able to identify every vulnerability you had in a system. If you label each one as a root cause of an outage, then your system should be down all of the time, because these vulnerabilities are all present in your system!

But your system isn’t down all of the time: in fact, it’s up more often than it’s down, even though these vulnerabilities are omnipresent. And the reason your system is up more than it’s down is that these vulnerabilities are not, by themselves, sufficient to take down a system. If you label these vulnerabilities as root causes, you make it impossible to understand how your system actually succeeds. And if you don’t know how it succeeds, you can’t understand how it fails. You’re like the economist predicting recessions that don’t happen.

Now, whether we label these vulnerabilities as root causes or not, they clearly represent a risk to your system. But we have an additional problem: we live in the adaptive universe. That means we don’t actually have the resources (in particular, the time) to identify and patch all of these vulnerabilities. And, even if we could stop the world, find them all, and fix them all, and start the world again, our system keeps changing over time, and new vulnerabilities would set in. And that doesn’t even take into account how patching these vulnerabilities can create new ones. The adaptive universe also teaches us that our work will inevitably introduce new vulnerabilities because we only have a finite amount of time to actually do that work. Mistaking problems with individual components for the general problem of finite resources is the component substitution fallacy.

In short, labeling vulnerabilities as root causes is dangerous because it blinds us to the nature of how complex systems manage to stay up and running most of the time, even though vulnerabilities within the system are always with us. Now, these vulnerabilities are still risks! However, they may or may not manifest as incidents. In addition, we can’t predict which ones will bite us, and we don’t have the resources to root all of them out. We use “this just bit us, so we should address it because otherwise it will bite us again” as a heuristic, but it’s an implicit one. What we should be asking is “given that we have limited resources, is spending the time addressing this particular vulnerability worth the opportunity cost of delaying other work?”

The error term isn’t Pareto distributed

You’re probably familiar with the 80-20 rule: when 80% of the X stems from only 20% of the Y. For example, 80% of your revenue comes from only 20% of your customers, or 80% of the logs that you’re storing are generated by only 20% of the services. Talk to anybody who is looking to reduce cloud costs in their organization, and chances are they’re attacking the problem by looking for the 20% of services that are generating 80% of the costs, rather than trying to reduce cloud usage uniformly across all services.
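As a concrete illustration of that approach, here’s a small sketch that sorts services by monthly spend and reports the smallest set of services accounting for roughly 80% of the total. The service names and dollar amounts are invented:

```python
# Hypothetical example: find the few services responsible for ~80% of cloud spend.
monthly_costs = {  # invented service names and dollar amounts
    "video-encoder": 410_000, "search-index": 95_000, "web-frontend": 60_000,
    "recommendations": 45_000, "billing": 12_000, "metrics": 9_000,
    "admin-console": 4_000, "feature-flags": 2_500, "status-page": 1_500,
    "internal-wiki": 1_000,
}

total = sum(monthly_costs.values())
running, top_services = 0, []
for service, cost in sorted(monthly_costs.items(), key=lambda kv: kv[1], reverse=True):
    running += cost
    top_services.append(service)
    if running >= 0.8 * total:
        break

print(f"{len(top_services)} of {len(monthly_costs)} services "
      f"({len(top_services) / len(monthly_costs):.0%}) account for "
      f"{running / total:.0%} of spend: {top_services}")
```

With these made-up numbers, three of the ten services cover almost 90% of the spend, which is exactly the kind of concentration the 80-20 rule describes.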

Not all phenomena follow the 80-20 rule, but it’s common enough in the systems we encounter that it’s a good rule of thumb. The technical term for it is the Pareto principle, and distributions that exhibit this 80-20 phenomenon are examples of Pareto distributions, also known as power law distributions.

A common implicit assumption is that availability problems are Pareto-distributed: if you look at incidents and keep track of their causes, you should be able to identify a small number of causes that lead to the majority of the incidents. Because of this, we should attribute a cause to each incident, and then look to see which causes most often contribute to incidents in order to identify the interventions that will have the largest impact: the 20% of improvements that should yield 80% of the benefit. If you believe in the RCA (root cause analysis) model of incidents, that’s a reasonable assumption to make: identify the root cause of each incident, track these across incidents, and then invest in projects that attack the most expensive root causes.

If we can identify the problematic red dots, we can achieve significant improvements

But if you’re a frequent reader of this blog, you know there’s an alternative model of how incidents come to be. I’m fond of referring to the alternative as the LFI (learning from incidents) model. However, in the safety science research community this alternative model is more commonly associated with terms such as the New View, the New Look, or Safety-II.

The contrast of the LFI model with the RCA model is captured well in Richard Cook’s famous monograph, How Complex Systems Fail:

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.

Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident.

In this model, incidents don’t happen because of a single cause: rather, it’s through the interaction of multiple contributors.

If incidents stem from problematic interactions rather than problematic components, then focusing on components will lead us astray

Under an incident model where incidents are a result of interactions, we wouldn’t expect to see a Pareto distribution of causes. If we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as the source of a large number of incidents. By looking at the components instead of the interactions, we’re unlikely to see much of any pattern at all.
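To see why, here’s a toy simulation (not based on any real incident data) in which an incident only occurs when several independent latent vulnerabilities happen to be active at the same time. Counting how often each component “appears in” an incident then yields a roughly flat distribution, with no single component standing out:

```python
# Toy simulation: incidents require the interaction of multiple latent faults.
# All parameters are invented for illustration.
import random
from collections import Counter

random.seed(0)
NUM_COMPONENTS = 20      # each component carries one latent vulnerability
P_ACTIVE = 0.05          # chance a given vulnerability is "active" in a week
FAULTS_NEEDED = 3        # an incident needs at least this many active at once
WEEKS = 50_000

involvement = Counter()
incidents = 0
for _ in range(WEEKS):
    active = [c for c in range(NUM_COMPONENTS) if random.random() < P_ACTIVE]
    if len(active) >= FAULTS_NEEDED:
        incidents += 1
        involvement.update(active)   # credit every component involved in this incident

print(f"incidents: {incidents} in {WEEKS} weeks")
for component, count in involvement.most_common(5):
    print(f"component {component}: involved in {count / incidents:.0%} of incidents")
```

Under these assumptions, every component ends up involved in a similar share of incidents, so ranking components by “incidents caused” tells you very little about where to intervene.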

Turning back to How Complex Systems Fail again:

Complex systems are heavily and successfully defended against failure.

The high consequences of failure lead over time to the construction of multiple layers of defense against failure. These defenses include obvious technical components (e.g. backup systems, ‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents.

All of these explicit and implicit components of the system work together to keep things up and running. You can think of this as a process that most of the time generates safety (or availability), but sometimes doesn’t. You can think of these failure cases as a sort of error or residual term: they’re the leftovers, the weird cases at the edges of our system. I think treating incidents as an error term is a useful metaphor because we don’t fall into the trap of expecting error terms to look like Pareto distributions.

This doesn’t mean that there aren’t patterns of failure in our incidents: there absolutely are. But it means that the patterns we need to look for aren’t going to be visible if we don’t ask the right questions. It’s the difference between asking “which services were involved?” and “what were the goal conflicts that the engineers were facing?”

Green is the color of complacency

Here are a few anecdotes about safety from the past few years.

In 2020, the world was struck by the COVID-19 pandemic. The U.S. response was… not great. In 2019, before the pandemic struck, the Johns Hopkins Center for Health Security released a pandemic preparedness assessment that ranked 195 countries on how well prepared they were to deal with a pandemic. The U.S. was ranked number one: it was identified as the most well-prepared country on earth.

With its pandemic playbook, “The U.S. was very well prepared,” said Eric Toner, senior scholar at the Johns Hopkins Center for Health Security. “What happened is that we didn’t do what we said we’d do. That’s where everything fell apart. We ended up being the best prepared and having one of the worst outcomes.”

On October 29, 2018, Lion Air Flight 610 crashed 13 minutes after takeoff, killing everyone on board. This plane was a Boeing 737 MAX, and a second 737 MAX had a fatal crash a few months later. Seven days prior to the Lion Air crash, the National Safety Council presented the Boeing Company with the Robert W. Campbell Award for leadership in safety:

“The Boeing Company is a leader in one of the most safety-centric industries in the world,” said Deborah A.P. Hersman, president and CEO of the National Safety Council. “Its innovative approaches to EHS excellence make it an ideal recipient of our most prestigious safety award. We are proud to honor them, and we appreciate their commitment to making our world safer.”

On April 20th, 2010, an explosion on the Deepwater Horizon offshore drilling rig killed eleven workers and led to the largest marine oil spill in the history of the industry. The year before, the U.S. Minerals Management Service issued its SAFE award to Deepwater Horizon:

MMS issued its SAFE award to Transocean for its performance in 2008, crediting the company’s “outstanding drilling operations” and a “perfect performance period.” Transocean spokesman Guy Cantwell told ABC News the awards recognized a spotless record during repeated MMS inspections, and should be taken as evidence of the company’s longstanding commitment to safety.

When things are going badly, everybody in the org knows it. If you go into an organization where high-severity incidents are happening on a regular basis, where everyone is constantly in firefighting mode, then you don’t need metrics to tell you how bad things are: it’s obvious to everyone, up and down the chain. The problems are all too visible. Everybody can feel them viscerally.

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk. As the examples above show, an absence of incidents does not indicate an absence of risk. In fact, these quiet times can lull us into a sense of complacency, leading us to think that we’re in a good spot, when the truth is that there’s significant risk hidden beneath the surface.

Personally, I don’t believe it’s even possible to say with confidence that “everything is OK right now”. As the cases above demonstrate, when things are quiet, there’s a limit to how well we can actually assess the risk based on the kinds of data we traditionally collect.

So, should you be worried about your system? If you find yourself constantly in firefighting mode, then, yes, you should be worried. And if things are running smoothly, and the availability metrics are all green? Then, also yes, you should be worried. You should always be worried. The next major incident is always just around the corner, no matter how high your ranking is, or how many awards you get.

The perils of outcome-based analysis

Imagine you wanted to understand how to get better at playing the lottery. You strike upon a research approach: study previous lottery winners! You collect a list of winners, look them up, interview them about how they go about choosing their numbers, collate this data, identify patterns, and use these to define strategies for picking numbers.

The problem with this approach is that it doesn’t tell you anything about how effective these strategies actually are. To really know how well these strategies work, you’d have to look at the entire population of people who employed them. For example, say that you find that most lottery winners use their birthdays to generate winning numbers. It may turn out that for every winning ticket that has the ticket holder’s birthday, there are 20 million losing tickets that also have the ticket holder’s birthday. To understand a strategy’s effectiveness, you can’t just look at the winning outcomes: you have to look at the losing outcomes as well. The technical term for this type of analytic error is selecting on the dependent variable.

Here’s another example of this error in reasoning: according to the NHTSA, 32% of all traffic crash fatalities in the United States involve drunk drivers. That means that 68% of all traffic crash fatalities involve sober drivers. If you only look at scenarios that involve crash fatalities, it looks like being sober is twice as dangerous as being drunk! It’s a case of only looking at the dependent variable: crash fatalities. If we were to look at all driving scenarios, we’d see that there are many more sober drivers than drunk drivers, and that any given sober driver is far less likely to be involved in a fatal crash than a given drunk driver. Being sober is safer, even though sober drivers appear more often in fatal crashes than drunk drivers.
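A quick back-of-the-envelope calculation makes the base-rate point explicit. The trip counts and per-trip risks below are invented purely to reproduce the 32%/68% split quoted above; they are not real statistics:

```python
# Illustrative base-rate arithmetic with invented numbers. Suppose 2% of trips
# are driven drunk and 98% sober, and drunk driving is far riskier per trip.
# Even then, sober drivers dominate the fatality counts.
trips_drunk, trips_sober = 2_000_000, 98_000_000    # hypothetical trip counts
p_fatal_drunk, p_fatal_sober = 8e-6, 3.5e-7         # hypothetical per-trip risk

fatal_drunk = trips_drunk * p_fatal_drunk
fatal_sober = trips_sober * p_fatal_sober
total = fatal_drunk + fatal_sober

print(f"share of fatal crashes involving drunk drivers: {fatal_drunk / total:.0%}")
print(f"share involving sober drivers:                  {fatal_sober / total:.0%}")
print(f"per-trip risk ratio (drunk vs sober):           "
      f"{p_fatal_drunk / p_fatal_sober:.0f}x")
# Looking only at fatal crashes (the dependent variable) hides the fact that,
# in this made-up example, the per-trip risk is ~23x higher when drunk.
```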

Now, imagine an organization that holds a weekly lottery. But it’s a bizarro-world type of lottery: if someone wins, then they receive a bad outcome instead of a good one. And the bad outcome doesn’t just impact the “winner” (although they are impacted the most); it has negative consequences for the entire organization. Nobody would willingly participate in such a lottery, but everyone in the organization is required to: you can’t opt out. Every week, you have to buy a ticket and hope the numbers you picked don’t come up.

The organization wants to avoid these negative outcomes, and so they try to identify patterns in how previous lottery “winners” picked their numbers, so that they can reduce the likelihood of future lottery wins by warning people against using these dangerous number-picking strategies.

At this point, the comparison to how we treat incidents should be obvious. If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

The real-world case is even worse than the lottery case: lotteries really are random, but the way that people do their work isn’t; rather, it’s adaptive. People do work in specific ways because they have found that it’s an effective way to get stuff done given the constraints that they are under. The only way to really understand why people work the way they do is to understand how those adaptations usually succeed. Unless you’re really looking for it, you aren’t going to be able to learn how people develop successful adaptations if you only ever examine the adaptations when they fail. Otherwise, you’re just doing the moral equivalent of asking what lottery winners have in common.

The problem with invariants is that they change over time

Cliff L. Biffle blogged a great write-up of a debugging odyssey at Oxide with the title Who killed the network switch? Here’s the bit that jumped out at me:

At the time that code was written, it was correct, but it embodied the assumption that any loaned memory would fit into one region.

That assumption became obsolete the moment that Matt implemented task packing, but we didn’t notice. This code, which was still simple and easy to read, was now also wrong.

This type of assumption is an example of an invariant, a property of the system that is supposed to be guaranteed to not change over time. Invariants play an important role in formal methods (for example, see the section Writing an invariant in Hillel Wayne’s Learn TLA+ site).

Now, consider the following:

  • Our systems change over time. In particular, we will always make modifications to support new functionality that we could not have foreseen earlier in the lifecycle of the system.
  • Our code often rests on a number of invariants, properties that are currently true of our system and that we assume will always be true.
  • These invariants are implicit: the assumptions themselves are not explicitly represented in the source code. That means there’s no easy way to, say, mechanically extract them via static analysis.
  • A change that violates an assumed invariant can happen arbitrarily far away from the code that depends on the invariant to function properly.

What this means is that these kinds of failure modes are inevitable. If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change. If your system lives long enough, it’s going to change. And one of those changes is eventually going to invalidate an assumption that somebody made long ago, which was a reasonable assumption to make at the time.
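Here’s a contrived sketch of the pattern (it has nothing to do with the actual Oxide code): a helper written under the assumption that a record always fits in a single buffer region, and a later, far-away change that silently violates that assumption:

```python
# Contrived illustration of an implicit invariant. The original author assumed
# every record fits in a single fixed-size region; nothing in the code states
# or checks that assumption.

REGION_SIZE = 256  # bytes per region

def read_record(regions: list[bytes], index: int) -> bytes:
    # Implicit invariant: record `index` lives entirely inside region `index`.
    return regions[index]

# ...much later, in a different part of the codebase, someone adds support for
# larger records by splitting them across two adjacent regions. The writer is
# updated, read_record is not, and it now silently returns half a record.
def write_record(regions: list[bytes], index: int, payload: bytes) -> None:
    if len(payload) <= REGION_SIZE:
        regions[index] = payload
    else:
        regions[index] = payload[:REGION_SIZE]       # spills into the next region,
        regions[index + 1] = payload[REGION_SIZE:]   # violating read_record's assumption
```

Nothing in read_record states the assumption, so nobody reviewing the write_record change had any local cue that they were breaking it.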

Implicit invariants are, by definition, impossible to enforce explicitly. They are time bombs. And they are everywhere.

When there’s no gemba to go to

I’m finally trying to read through some Toyota-related books to get a better understanding of the lean movement. Not too long ago, I read Shigeo Shingo’s Non-Stock Production: The Shingo System of Continuous Improvement, and sitting on my bookshelf for a future read is James Womack, Daniel Jones, and Daniel Roos’s The Machine That Changed the World: The Story of Lean Production.

The Toyota-themed book I’m currently reading is Mike Rother’s Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results. Rother often uses the phrase “go and see”, as in “go to the shop floor and observe how the work is actually being done”. I’ve often heard lean advocates use a similar phrase, go to the gemba, although Rother himself doesn’t use it in his book. There’s a good overview at the Lean Enterprise Institute’s web page for gemba:

Gemba (現場) is the Japanese term for “actual place,” often used for the shop floor or any place where value-creating work actually occurs. It is also spelled genba. Lean Thinkers use it to mean the place where value is created. Japanese companies often supplement gemba with the related term “genchi gembutsu” — essentially “go and see” — to stress the importance of empiricism.

The idea of focusing on understanding work-as-done is a good one. Unfortunately, in software development in particular, and knowledge work in general, the place that the work gets done is distributed: it happens wherever the employees are sitting in front of their computers. There’s no single place, no shop floor, no gemba that you can go to in order to go and see the work being done.

Now, you can observe the effects of the work, whether it’s artifacts generated (pull requests, docs), or communication (slack messages, emails). And you can talk to people about the work that they do. But, it’s not like going to the shop floor. There is no shop floor.

And it’s precisely because we can’t go to the gemba that incident analysis can bring so much value: it allows you to conduct a miniature research project that tries to achieve the same goal. You get granted some time (a scarce resource!) to reconstruct what happened, by talking to people and looking at the work products generated over time. If we’re good at this, and we’re lucky, we can get a window into how the real work happens.

What if everybody did everything right?

In the wake of an incident, we want to answer the questions “What happened?” and, afterwards, “What should we do differently going forward?” Invariably, this leads to people trying to answer the question “what went wrong?”, or, even more specifically, the two questions:

  • What did we do wrong here?
  • What didn’t we do that we should have?

There’s an implicit assumption behind these questions: because there was a bad outcome, there must have been a bad action (or an absence of a good action) that led to that outcome. It’s such a natural conclusion to reach that I’ve only ever seen it questioned by people who have been exposed to concepts from resilience engineering.

In some sense, this belief that bad outcomes stem from bad actions is like Aristotle’s claim that heavier objects fall faster than lighter ones. Intuitively, it seems obvious, but our intuitions lead us astray. But in another sense, it’s quite different, because it’s not something we can test by running an experiment. Instead, the idea that systems fail because somebody did something wrong (or didn’t do something right) is more like a lens or a frame: it’s a perspective, a way of making sense of the incident. It’s like how the fields of economics, psychology, and sociology act as different lenses for making sense of the world: a sociological explanation of a phenomenon (say, the First World War) will be different from an economic explanation, and we will get different insights from the different lenses.

An alternative lens for making sense of an incident is to ask the question “how did this incident happen, assuming that everybody did everything right?” In other words, assume that everybody whose actions contributed to the incident made the best possible decision based on the information they had, and the constraints and incentives that were imposed upon them.

Looking at the incident from this perspective will yield very different kinds of insights, because it will generate different types of questions, such as:

  • What information did people know in the moment?
  • What were the constraints that people were operating under?

Now, I personally believe that the second perspective is strictly superior to the first, but I acknowledge that this is a judgment based on personal experience. However, even if you think the first perspective also has merit, if you truly want to maximize the amount of insight you get from a post-incident analysis, then I encourage you to try the second perspective as well. Ask: “Let’s assume everybody did everything right. How could this incident still have happened?” I guarantee you’ll learn something new about your system that you didn’t know before.

Book Review: Trust in Numbers

Trust in Numbers: The Pursuit of Objectivity in Science and Public Life by Theodore Porter, Distinguished Professor Emeritus of History, UCLA.

There are two general approaches to decision-making. One way is to make a judgment call. Informally, you could call this “trusting your gut”; formally, you could describe it as a subjective, implicit process. The other way is to use an explicit approach that relies on objective, quantitative data: for example, doing a return-on-investment (ROI) calculation on a proposed project to decide whether to undertake it. We use the term rigorous to describe these types of approaches, and we generally regard them as superior.

Here, Porter argues that quantitative, rigorous decision-making in a field is not a sign of its maturity, but rather of its political weakness. In fields where technical professionals enjoy a significant amount of trust, these professionals make decisions using personal judgment. While professionals will use quantitative data as input, their decisions are ultimately based on their own subjective impressions. (For example, see Julie Gainsburg’s notion of skeptical reverence in The Mathematical Disposition of Structural Engineers.) In Porter’s account, we witnessed an increase in rigorous decision-making approaches in the twentieth century because of a lack of trust in certain professional fields, not because the quantitative approaches yielded better results.

It’s only in fields where the public does not grant deference to professionals that they are compelled to use explicit, objective processes to make decisions. They are forced to show their work in a public way because they aren’t trusted. In some cases, a weak field adopts rigor to strengthen itself in the eyes of the public, such as experimental psychology’s adoption of experimental rigor (in particular, in ESP research). Most of the case studies in the book come from areas where a field was compelled to adopt objective approaches because there was explicit political pressure and the field did not have sufficient power to resist.

In some cases, professionals did have the political clout to push back. An early chapter of the book discusses a problem that the British parliament wrestled with in the late nineteenth century: unreliable insurance companies that would happily collect premiums but would eventually fail and hence be unable to pay out when their customers submitted claims. A parliamentary committee formed and heard testimony from actuaries about how the government could determine whether an insurance company was sound. The experienced actuaries from reputable companies argued that it was not possible to define an objective procedure for assessing the soundness of a company. They insisted that “precision is not attainable through actuarial methods. A sound company depends on judgment and discretion.” They were concerned that a mechanical, rule-based approach wouldn’t work:

Uniform rules of calculation, imposed by the state, might yield “uniform errors.” Charles Ansell, testifying before another select committee a decade earlier, argued similarly, then expressed his fear that the office of government actuary would fall to “some gentlemen of high mathematical talents, recently removed from one of our Universities, but without any experience whatever, though of great mathematical reputation.” This “would not qualify him in any way whatever for expressing a sound opinion on a practical point like that of the premiums in a life assurance.”

Trust in Numbers, pp. 108-109

Porter tells a similar story about American accountants. To stave off having standardized rules imposed on them, the American Institute of Accountants defined standards for its members, but these were controversial. One accountant, Walter Wilcox, argued in 1941 that “Cost is not a simple fact, but is a very elusive concept… Like other aspects of accounting, costs give a false impression of accuracy.” Similarly, when it came to government-funded projects, the political pressure was too strong to simply defer to government civil engineers, such as the French civil engineers who had to help decide which rail projects should be funded, or the U.S. Army Corps of Engineers, who had to help make similar decisions about waterway projects such as dams and reservoirs. In the U.S., they settled on a cost-benefit analysis process, where the return on investment had to exceed 1.0 in order to justify a project. But, unsurprisingly, there were conflicts over how benefits were quantified, as well as over how to classify costs. While the output may have been a number, and the process was ostensibly objective (because it needed to be), ultimately these numbers were negotiable, and assessments changed as a function of political factors.

In education, teachers were opposed to standardized testing, but did not have the power to overcome it. On the other hand, doctors were able to retain the use of their personal judgment for diagnosing patients. However, regulators had sufficient power to enforce the use of objective measures for evaluating drugs, and hence were able to oversee some aspects of medical practice.

This tug of war between rigorous, mechanical objectivity and élite professional autonomy continues to this day. Professionals say “This requires private knowledge; trust us”. Sometimes, the public says “We don’t trust you anymore. Make the knowledge public!”, and the professionals have no choice but to relent. On the subject of whether we are actually better off when we trade away judgment for rigor, Porter is skeptical. I agree.