Written by a group of leading error researchers, this was by no means an "easy read": it required significant attention to fully understand the arguments and implications. But I found it very rewarding and almost immediately applicable to my day-to-day work within a very large enterprise IT organization that is rapidly shifting to the *-as-a-Service delivery model. The book uses a number of fascinating (and occasionally terrifying) real-world examples from aviation, medicine, nuclear power, the space program, and other domains to show that attributing a failure to "human error" does nothing to explain why the failure happened, nor does it generally lead to constructive responses or improvements.
Below are 3 ideas from the book that I found particularly useful and insightful.
#1: Goal Conflicts
"Perhaps the most common hazard in the analysis of incidents is the naive assessment of the strategic issues that confront practitioners."
When investigating a failure, it is crucial to recognize that system operators are often dealing with multiple competing goals. Operators must regularly assess and resolve these goal conflicts by making trade-off decisions that necessarily involve risk, often under time pressure. Operators are normally able to balance these conflicting goals and risks skillfully and successfully as part of their daily routine. Failures occur when the risks are unsuccessfully balanced, but that does not mean the operators lacked skill: in many cases, the actions that led to failure were the exact same actions that previously led to success. As investigators, we need to fully understand these goal conflicts, both to avoid hindsight bias and to improve the team's strategies for assessing and balancing risks. It is also important to understand that a team's strategies for balancing risks must evolve as the operating context, and the related goals and risks, change.

Putting this into practice, my team recently held a postmortem discussion about an outage that involved unplanned changes to a production system. It was enlightening to discuss our goal conflicts and our decision-making process around unplanned changes. We determined that eliminating unplanned changes was not the right course of action; in fact, unplanned changes are sometimes essential. Instead, we refined our decision-making process for unplanned changes to the production system, and, acknowledging that our operating context is subject to change, we set a checkpoint to re-assess the process three months later.
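To make this concrete, here is a minimal, hypothetical sketch (not from the book; the class and field names are my own invention) of what a refined decision process for unplanned changes might look like if encoded as a pre-change checklist that forces the competing goal and the accepted risk to be stated explicitly:

```python
from dataclasses import dataclass

@dataclass
class UnplannedChangeRequest:
    """Hypothetical pre-change checklist for an unplanned production change."""
    description: str
    goal_served: str      # which competing goal motivates the change
    risk_accepted: str    # which risk is knowingly taken on in exchange
    rollback_plan: bool = False
    second_reviewer: bool = False

    def approved(self) -> bool:
        # A change proceeds only when the trade-off is stated explicitly
        # and the change is reversible and reviewed.
        return bool(self.goal_served and self.risk_accepted
                    and self.rollback_plan and self.second_reviewer)

req = UnplannedChangeRequest(
    description="Restart stuck payment worker",
    goal_served="restore service quickly",
    risk_accepted="brief double-processing window",
    rollback_plan=True,
    second_reviewer=True,
)
print(req.approved())  # True: trade-off stated, change is reversible and reviewed
```

The point of such a sketch is not the mechanics but the framing: it treats the goal conflict as a first-class input to the decision rather than something discovered only in the postmortem.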
#2: Distancing through Differencing
"Do not discard other events because they appear on the surface to be dissimilar. At some level of analysis, all events are unique; while at other levels of analysis, they reveal common patterns."
"The obstacles to learning from failure are nearly as complex and subtle as the circumstances that surround a failure itself. Because accidents always involve multiple contributors, the decision to focus on one or another of the set, and therefore what will be learned, is largely socially determined."
I was fascinated by one of the book's case studies: a chemical fire that occurred during routine machine maintenance in a high-tech product manufacturing plant in the US. The company took safety very seriously, with good working conditions, significant investment in safety, and strong motivation to examine all accidents promptly and thoroughly.
"The manufacturer had an extensive safety program that required immediate and high-level responses to an incident such as this, even though no personal injury occurred and damage was limited to the machine involved. High-level management directed immediate investigations, including detailed debriefings of participants, reviews of corporate history for similar events, and a “root cause” analysis. Company policy required completion of this activity within a few days and formal, written notification of the event and related findings to all other manufacturing plants in the company. The cost of the incident may have been more than a million dollars."
The company's investigation of this accident focused on the machine, the maintenance procedures, and the operators who performed the maintenance, and it identified multiple deficiencies that were corrected quickly. The fascinating part of this case study is that a broader review by outside investigators found that a very similar chemical fire had occurred earlier that same year in one of the company's manufacturing plants in another country, and that this prior event was well known to practitioners at the US plant. Both the practitioners and the internal investigators had considered the prior event irrelevant because it occurred in a non-US plant with a different fire-containment system and involved a different model of the machine. Later, the accident occurred again in the US plant, this time during a different shift, and this third event was rationalized as being due to the lower skill level of the workers on that shift. The authors use the term "distancing through differencing" for the tendency of organizations and individuals to distance themselves from failures ("that could never happen here"). My takeaway is that the many teams now providing cloud services within enterprise IT organizations such as my own have a great opportunity to share the details of each other's failures, look for the general patterns, and avoid repeating incidents that have already occurred in other services.
#3: Design-induced failures
"Automation surprises begin with miscommunication and misassessments between the automation and users, which lead to a gap between the user’s understanding of what the automated systems are set up to do, what they are doing, and what they are going to do."
The book contains several chapters devoted to the ways in which the design of the computer systems operators use can itself induce failures. These chapters detail several aspects of this issue, which is vitally important to enterprise IT as both a technology provider and a technology consumer. Among the many points raised is that automation often introduces new burdens on the same operators it is intended to assist. I have seen this principle in action when teams automate manual tasks but do so in a way that does not give users and operators sufficient feedback to understand what is going on when the automation fails. This is an example of automation written without regard for its users, and it can add significant complexity and brittleness to the system.
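As a hedged illustration of that feedback principle (the function, host name, and log messages here are hypothetical, not from the book), an automation step can narrate what it plans to do, what it actually did, and how it failed, so operators are not left guessing at the system's state:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("provisioner")

def provision(host: str, dry_run: bool = True) -> str:
    """Hypothetical automation step that reports its state transitions,
    so operators can tell what it is doing, has done, and will do next."""
    log.info("plan: will configure %s (dry_run=%s)", host, dry_run)
    if dry_run:
        log.info("skipped: dry run, no changes applied to %s", host)
        return "skipped"
    try:
        # ... real configuration work would go here ...
        log.info("done: %s configured", host)
        return "configured"
    except Exception:
        # On failure, say what state the host may be in, not just that it failed.
        log.exception("failed: %s left in unknown state; manual check needed", host)
        raise

status = provision("db-01", dry_run=True)
```

The design choice being sketched is simply that every branch, including the failure path, tells the operator something actionable about system state, rather than doing its work silently until it breaks.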
- Paperback: 292 pages
- Publisher: CRC Press; 2nd edition (August 28, 2010)
- Language: English
- ISBN-10: 0754678342
- ISBN-13: 978-0754678342