How to learn from failure


Reading time: 13 minutes

Below are my personal notes on Amy Edmondson's excellent article Strategies for Learning from Failure. It's a long read, but I highly recommend it over my notes as it goes into a lot more detail than I have covered here.

Summary

Not all failures are the same, and categorising them can make a big difference to how well we learn from them.

Why should testers care?

Because we deal with software failure all the time, we have a tendency to forget the human cost of failures: how the failure occurred (the team), how it affects the users, and the outcome for the business. This article is a great introduction to how we can learn from failure ourselves, and then how we can enable our teams and businesses to learn from it too, by reframing errors as different types of failure.

[Organisations] that catch, correct, and learn from failure before others do will succeed

Amy Edmondson

Amy classifies failures into three categories:

  • Preventable
  • Complex
  • Intelligent

But we have a tendency to view all failures as one type. In software testing we group them into different levels of risk, but generally all failures are treated as errors: something isn't right and should be avoided. We've started to try and learn from them, but the interdisciplinary teams needed to do so are often seen as a cost too high to pay, so it doesn't happen very often. I think if we focused our efforts on investigating complex failures, we could use the learnings to start minimising preventable issues and stop some of them happening altogether.

How should we respond to failure?

Some people believe that responding constructively to failures could give rise to an anything-goes attitude: if people aren't blamed for failures, how else will they try as hard as possible to do their best work? But this attitude tends to make people avoid failure and, in some cases, cover it up.

What we actually need is a culture that makes it safe to admit and report failure (so we can learn from it), coexisting with high standards for performance (so we can use that learning to get better).

The blame game

If people see failure as something to be avoided, you end up in the blame game, which has a spectrum of reasons for failure ranging from blameworthy to praiseworthy:

[Figure: the blame game, a spectrum of reasons for failure from blameworthy to praiseworthy]

🤔 Notice how the blameworthy reasons are about individuals, but the praiseworthy ones are all about the work and its context.

I wonder how many times people blame not others but themselves for a failure, and hence keep quiet or downplay issues when they occur?

To embrace failure we need to classify it better than the catch-all term "failure" encourages. Amy Edmondson suggests these three categories: preventable, complex and intelligent failures.

Preventable

  • These are usually found in routine tasks that are well defined and the outcomes are well understood
  • Preventable failures tend to occur when we deviate from this routine
  • In software engineering, certain routine tasks can and should be automated, such as build processes and specific types of checks (see the sketch after this list)
  • If they do need to be performed manually, then task lists and checklists are well suited to these types of tasks
    • Note: exploratory testing falls under intelligent failures
  • Failures which result from these types of tasks can usually be mitigated through a better understanding of the work we do, how we do it and, most importantly, why
  • When we spot these types of failures (deviation from the routine) we should immediately address them
  • This is in part about stopping errors from being passed down the process and building quality in
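
As a small illustration of what automating one of these routine checks might look like, here is a minimal sketch using Python and pytest. The health endpoint URL and the expected response shape are hypothetical placeholders, not anything from the article:

```python
# Minimal sketch: turn a routine, well-understood post-deploy check into an
# automated one so it cannot be skipped or mis-remembered. The URL and the
# expected response shape are hypothetical.
import requests


def test_service_reports_healthy():
    # Routine check: the deployed service reports itself as healthy.
    response = requests.get("https://example.internal/health", timeout=5)
    assert response.status_code == 200
    assert response.json().get("status") == "ok"
```

Running something like this as part of the build means the deviation is caught immediately rather than passed downstream.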

Complex failures

  • Many systems we work in are complex and too big for any one person and in most cases even groups of people to fully understand
  • This means complex systems can be unpredictable and ambiguous and fail in ways we could not have anticipated
  • The way in which complex failures occur can in some cases be traced to things all happening in just the right way
  • But assuming failures will never occur is counterproductive, and we should build into the process ways to handle things when they go wrong
  • When complex failures do occur we should recognise them as such and investigate them in a praiseworthy way, to understand all the components that led to the failure and to identify whether any of the smaller issues involved can be made preventable
    • For example: most accidents in hospitals result from a series of small failures that went unnoticed and unfortunately lined up in just the wrong way.

Intelligent failures

  • Named by the Duke University professor of management Sim Sitkin as intelligent failures
  • These are the failures that occur during experimentation
  • They help you understand what works and what doesn’t
    • And importantly quickly
  • These are situations where the answers are not knowable in advance
  • The only way you can find out is to actually do it
  • Exploratory testing is all about raising awareness of intelligent failures
  • As Amy Edmondson calls them they are failures at the frontier
    • Situations that haven’t happened before
    • Or maybe won’t happen again
  • For software engineering this is a lot of the work that we are doing
    • Hence agile software development so we can adapt to the changing environment
    • To do things in a way that helps you learn from your work
    • We should be producing lots of intelligent failures that help us learn about the system we’re building, the people that use it and the domain in which it is used
    • Exploratory testing is all about exploring a system and seeing in what ways it can fail to better understand how it works

Small experiments over Big Bang experiments

At the frontier, the right kind of experimentation produces good failures quickly. Managers who practice it can avoid the unintelligent failure of conducting experiments at a larger scale than necessary.

Trial and failure?

“Trial and error” is a common term for the kind of experimentation needed in these settings, but it is a misnomer, because “error” implies that there was a “right” outcome in the first place.

Tolerance of failure

We need to be able to accept complex and intelligent failures and understand that doing so does not mean mediocrity. Tolerance is actually something we need in order to learn from these types of failures. The problem with failure is that there is almost always an emotional element to it, so it needs leadership to enable the learning that has to happen.

How do you learn from failure?

Leaders should insist that their organizations develop a clear understanding of what happened—not of “who did it”—when things go wrong.

This requires consistently:

  • reporting failures, small and large;
  • systematically analysing them; and
  • proactively searching for opportunities to experiment.

Anyone doing experimental work needs to clearly understand that the faster we fail, the faster we will succeed. Most people don't grasp this subtle but important concept.

  • The quicker things fail the quicker you can pivot or try another idea that can succeed
  • But the longer that failure takes the longer you are executing on an idea that will not help your objective
  • What is the opportunity cost of working on one thing and not the other?

Some people approach experimental work as if it were well defined and understood, like production-line work where you need to produce the same thing over and over.

For example, statistical process control, which uses data analysis to assess unwarranted variances, is not good for catching and correcting random invisible glitches such as software bugs.


In a typical software team this would be predefined test cases or automated checks

There are three main ways to learn from failure: detection, analysis, and experimentation.

Detection

We need to detect issues and make them visible earlier in our processes, before they become bigger issues later on.

Don’t shoot the messenger

Unfortunately, a lot of people are reluctant to raise issues early in the process for all manner of reasons, the biggest being an unwillingness to take the interpersonal risk of speaking up.

One of the best ways to combat this is for management to lead by example: not only encouraging issues, no matter how small, to be raised earlier in the process, but also applauding the people who raise them and having a system in place to act on them.

Another issue is the human tendency not to admit failure because of the stigma attached to it: “it failed, therefore I’ve failed”. So people keep going, hoping things will get better when they should have admitted failure, or worse, they haven’t even realised they’ve failed because of inadequate measures or goals at the start.

Changing the stigma around failure is one way to improve the situation, for example by holding failure parties to encourage the reporting of failures and help people look at them in another way.

Example of how other organisations detect errors

Through speaking up, supported by management. From Amy Edmondson:

In researching errors and other failures in hospitals, I discovered substantial differences across patient-care units in nurses’ willingness to speak up about them. It turned out that the behavior of midlevel managers—how they responded to failures and whether they encouraged open discussion of them, welcomed questions, and displayed humility and curiosity—was the cause. I have seen the same pattern in a wide range of organizations.

Building quality in

The idea of the andon cord from the Toyota Production System does just this: noticing small deviations in the process and correcting them there and then, to constantly improve the system.

For software engineering this is all about building quality into the process instead of inspecting for it at the end. Inspecting at the end is almost too late to make a difference, because of the increased cost in time and cognitive load to make the change. This usually ends in discussions such as “users are never going to notice X”, “no one is ever going to do Y” or “let’s see if it becomes a problem first”.
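
One way software teams approximate the andon cord is with a gate that stops the pipeline on the first small deviation, rather than inspecting everything at the end. A rough sketch, where the specific commands (ruff, pytest) are just stand-ins for whatever checks your team runs:

```python
# Rough "andon cord" sketch: run small checks in order and stop the line on
# the first deviation, instead of inspecting everything at the end. The
# commands below are stand-ins for your own checks.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],      # lint: catch small deviations early
    ["pytest", "-x", "tests/"],  # tests: -x stops at the first failure
]

for cmd in CHECKS:
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Pull the cord: halt so the deviation is fixed now,
        # not passed downstream.
        sys.exit(result.returncode)
```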

Analysis

Once failures have been detected it is important to not just look at the symptoms of the problem and move on but to dig into the root cause of the issues.

Unfortunately, we tend not to want to do this, as it can be painful to admit that something went wrong, especially if we caused it, and that can negatively affect our self-esteem and confidence. There is also an element of interpersonal risk in admitting failure that adds to people not wanting to look at issues too deeply: “What if people think I’m incompetent?”

Culture is another aspect that needs to be in place for inquiry into failure to occur. Digging into failures needs:

inquiry and openness, patience, and a tolerance for causal ambiguity

But a lot of organisational cultures are geared towards action and results, not the reflection needed for learning from failure.

We are also highly susceptible to the fundamental attribution error: we downplay our own responsibility and blame external factors when we fail, and do the opposite when others fail.

Amy’s research back in 2010 showed that failure analysis is often limited and ineffective. Sadly, I think this is still the case for a lot of organisations.

Analysing complex failures is difficult, as they tend to occur across teams and departments, and for the reasons listed above most people only focus on the symptoms rather than getting at the underlying causes. It’s therefore best to use multidisciplinary teams to carry out the investigation, with management making clear that you are looking at what happened, not what someone did or didn’t do.

From the NASA Columbia disaster:

  • A team of leading physicists, engineers, aviation experts, naval leaders, and even astronauts devoted months to an analysis of the Columbia disaster.
  • They conclusively established not only the first-order cause: (symptom)
    • a piece of foam had hit the shuttle’s leading edge during launch—but also
  • second-order causes: (underlying reason)
    • A rigid hierarchy and schedule-obsessed culture at NASA made it especially difficult for engineers to speak up about anything but the most rock-solid concerns.

Experimentation

  • A critical activity for effective learning is strategically producing failures—in the right places, at the right times—through systematic experimentation.

For scientists:

  • 70% of experiments will fail
  • They recognise that failure is not optional but a part of the process
  • And that failure holds valuable information that they need to extract and learn from before the competition does 🤔

In contrast, when product companies design new products they plan for success. So they set up the pilot for optimal conditions that work, instead of representative ones they can actually learn from. The pilot therefore only produces information about what does work, not what doesn’t (see the sketch after the list below).

From Amy Edmondson:

  • A small and extremely successful suburban pilot had lulled Telco executives into a misguided confidence.
  • The problem was that the pilot did not resemble real service conditions: It was staffed with unusually personable, expert service reps and took place in a community of educated, tech-savvy customers.
  • But DSL was a brand-new technology and, unlike traditional telephony, had to interface with customers’ highly variable home computers and technical skills.
  • This added complexity and unpredictability to the service-delivery challenge in ways that Telco had not fully appreciated before the launch.
  • A more useful pilot at Telco would have tested the technology with limited support, unsophisticated customers, and old computers.
  • It would have been designed to discover everything that could go wrong—instead of proving that under the best of conditions everything would go right.
  • Of course, the managers in charge would have to have understood that they were going to be rewarded not for success but, rather, for producing intelligent failures as quickly as possible.
  • What incentives are you setting up for your employees? The things you reward are the things you will get.
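
To make the contrast concrete, here is a minimal sketch of what a more representative pilot could look like expressed as a parameterised test. provision_service() and the condition values are hypothetical, not from the article:

```python
# Minimal sketch (pytest): run the pilot under representative, degraded
# conditions rather than only the optimal ones. provision_service() and the
# condition values are hypothetical placeholders.
import pytest


def provision_service(bandwidth_kbps, os_version, support_level):
    # Stand-in for the real provisioning flow under test.
    return {"connected": bandwidth_kbps >= 512}


@pytest.mark.parametrize(
    "bandwidth_kbps, os_version, support_level",
    [
        (256, "legacy", "self-serve"),    # old computer, no expert rep on hand
        (512, "legacy", "phone-only"),
        (2048, "current", "self-serve"),  # the "optimal" case, for contrast
    ],
)
def test_provisioning_under_representative_conditions(
    bandwidth_kbps, os_version, support_level
):
    result = provision_service(bandwidth_kbps, os_version, support_level)
    # Some of these runs are expected to fail; the aim is to surface
    # intelligent failures before launch, not to prove the happy path.
    assert result["connected"]
```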

What makes exceptional organisations?

exceptional organisations are those that go beyond detecting and analysing failures and try to generate intelligent ones for the express purpose of learning and innovating.

Can you think of an organisation that purposely injects failures into its systems to see how they behave? Hint: they named the tool after monkeys 🐒 and in the process created a whole new discipline: chaos engineering. These experiments don’t have to be that big either:

[you] don’t have to do dramatic experiments with large budgets. Often a small pilot, a dry run of a new technique, or a simulation will suffice.
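
As a tiny example of how small such an experiment can be, here is a hedged fault-injection sketch in Python. fetch_profile() and the 20% failure rate are made up for illustration:

```python
# Small fault-injection sketch in the spirit of chaos engineering: wrap a call
# so it fails some of the time, then check that callers degrade gracefully.
# fetch_profile() and the 20% failure rate are hypothetical.
import random
from functools import wraps


def inject_failures(rate=0.2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise ConnectionError("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failures(rate=0.2)
def fetch_profile(user_id):
    # Stand-in for a real downstream call.
    return {"id": user_id, "name": "example"}


def get_profile_with_fallback(user_id):
    # The behaviour under test: does the caller fall back sensibly when the
    # injected failure fires?
    try:
        return fetch_profile(user_id)
    except ConnectionError:
        return {"id": user_id, "name": "unknown"}
```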

recognise the inevitability of failure in today’s complex work organizations. Those that catch, correct, and learn from failure before others do will succeed

Amy Edmondson