The Complexities of “Failing Fast”

There has been lots of talk recently about “failing” and “failing fast”. This is actually an extremely complex topic that can be difficult to grasp.

It’s not actually about failing.

The topic of “failing” is actually all about learning. When we think about learning, we can break it down into various properties: knowledge, skills, experience and, arguably most important of all, attitude.

When you think of these as separate entities, our attitude to learning varies for each of the other properties: knowledge, skills and experience.
And when we think about how failure relates to each property, the concept of failure is completely different between them.

For example, we can gain knowledge through various means – a common avenue is reading books and blogs. Obviously reading a book or blog allows us to gain knowledge and learn – even if what we learn is that we disagree with, or can’t use, what’s being said. But if we think about failure relating to this activity of reading, what would that look like? Not reading the book or blog? Or half reading it?… The word “failure” doesn’t quite fit with this kind of learning.

Our attitude to failure tends to apply more in the context of gaining knowledge through practice, experimentation and decision making.

“Safe to fail” and “fail fast and often”… But what about the different types and severities of failure?

Let’s home in on this for a second. When we practice or experiment, there seems to be a push to “make it safe to fail” and to “fail fast” or “fail often”, but this fails to take into account the variation in the types and scales of failure – some of which can be very detrimental and even harmful, and which we should probably avoid at all costs!

There are many different variables relating to the types of failure:

Experimentation Failure – This is when what you planned for the experiment doesn’t work out how you wanted it to (i.e. it’s unsuccessful). This may be safe or unsafe depending on the context and how big the experiment is. This experimentation failure could possibly have a knock-on effect for…

Failure For The Business – How much does it harm the organisation? There is definitely an acceptability scale here, and sure, there is an objective to make it “safe”, but there are a huge number of unknowns here, and as a human race, we’re inherently pretty bad at uncovering and assessing risk.

Failure in learning (from 1 & 2) – By this I mean failing to gain any knowledge, experience or skills after you’ve previously experienced failure (at experiment or business levels). In some ways, this type of failure is worse for us at a personal level, as it can lead to personal reputational damage if we make the same mistakes over and over again without learning from them.

This type of failure can be common in some environments. Especially when there are time pressures… Time might be afforded to do the experiment, but not afforded to reflect on the failures or the lessons from the experiment.

In fact, continued time pressures are themselves a sign that the company might not have learned from its past failures.

Related to this is…

Failure in thinking about the consequences – This actually relates to failure prevention as much as it relates to learning from the failure. If you have no awareness of the possible consequences of the experiment failing, then there is a higher risk of being unable to move out of the way of the bigger failures relating to the business.

Another way to describe this: failing to learn about a possible failure before it happens, so that you could act quickly to prevent it.

I’ve seen all of these kinds of failure personally and within many companies that I have worked with in the past.

One company that I used to work with was experimenting with an org change. There were some big “red flag” concerns and lots of confusion that came out of part 1 of the experiment, which was at a small scale with one team. But the company actively didn’t take those concerns and risks on board and ploughed ahead with part 2 of the experiment, which was to scale it up to other teams.
At that point, everyone could see the failure coming like a double-decker bus speeding towards you from 10 miles up the road – we knew we were in danger, but we kept standing in the middle of the road, just waiting for the bus… watching it get closer. Prepping ourselves for the impact. The reason that no corrective action was taken (aka no quick learning occurred) was that the company kept on saying: “It’s fine – failure is ok. There’s no need to get out of the way of the bus… Let’s see what happens when it hits us, and then we can make our next decision after that”.

The final outcome? The metaphorical bus hit hard. The experimental failure caused a business failure as lots of people ended up resenting the company and leaving. This in turn affected the product and ultimately caused reputational damage to the company too. On top of that, the company struggled to re-hire people as it had also gained a bad rep within the software development world too. Double whammy… 😦

Different severities based on different contexts.

Context also plays a big part in failure – in the levels of acceptability, and in the severity scales of failures too.

If you are working on medical software, aeroplane software, banking software, government software, etc… then experimenting in this context should be treated completely differently compared to working on a small independent mobile app that’s for entertainment.

In certain contexts, more emphasis is certainly needed on investigating the risks and effects of failing, to raise awareness.

Let’s focus the conversations on learning.

Having a focus on making our failures safe doesn’t often take into account (or at least make explicit) the boundaries of acceptability or the risks surrounding failure at all levels (beyond the experiment itself), so I would suggest we shift the conversations to focus on learning, and making learning safe.

Don’t just think about the positive things you could learn within the knowledge, skills or experience you could gain through experiments and practice, but think about what we can learn from the possible failures too (see what I did there? I didn’t say think about the failures, I said think about what we can learn from the possible failures). Having an awareness of the risks of failure – actually putting some focus on learning about those possible risks – allows us to learn the lessons sooner, without necessarily having to experience the failure itself.

This focus on learning helps us to think of the risks and the possibilities of failure. Ask yourself what you would learn from a failure, and then ask if and how you can learn that lesson sooner than the failure itself.

Also ask yourself what the wider impact and consequences would be from the failure (from failing an experiment, to the knock-on effects regarding business failure, and even at that personal level too). And think about whether the learning opportunities justify accepting the risks and dealing with those possible consequences, or whether they provide an insight that allows us to make a course-corrective, preventive decision regarding the experiment.

Only when we understand what it means to learn, and understand the possibility of our experiments failing at this deeper level, can we start to think about how to reduce the risks and consequences, and answer the questions: do we need to fail in order to learn these lessons, and is there a safer way to learn quicker?

Oddly, I’m seeing a lot of posts in different feeds today about stress and failure. (I’m assuming this is just coincidence…)

I’ve been in failure situations before. No-one died. And in the first instance, the failure caused some temporary reputational exposure for the organisation, which was in part my fault because I cut a particular corner but didn’t recognise the unforeseen consequences of cutting that corner. A meltdown followed. Fortunately, the organisation stood back, reviewed what had happened, and took some steps to prevent similar problems in future. We all learnt from that failure, and not just in technical matters.

In another situation more recently, failure came about because a particular project wasn’t properly defined at the outset. Development and testing went ahead without our knowing that these structural issues existed. Only when the product was released into the real world did the problems become clear – by which time, other people were promoting the product and making what was planned as a gradual beta release snowball even as problems were coming out of the woodwork. There were severe cashflow implications and some reputational damage, but the responsible people had long since gone on to something else. We did learn from this and took much more care on the next project to get requirements gathering and early engagement with the business right, but by then it was too late. The company’s owners decided for us that “deputy heads must roll” and in fact I lost my job because of it (guilt by association). My immediate managers, indeed, even the on-site senior management team, understood what had gone wrong and where; but the owners, who were remote and only cared about the bottom line, took a different attitude.

“Failing fast” and treating failure as a learning process does rather depend on the lessons that you take from that failure. If you take positive lessons from it, then failing fast, limiting the fallout and learning the lessons is a good thing. But if someone is determined to find culprits and trash both the project that failed and anything else that team worked on, then you are not in a good place.

With the first scenario, it sounds like you and your company put a focus on learning, and that’s great.
With that company, when you said that your company “took some steps to prevent similar problems in future” – were those steps technical steps to prevent the exact problem from occurring again (as in detecting that problem early)?
Or did you also think about (and put in place measures on) how you could learn such lessons about the risks sooner, before the failure occurs, for other similar types of risks – not necessarily the same risk that manifested in this scenario?

That’s what I’m trying to say with this blog. Failure is good to learn from, but faster learning feedback – thinking about the potential risks of possible failures – might allow us to learn the same lesson much quicker, without having to fail at all.

Additionally, with your second company – the one that let you down – it sounds like it wasn’t safe to learn…
But additionally, this also relates to the scales and types of failure that I mentioned. Maybe the company didn’t afford enough time to investigate the potential risks before pushing forward with the activity that caused the failure?
Or maybe they were pushing a message of “it’s safe to fail”, so that investigation was deemed unnecessary – but the reality was that a smaller-scale failure relating to the experimentation would have been safe, just not the knock-on effect that caused the business failure, and that’s what was deemed unacceptable…?
I wonder… If the investment was there to spend some time trying to investigate possible risks – to try and learn sooner from “potential failure” rather than the slower feedback cycle of experiencing the failure, then it may have been a different outcome with that company.

In the first instance it was more about changing the way that particular job was done (it wasn’t specifically a systems development or testing job, but it was something we were going to be doing annually for the [then] foreseeable future, so getting the process right was the best way of fixing things); it was also the first time we’d done that particular exercise, so the whole thing was going to involve learning points for the future. And when I say it involved some meltdown, I actually mean “I was taken out of the office feet first and put in an ambulance”, which rather made the management think a little more about workloads and looking after staff welfare.

In the second case, years later, I think you’re overthinking it and making the error of thinking that the company owners were people who actually cared anything about what was done in the workplace.

I’m talking about a multi-million pound turnover, third party services provider company that (at the time) did its own in-house IT development work. However, the company was owned by a venture capitalist company based in London, for whom my company was just one entry in a list of portfolio investments. Whilst my own immediate line management was quite enlightened and had a realistic view of “success” and “failure” in IT projects, our in-house senior management had never been exposed to modern thinking on IT project management. They were actually quite enlightened in other ways, but were a bit new to more flexible management. Unfortunately, when the first project went bad and began to impact the bottom line, all the enlightenment in the world couldn’t make up for the venture capitalists who just moved in, booted out some Board members, and put in place measures to drive down costs and drive up profits. The sort of analytical approach you’re thinking about was not on their radar because they were just hatchet men with no interest in things like feedback cycles, investigating risk, or even ‘pushing messages’. They didn’t care what was done on a day-to-day basis, only whether this year’s profit numbers were more than last year’s.

In the first example, the organisation did both. They instituted a policy of second eyes on a project so that critical work didn’t go out unchecked. And because they also took time to step back and think a bit more broadly about the sequence of events, they realised that I’d been doing something very new and complex all by myself for four months solid and my manager had just let me get on with it; and when managers looked at it (whilst I was away recuperating), they realised just how much stress I’d been putting myself under. So they instituted awareness training for managers about workplace stress so as to say “watch out for your people doing too much, because it can go bad on all sides”.

As for the second instance – no, they certainly weren’t pushing a “safe to fail” message! It was a classic Type A management model and I suspect there was no risk assessment and no expectation that any would be done. The purges started at the top with the CEO and spread downwards through the organisation. I was never asked for my view as to what happened, and the only lesson learnt was “we’re better off buying in proprietary apps” (and then, “hey, why don’t we buy the company that writes the apps?”)

I’m wondering if the difference between those who value ‘learning fast’ and those who value ‘failing fast’ is simply a difference of context. Let me start with what Rumsfeld said about un/known un/knowns:

“Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history… it is the latter category that tend to be the difficult ones.” – Donald Rumsfeld, https://en.wikipedia.org/wiki/There_are_known_knowns

Could it be that those who fail fast are thinking about businesses that have more unknown unknowns, where results are really difficult to predict? Failing fast within the context of a startup means trying some vague concept of an idea that has never been tried before, with only guesses as to what the customer wants, much less how precisely to build the product. This is why pivots in startups matter so much, as unknown unknowns become known unknowns: some aspects of the idea were reasonably good, but some aspect was wrong. When starting, which aspect (if any) will be wrong is an unknown unknown. Learning, on the other hand, often comes from known unknowns. You know you don’t know something, so you go and discover the answer. This assumes there is a clear mission or objective, some piece of data you need to find.

What history seems to suggest is that unknown unknowns get you into trouble a large amount of the time, but once in a while generate some insanely profitable value. The so-called ‘unicorns’ of SV fit into that perspective really well. No one knew they wanted to be able to call a taxi via an app on their phone in 2007. But within two years of the iPhone showing up, Uber came into being. It’s easy to look back and say it was obvious, but at the time the business started up, it was unknown whether anyone would want it and unknown how exactly it would work. They started by creating a limo service. It’s a little less obvious that it would then become a taxi service using an on-demand network of drivers, but that’s because they were exploring the space. This, to me, is what failing fast is about.

You might argue that all of this is simply known unknowns – that is to say, they knew that they knew nothing about how the company would really work. I can see why you might argue that, but in that case, as soon as you know of a risk, it’s just a matter of studying it. E.g. the next plague may occur, but in knowing that it might, I now know it’s an unknown, thus it’s become a known unknown. I think that might be true, but it violates the spirit of the point. A claim of ‘knowledge’ around a big enough surface area, with a broad enough claim, doesn’t mean it’s really known; it’s just a vague, abstract awareness.

Learning fast is, in my mind, about creating the right experiments to gather results or taking lessons away from results. Failing fast is about exploring an environment and seeing what is/isn’t working so you can pivot and explore another area.

Sorry if this is a bit rambling, it was my own effort to explore the space.

And then there’s the class of “unknown unknowns” which are unknowable; your example of Uber coming out of nowhere is an excellent one. It makes me think of this:

“If human thought is a growth, like all other growths, its logic is without foundation of its own, and is only the adjusting constructiveness of all other growing things. A tree cannot find out, as it were, how to blossom, until comes blossom-time. A social growth cannot find out the use of steam engines, until comes steam-engine time.” (Charles Fort)

No-one could think of Uber until the right conditions existed for it. This is a special example of learning fast: seeing the potential where others don’t. And that’s where learning crosses over into innovation.

Interestingly, fail fast started as a software term, describing systems that surface an error as soon as a user does something wrong, rather than having the software blow up at the end of a run. If Wikipedia is right, it appears the term was coined in 1985: https://en.wikipedia.org/wiki/Fail-fast
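As a rough illustration of that original software sense (a minimal sketch – the function and data here are hypothetical, not from any particular system), failing fast means rejecting bad input at the boundary, so the mistake is reported immediately instead of surfacing later as a confusing error deep inside the run:

```python
def average_ages(ages):
    """Return the mean of a list of ages, failing fast on bad input."""
    # Fail fast: report the mistake at the boundary with a clear message,
    # rather than letting an empty list crash later as ZeroDivisionError
    # or letting a negative age silently skew the result.
    if not ages:
        raise ValueError("ages must not be empty")
    for age in ages:
        if not isinstance(age, (int, float)) or age < 0:
            raise ValueError(f"invalid age: {age!r}")
    return sum(ages) / len(ages)
```

Without the guards, `average_ages([])` would still fail, but only at the final division, and with an error message that says nothing about what the caller did wrong – which is exactly the “blow up at the end of a run” behaviour the term was coined against.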

The point of failing fast is to lower the expense of mistakes through early notification. The point of learning fast is to not make the same mistakes again and again. I think both are useful. For example, stopping an experiment early because the answer is clear is valid. It’s like ending a testing session when the build doesn’t pass the smoke tests – sure, you could test more, but why bother? Fail the build fast so the developer can jump on it right away. The learning aspect is in figuring out why you made such a terrible build, so that you don’t create builds that fail smoke tests like that over and over again.

In an engineering sense, that’s why we started writing unit tests: so we can fail faster (before running smoke tests). It turns out unit tests also provide a side benefit of improving designs*, so they provide learning opportunities as well. So I see how the two concepts are interrelated, but I hesitate to deprecate failing fast when its goal should be complementary to learning fast rather than a pure replacement.
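To sketch that engineering sense (the function and values here are hypothetical, chosen just for illustration): a unit test exercises one function in isolation, so a regression fails seconds after the change is made, long before a full build or smoke-test run would catch it.

```python
def apply_discount(price, percent):
    """Return the price after a percentage discount, failing fast on bad input."""
    # Fail fast on nonsense input rather than returning a silently wrong value.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


# A pytest-style unit test: cheap to run on every change, so a mistake
# is surfaced (and can be learned from) almost immediately.
def test_apply_discount():
    assert apply_discount(100.0, 25) == 75.0
    try:
        apply_discount(100.0, 150)
    except ValueError:
        pass  # expected: the fail-fast guard fired
    else:
        raise AssertionError("expected ValueError for percent > 100")
```

The failing-fast part is the guard clause and the quick test run; the learning-fast part is what you do with a red test – working out why the bad value got there, and changing the design so it can’t recur.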

I hope that clarifies my point.

JCD

* This statement is still debated in the development community but that is out of scope of the topic.