Reliability: Two Mistakes High

Bala
4 min read · Feb 26, 2021

This post is about two questions: what is “Two Mistakes High”, and how is it relevant to our IT industry?

  • What does it mean?
  • Why TWO mistakes?
  • Is it relevant to us?
  • How do we apply it in an application?
  • Conclusion

I stumbled on this expression a couple of years back while reading this book. When I read it, it sounded OK; nothing too exciting or cool about it. Later, I noticed that it was a seed that had taken root and kept growing stronger. Now I can’t get it out of my head. Every time I make a decision about reliability or stability, whether it is for work or for fixing something at home (be it an electrical, plumbing, carpentry, or IoT project), I almost hear someone whispering in my ear, “arre yooou flyying two mistakes high?”

Hold on ✋, before you call this paranoia, let’s see what it means to be “flying two mistakes high”.

What does it mean?

This is an expression used while kids are learning to fly remote control airplanes.

Well, when you learn to fly, you will be doing some maneuvers and trying to learn acrobatics. Of course, you will try out a stunt of some sort. And, quickly, you will learn this lesson: If you make a mistake, your plane will naturally lose some altitude.

And, you will see, mistakes equate to altitude.

So, keeping your plane “two mistakes high” means keeping it high enough that you have the altitude to recover from two independent mistakes.

Why TWO mistakes?

While you are recovering from the first mistake — and you are now already lower in altitude — what happens if you make another slip-up? If your plane isn’t high enough to recover from the second mistake, well, it’s bad news. And, if you lose too much altitude, you know what happens — broken toy at worst.

That is why you always want to stay high enough to recover from a second mistake, even while you are still recovering from the first one. As a result, you don’t crash even when two things go wrong back to back.

Is it relevant to us (IT)?

We saw where the expression came from. And, this is a good analogy for maintaining availability in our most critical applications.

In our critical modern applications, it means that even when something is going wrong, we want to keep the application running reliably enough that we can afford for something else to go wrong while we are still recovering from the first problem. Think about it: during a recovery, we are typically stressed and perhaps in a tricky situation, doing potentially ad-hoc things. That is just the type of situation that can cause us to make another mistake.
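One way to make this concrete is capacity headroom. The sketch below is only an illustration under simple assumptions: a service made of identical replicas, a known peak load, and a "mistake" meaning the loss of one replica. The function names and the numbers are hypothetical, not taken from any real system.

```python
import math


def replicas_needed(peak_load: float, per_replica_capacity: float, mistakes: int = 2) -> int:
    """Replicas required to carry peak_load even after `mistakes` replicas are lost."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    # Minimum number of replicas that can carry the peak with nothing broken.
    base = math.ceil(peak_load / per_replica_capacity)
    # Add one spare replica per mistake we want to survive.
    return base + mistakes


def mistakes_high(replicas: int, peak_load: float, per_replica_capacity: float) -> int:
    """How many replicas can be lost before the rest can no longer carry peak_load."""
    base = math.ceil(peak_load / per_replica_capacity)
    return max(replicas - base, 0)


# Hypothetical numbers: 900 requests/s at peak, 250 requests/s per replica.
print(replicas_needed(900, 250))   # replicas to run if we want to fly two mistakes high
print(mistakes_high(5, 900, 250))  # current "altitude" with 5 replicas
```

With these made-up numbers, running five replicas leaves room for only one mistake, so a second failure during recovery would already hurt users.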

While I was researching this topic (just a glorified way of saying “I googled it”), I was able to find only a handful of articles that relate this philosophy to the IT field, and most of them are about how it relates to availability.

But, to me, it is more than that. It applies to many more scenarios.

I believe most of us are inherently risk-takers who like to push the envelope now and then to test our limits (in this case, our application’s limits, keeping the error budget in mind). I consider the R/C plane analogy a meta-thinking tool: it nudges you to take calculated risks in any given situation.

It is a lesson about redundancy, and it’s a lesson about resiliency, and it is a lesson about… you get the idea. It applies to modern application development, change management, and operations. It even applies to many other areas: dealing with hardware failures, data redundancy, capacity planning, performing retries in your service calls, reducing toil, risk management, and disaster planning.
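Take service calls as one small example. A minimal sketch, assuming a read that can be served from a primary, a replica, or a cached default: with three sources in the chain, two of them can fail and the caller still gets an answer. The names `first_success`, `primary`, `replica`, and `cached_default` are hypothetical stand-ins, and the failures are simulated.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_success(sources: Sequence[Callable[[], T]]) -> T:
    """Try each source in order and return the first result that does not raise."""
    errors: List[Exception] = []
    for source in sources:
        try:
            return source()
        except Exception as exc:  # broad on purpose: any failure moves us to the next source
            errors.append(exc)
    raise RuntimeError(f"all {len(sources)} sources failed: {errors!r}")


# Hypothetical sources that simulate two independent mistakes.
def primary() -> str:
    raise TimeoutError("primary timed out")


def replica() -> str:
    raise ConnectionError("replica unreachable")


def cached_default() -> str:
    return "cached default"


print(first_success([primary, replica, cached_default]))
```

With only `primary` and `replica` in the list, the same two failures would surface as an error to the caller, which is the "one mistake high" version of this call.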

For those curious minds who might ask, “Why stop at two? Why not three or more mistakes high?”

Short answer: to keep it simple. I don’t want to go full Inception here and end up in Limbo. To start with, two sounds good enough.

How do we apply it in an application?

For starters, when we identify the failure scenarios that we anticipate, we should walk through the ramifications of those scenarios and our recovery plan for each of them. We make sure the recovery plan itself does not have the potential for mistakes or other shortcomings built into it. In short, we check that the recovery plan can work and that it has a backup for its own shortcomings.
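That walk-through can be turned into a small checklist tool. The sketch below is one possible way to do it, under the assumption that we can list which components each failure scenario takes out and which components each recovery plan depends on. It then flags every pair where a second failure would block the recovery from the first. All scenario and component names are invented for illustration.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple


def risky_pairs(
    breaks: Dict[str, Set[str]],
    recovery_needs: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    """Find (first, second) failure pairs where `second` breaks something
    that the recovery plan for `first` depends on."""
    problems = []
    for first, second in permutations(breaks, 2):
        blocked = recovery_needs[first] & breaks[second]
        if blocked:
            problems.append((first, second, blocked))
    return problems


# Hypothetical inventory: what each failure takes out...
breaks = {
    "primary_db_down": {"primary_db"},
    "vpn_outage": {"vpn", "bastion"},
    "bad_deploy": {"app"},
}
# ...and what each recovery plan needs in order to work.
recovery_needs = {
    "primary_db_down": {"replica_db", "bastion"},
    "vpn_outage": {"backup_link"},
    "bad_deploy": {"ci_pipeline", "vpn"},
}

for first, second, blocked in risky_pairs(breaks, recovery_needs):
    print(f"recovering from {first} fails if {second} also happens (needs {sorted(blocked)})")
```

In this made-up inventory, a VPN outage would block both the database failover and the deploy rollback, so those two recovery plans are only one mistake high until they get a path that does not depend on the VPN.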

Sounds simple, right? A big no: it is easier said than done. But we can practice it wherever and whenever we can, until it becomes a habit.

Conclusion

A few years back, when one of our applications was facing serious stability issues, my mentor gave me some advice on bringing things under control. This is the gist of what he said: “Hey Bala, it is ALL about asking the right questions and there is no need to get overwhelmed.”

I can confidently say that “are you flying two mistakes high?” belongs on that list of right questions to ask.

For site reliability engineering, the word “mindset” is key. Being an effective SRE is as much about how you think as it is about your technical skills.
Kevin Casey

Written by Bala

An enthusiastic explorer, a passionate programmer, and a pragmatic architect with 15+ years of IT experience in the FinTech realm.
