The Benefits of Fail-fast Systems

How do you handle an unexpected failure or system state?

Do you handle it gracefully and make a best-effort attempt at returning a response? (fail-safe)

Or, do you stop the failure in its tracks, let it blow up, and sound the alarms? (fail fast)

The general approach that has worked best for me is to always fail fast unless you’re close to the end-user.

The user interface (frontend) is the closest you can be to the end-user. When running into a failure, fatally crashing the UI is the absolute worst response. Almost any other response is better than a fatal crash: serving cached (but stale) data, showing a partial response, or offering a retry button. As a last resort, show a generic “Whoops, something broke!” message.

It’s worth putting in the effort to work out the failure-state combinations of your frontend so that it handles failures gracefully and ultimately delivers a good experience. A bad user experience causes churn and uninstalls.
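One way to make those failure-state combinations explicit is to map each load outcome to a view in a single function. This is only a minimal sketch: chooseView, safeChooseView, and the shape of the outcome object are hypothetical names I made up for illustration, not part of any framework.

```javascript
// Hypothetical helper: decide what the UI shows for a given load outcome.
// `fresh` and `cached` hold data (or undefined); `error` is set when the request failed.
function chooseView({ fresh, cached, error }) {
  if (fresh !== undefined) {
    return { kind: "content", data: fresh, stale: false };
  }
  if (cached !== undefined) {
    // Stale data beats a blank screen; flag it so the UI can say so.
    return { kind: "content", data: cached, stale: true };
  }
  if (error !== undefined) {
    // Nothing to show, but the user can still try again.
    return { kind: "retry", message: "Loading failed. Try again?" };
  }
  return { kind: "loading" };
}

// Last resort: if choosing the view itself throws, fall back to a generic message
// instead of letting the whole UI crash.
function safeChooseView(outcome) {
  try {
    return chooseView(outcome);
  } catch {
    return { kind: "error", message: "Whoops, something broke!" };
  }
}
```

Because every combination goes through one function, it is easy to list the cases and check that none of them ends in a crash.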

However, in any other situation, failing fast is a vastly superior strategy for making your systems more robust.

Failing visibly makes bugs easier to find

When your service encounters an invalid state, failing fast means failing visibly. After halting execution, make it clearly visible that your system has encountered an invalid state.

Failing visibly makes defects much harder to miss. Even if you have logging in place (which you should have anyway), errors are much more likely to be noticed when they “blow up in the face” of everyone involved.
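As a rough sketch of what failing visibly can look like in code, the function below halts on an invalid value and says exactly what was wrong. The reserveStock function and the order shape are hypothetical, made up for this example.

```javascript
// Hypothetical service function; the order shape is assumed for illustration.
function reserveStock(order, stockLevel) {
  if (!Number.isInteger(order.quantity) || order.quantity <= 0) {
    // Halt immediately and name the function, the bad value, and the order it came from.
    throw new RangeError(
      `reserveStock: invalid quantity ${order.quantity} for order ${order.id}`
    );
  }
  return stockLevel - order.quantity;
}
```

The error message carries enough context that whoever sees it in a log, an alert, or a stack trace knows where to start looking.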

Easier debugging

Continuing execution after an invalid state makes your system much harder to debug. Instead of knowing exactly where execution stopped, you’ll have to step through your code, reproduce the scenario, and figure out at what point your program diverged into the invalid state.

Avoid cascading failures

Unless you’ve carefully controlled all possible continuation scenarios, letting execution continue by failing safe means you’re effectively allowing your system to enter unknown territory. Unexpected invalid system states lead to more invalid states, and before you know it, they cascade into a much larger failure than if you had stopped at the first one.
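To make the cascade concrete, here is a small sketch that contrasts the two strategies on the same bad input. All three functions are hypothetical and exist only for this illustration.

```javascript
// Fail-safe: swallow the bad value and carry on.
function parsePriceLenient(raw) {
  const n = Number.parseFloat(raw);
  return Number.isNaN(n) ? 0 : n; // garbage silently becomes a free item
}

// Fail-fast: refuse to continue with a value we can't trust.
function parsePriceStrict(raw) {
  const n = Number.parseFloat(raw);
  if (!Number.isFinite(n) || n < 0) {
    throw new TypeError(`Invalid price: ${JSON.stringify(raw)}`);
  }
  return n;
}

// Downstream code that trusts whatever the parser hands it.
function invoiceTotal(rawPrices, parsePrice) {
  return rawPrices.map(parsePrice).reduce((sum, p) => sum + p, 0);
}

const rawPrices = ["10", "N/A", "5"];
const lenientTotal = invoiceTotal(rawPrices, parsePriceLenient); // a wrong total, no error raised
```

With the lenient parser, the invoice is computed, stored, and possibly charged with a wrong total, and nothing points back to the "N/A" that caused it. Calling invoiceTotal(rawPrices, parsePriceStrict) throws at the first bad value, before anything downstream can build on it.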

Less cognitive load and simple mental models

Failing safe means adding branches of possible code paths, resulting in more things to juggle around in your head and, consequently, a greater likelihood of mistakes.

Failing fast means predictable and deliberate programming. You write code confidently when you can rest assured that the system is in the state you expect it to be. By failing fast, you’ve effectively ruled out the possibility of unexpected states, allowing you to work with a simple mental model of your system.

Assertive Programming

There’s a whole software development methodology for fail-fast enthusiasts, called assertive programming, which I first read about in The Pragmatic Programmer.

Assertive programming follows the principle of failing fast by using assertions in the code to continuously validate the system’s state, throwing (crashing) if an assertion’s criteria have not been met.

Here’s an example:

const assert = require("node:assert"); // assuming Node.js; any assert that throws on failure will do

const adult = generateAdult();
assert(adult.age >= 18); // halts here if the assumption is violated
sellItemTo(adult);

We add an assertion after generating an adult as a defensive check: the system only ever reaches the sellItemTo(adult) call if adult.age >= 18 holds true. You might think, “but that won’t ever happen”. Well, although this example is an oversimplification, it’s always worth adding assertions to ensure that something that can’t happen, won’t.

Assertive programming is a good way to validate your assumptions as you write code, especially when that code is rather clever and prone to mistakes:

function someCrazyCalculationThatReturnsAPositiveNumber(num) {
  // do something with num
  // ...
  return result;
}

for (let i = 0; i < total; i++) {
  const a = someCrazyCalculationThatReturnsAPositiveNumber(i);
  assert(a > 0); // I don't entirely trust my crazy calc code yet
  doSomethingWith(a);
}

Ultimately, it’s intended to be a tool to increase your confidence as you code and help you build robust, fail-fast systems.


Originally posted on https://medium.com/@denhox/the-benefits-of-fail-fast-systems-dc72a665cfb5