Sarah Walker’s recent post discussed the differing perspectives on security held by IT and by engineering. I want to build on that and dig deeper into something that doesn’t get enough attention: Availability. It’s part of the classic security triad - Confidentiality, Integrity, and Availability - which describes the desirable security characteristics of a system. People often treat “availability” as synonymous with system uptime. The CrowdStrike incident last year shows it’s more than that.

The CrowdStrike Incident: A Wake-Up Call

So, what happened with CrowdStrike? In a nutshell, the vendor pushed out a faulty update that effectively acted like a ransomware-style denial-of-service attack, albeit with a larger impact because it hit many companies at the same time. A multitude of machines suddenly became unavailable, and the affected companies had to activate their disaster recovery response and expend a ton of effort to get back up and running.

I think the naive reaction is to call this a supply chain problem. But the problem is broader than a key vendor making a catastrophic mistake. The real problem is companies’ lack of attention to Availability. It’s about companies being unaware of the points of failure in their mission-critical business systems. Or, more realistically, it’s about companies having mission-critical systems that they don’t know are mission-critical.

Availability in the Real World

Ensuring system availability through sound engineering is adjacent to business continuity planning. We should think about these topics the same way. This means going beyond robust application security or securely implemented networks. We also need to design systems that are resilient in the face of unexpected events. And we need to devise business practices that are resilient to the unavailability of systems.

A big part of this is identifying single points of failure. In the CrowdStrike incident, a piece of software intended to ensure continuity for critical systems itself became a single point of failure for many organizations. For every component of a system we need to ask ourselves: “what happens if this breaks?” And then we need to break it to see what happens.
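
To make that concrete, here is a minimal sketch of asking “what happens if this breaks?” at the level of a single dependency. The names (fetch_live_price, PricingServiceDown, the fault-injection toggle) are all invented for this example rather than taken from any real system; the point is that the failure gets injected on purpose and the caller has to prove it can degrade to something usable.

```python
import random

# Hypothetical sketch: a fault-injection toggle around a dependency call, plus
# a fallback, so the failure path can be exercised on purpose.

FAULT_INJECTION_RATE = 0.0  # raise this during a resilience test run


class PricingServiceDown(Exception):
    """Raised when the (imaginary) upstream pricing service is unavailable."""


def fetch_live_price(symbol: str) -> float:
    """Stand-in for a call to an external pricing service."""
    if random.random() < FAULT_INJECTION_RATE:
        raise PricingServiceDown(f"injected failure for {symbol}")
    return 101.25  # pretend this came over the network


def get_price(symbol: str, cached_prices: dict) -> float:
    """Prefer the live price, but degrade to the last known value when the
    dependency is down, so the business keeps operating in a reduced mode."""
    try:
        return fetch_live_price(symbol)
    except PricingServiceDown:
        return cached_prices[symbol]  # stale, but available


if __name__ == "__main__":
    FAULT_INJECTION_RATE = 1.0  # simulate the dependency being completely down
    print(get_price("CLX5", cached_prices={"CLX5": 100.80}))  # prints 100.8
```

The fallback itself matters less than the fact that the failure path gets exercised deliberately, under a test, instead of being discovered for the first time during an outage.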

In a previous life I was a software engineer for a commodities trading company (a “prop shop”) in Chicago. Two anecdotes from that time illustrate what I consider to be the foundation of Availability.

First, the trading industry is a special flower, so to speak, in that software developers are directly involved in business operations. I’ve been in post mortems held later the same day a bug took a system down for, say, 45 minutes, where the trading managers could say “this application was offline and it cost us $X”. And, for the record, $X was usually more like $XX,XXX.

Traders know exactly which of their systems are mission-critical and what the impact to the business is when those systems fail. I sometimes wish we could trace such an uncomplicated, direct path from system availability to bottom-line impact with our security clients.

Second, I was in this industry in the waning years of open outcry pit trading, and got to work directly with floor traders who learned their craft in a very interesting availability environment. A trading system for this profession is a highly specialized, complex application with a small user base and constant software updates. In other words, it is inherently fragile.

Traders relied on these systems for their day-to-day work, but also knew how to deal with the inevitable, periodic failures. Before heading down to their positions in the morning, they would make sure to tuck into their coat pocket a daily printout of critical financial risk information. When their handheld tablets became unavailable, they would revert to an earlier era of trading that relied on simpler tools: mentally approximating prices based off a few key numbers, social trust, and lots of yelling.

Testing, Testing, and More Testing

There is a truism in engineering that applies across the board: if you didn’t test it, then it doesn’t work. If you have an availability plan but have never actually gone through the process of breaking things and recovering, then it doesn’t work. You have to actually run the process to know that it works.
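
The same truism applies to recovery procedures. As a toy illustration (SQLite stands in for whatever datastore you actually run, and the orders table is invented for the example), here is a sketch of testing a backup by restoring it and asking it a business-level question, rather than just checking that the backup file exists.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical sketch: restore the most recent backup into a scratch database
# and verify that it actually answers a business-level query.

def restore_and_verify(backup_sql: str, expected_min_rows: int) -> bool:
    """Load a SQL dump into a throwaway database and check that it holds a
    plausible number of orders. Returns True if the restore looks usable."""
    with tempfile.TemporaryDirectory() as scratch:
        conn = sqlite3.connect(str(Path(scratch, "restore_test.db")))
        try:
            conn.executescript(backup_sql)
            (count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
            return count >= expected_min_rows
        finally:
            conn.close()


if __name__ == "__main__":
    # Toy "backup" so the sketch runs as-is; in practice this would be pulled
    # from wherever your real backups land.
    dump = """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER);
        INSERT INTO orders VALUES (1, 'CLX5', 10), (2, 'GCZ5', 3);
    """
    print("restore usable:", restore_and_verify(dump, expected_min_rows=1))
```

Run on a schedule, a check like this turns “we have backups” into “we can restore”, which is the claim that actually matters for Availability.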

At the trading firm, one accidental benefit of having a fragile trading system was that it went offline multiple times per year. As a result, we got plenty of practice with our backup procedures. That was a less profitable mode of operation, but we were still operating.

There is a sinister side-effect of having robust, high-availability systems: we never get to practice operating without them. For these systems, we need to manufacture failure. In other words, we need to break it to see what happens.

There are several well-known examples from the industry. AWS uses a conceptual framework called “Game Day”:

A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. These should be conducted regularly so that your team builds “muscle memory” on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.
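
As a rough illustration of what “actually perform the actions” can look like when scripted, here is a small drill-runner sketch. The inject_failure, restore, and business_check callables are placeholders I have invented for the example, not part of any AWS tooling; the idea is simply to break something on purpose and measure how long it takes before a business-level check passes again.

```python
import time

# A minimal game-day drill runner. inject_failure, restore, and business_check
# are placeholders for whatever a real drill would do (stop a service, pull a
# DNS record, revoke a credential).

def run_drill(inject_failure, restore, business_check, timeout_s=300):
    """Inject a failure, then measure how long it takes before the
    business-level check passes again. Returns seconds to recovery, or
    None if the drill times out."""
    inject_failure()
    started = time.monotonic()
    try:
        while time.monotonic() - started < timeout_s:
            if business_check():
                return time.monotonic() - started
            time.sleep(1)
        return None
    finally:
        restore()  # always put the environment back, even on a timeout


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs as-is.
    state = {"primary_up": True}
    recovery = run_drill(
        inject_failure=lambda: state.update(primary_up=False),
        restore=lambda: state.update(primary_up=True),
        business_check=lambda: not state["primary_up"],  # "fallback answered"
        timeout_s=10,
    )
    print(f"recovered in {recovery:.1f}s" if recovery is not None else "drill timed out")
```

The timeout and the always-run restore step are the important parts: a drill that leaves the environment broken teaches the wrong lesson.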

Another well-known example of this is Netflix’s Chaos Monkey:

Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
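
For flavor, here is a stripped-down sketch of the same idea using boto3 against a pool of EC2 instances tagged as having opted in to chaos testing. This is not Netflix’s actual Chaos Monkey (which runs as part of their Spinnaker tooling), and the chaos-opt-in tag is my own invention for the example.

```python
import random

import boto3  # assumes AWS credentials and region are already configured
from botocore.exceptions import ClientError


def terminate_random_instance(dry_run: bool = True):
    """Pick one running instance from the (hypothetical) chaos-opt-in pool and
    terminate it. Returns the instance id chosen, or None if the pool is empty."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instance_ids:
        return None
    victim = random.choice(instance_ids)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, a "DryRunOperation" error means the call would
        # have succeeded; anything else is a real failure.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim


if __name__ == "__main__":
    print(terminate_random_instance(dry_run=True))
```

Defaulting to a dry run and selecting only from an explicit opt-in pool are the guardrails that make it reasonable to run something like this on a schedule.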

These resiliency testing approaches are sometimes referred to as Chaos Engineering, and work at the broadest level of Availability.

Bringing It All Together

The CrowdStrike incident shows how a company can be doing everything “correctly” and still fail to ensure availability of critical business functions. We need to think about Availability in the broadest sense to ensure that our companies continue to operate during adverse circumstances. We need to:

  • Break stuff: Identify points of failure by forcing unusual or infrequent problems to occur more regularly
  • Fix stuff: Engineer systems to ensure availability despite localized failures
  • Learn stuff: Practice how to continue operating when entire systems fail

The CrowdStrike incident, in which a tool meant to protect us from malware took critical systems offline, highlights the need to broaden our understanding of how malware (and malware-like) events can occur. If we don’t, fragility in our systems will continue to translate directly into fragility for our businesses.