Commentary: Critical Infrastructure and Collective Responsibility

Chris Evans | Cloud, Commentary, Enterprise, Opinion

19 July 2024 will go down as the date of one of the worst IT incidents in history, when a single security update took down over 8 million PCs and servers running the Windows operating system.  What can we learn from the incident, and what does the IT world continue to get wrong?

Background

Many thousands of words have been written about the global IT outage that stemmed from an update to a CrowdStrike Falcon channel file.  The background to the incident can be found here (CrowdStrike blog), although it provides fairly minimal detail on the problem.

The channel file is not itself a kernel driver but configuration content consumed by the Falcon sensor’s kernel-mode driver, and in this instance, the update caused a BSOD (blue screen of death), essentially a crash loop on boot-up.  The workaround is to boot Windows into Safe Mode, manually remove the affected file (channel file 291), and then reboot.  The mechanics of this are easier said than done, with some users reporting up to 15 reboots required and alternative workarounds needed for systems encrypted with BitLocker (or other security packages).
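For illustration, the workaround can in principle be scripted.  The minimal Python sketch below assumes the affected volume is accessible (for example, from Safe Mode or a recovery environment with Python available; in practice, this was done by hand or with a simple command-line deletion), and the directory and file pattern follow CrowdStrike’s published guidance.

```python
import glob
import os

# Location of Falcon sensor channel files, per CrowdStrike's public guidance.
# Assumes the affected Windows volume is mounted as C: (adjust if recovering
# the disk from another environment).
CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
BAD_CHANNEL_FILE = "C-00000291*.sys"  # channel file 291

def remove_bad_channel_files(directory: str = CROWDSTRIKE_DIR,
                             pattern: str = BAD_CHANNEL_FILE) -> list[str]:
    """Delete any copies of the faulty channel file and report what was removed."""
    removed = []
    for path in glob.glob(os.path.join(directory, pattern)):
        os.remove(path)
        removed.append(path)
    return removed

if __name__ == "__main__":
    deleted = remove_bad_channel_files()
    if deleted:
        print("Removed:", *deleted, sep="\n  ")
        print("Reboot normally to complete the workaround.")
    else:
        print("No matching channel files found - nothing to do.")
```

Even scripted, this assumes the files can be reached at all; on BitLocker-protected systems, the recovery key is needed before the volume can be unlocked.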

With a significant amount of manual intervention required to bring systems back, we can expect it will take some time to recover the claimed 8.5 million affected devices.  Some could simply be rolled back to a last known good backup, whereas others will require an on-site visit to put the workaround in place.

Learning

What can we learn from this incident, and how does it demonstrate the evolution of IT over the last two decades?

First, it is clear that many organisations have chosen to outsource administration responsibilities and place an increased dependency on both the public cloud and SaaS vendors.  A single update was able to propagate to 8.5 million devices, apparently in a matter of minutes.  In itself, this isn’t a big deal if the systems in question are non-critical (more on this in a moment).

Second, we continue to use a PC-based operating system developed nearly 40 years ago as the basis for a whole suite of devices that operate critical infrastructure.  The design of the PC (and most servers) has barely changed in that time, with the standard architecture still built around a single boot device, BIOS, and a local operating system image.  Undoubtedly, this design persists for convenience and cost reasons.

Third, for many companies, the discipline of proper testing before deployment appears to have been thrown out the window (no pun intended).  For at least a substantial part of the deployed Windows ecosystem, this update simply cannot have been tested before implementation.  One wonders exactly what testing was done by CrowdStrike itself, but bear in mind that Windows is old (and therefore internally complex), and we have no evidence yet as to whether this incident occurred because of a specific set of circumstances on the affected devices.

Remember that there are perhaps upwards of a billion devices running Windows (800 million was quoted in 2019), so the affected footprint could be less than 1% of the global install base.  The affected users were customers of CrowdStrike, and we have no evidence as to whether all or only a subset of customers was affected.  More information is needed from the company to provide a clearer breakdown.
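We don’t know what testing CrowdStrike performed, but the principle at stake is straightforward: exercise an update against a representative matrix of target configurations and block publication on any failure.  The Python sketch below is purely hypothetical; the test matrix and the boot_and_verify check are illustrative stand-ins for whatever real validation a vendor would run.

```python
import random
from dataclasses import dataclass

@dataclass
class TestTarget:
    os_build: str
    encrypted: bool  # e.g. BitLocker enabled

# A hypothetical matrix of representative configurations the update must pass on.
TEST_MATRIX = [
    TestTarget("Windows 10 22H2", False),
    TestTarget("Windows 10 22H2", True),
    TestTarget("Windows 11 23H2", False),
    TestTarget("Windows Server 2019", False),
    TestTarget("Windows Server 2022", True),
]

def boot_and_verify(target: TestTarget, update_id: str) -> bool:
    """Stand-in for a real check: apply the update to a test machine and confirm
    it still boots and the sensor loads.  Simulated here with a random result."""
    return random.random() > 0.02

def release_gate(update_id: str) -> bool:
    """Only allow release if every configuration in the matrix passes."""
    failures = [t for t in TEST_MATRIX if not boot_and_verify(t, update_id)]
    for t in failures:
        print(f"BLOCKED: {update_id} failed on {t.os_build} (encrypted={t.encrypted})")
    return not failures

if __name__ == "__main__":
    if release_gate("channel-file-291"):
        print("All targets passed - update may proceed to a staged rollout.")
```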

Critical Infrastructure

Perhaps the greatest lesson from the CrowdStrike incident is the need to define what we deem “critical” infrastructure.  If you were a patient waiting for a hospital appointment, then any systems used to manage your experience could be counted as critical.  If you were planning a holiday and were left stuck at an airport, then baggage handling and check-in systems are going to be critical for you.

In my experience (over 35 years in commercial IT), most businesses underestimate their dependency on IT systems and the impact of their failure.  The cost/benefit model is typically skewed towards an assumption that systems will rarely (if ever) fail, and that if they do, they will fail individually.  A “mass failure” scenario is rarely considered, because most historical outages have been isolated hardware problems.

“Software becomes more reliable over time, whereas hardware is bound to fail.”

I would suggest that few businesses execute disaster recovery plans that consider a mass software failure.  Even fewer are prepared to architect their systems with additional resiliency to cover software-related issues.  Dual-booted hardware (for example) would just add more cost.  Unfortunately, with hindsight, many IT teams will be thinking that a more resilient software rollout process could have prevented Friday the 19th, but few would have pushed to justify the upfront investment.
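To see why that skew matters, consider a back-of-the-envelope comparison (all figures below are made-up assumptions for illustration, not data from the incident): under the “failures are rare and individual” assumption, the expected daily loss looks trivial; a single correlated mass-failure event breaks that assumption entirely.

```python
# Illustrative cost/benefit sketch: independent failures vs a correlated
# "mass failure" event.  All numbers are assumptions for illustration only.

NUM_SYSTEMS = 500              # endpoints/servers supporting the business
P_FAIL_INDEPENDENT = 0.01      # chance any one system fails on a given day
COST_PER_SYSTEM_HOUR = 200.0   # revenue/productivity lost per failed system-hour
HOURS_TO_RECOVER_ONE = 2       # typical fix time when failures arrive one at a time
HOURS_TO_RECOVER_ALL = 24      # fix time when every system needs manual intervention

# Expected daily loss under the "failures are rare and individual" assumption.
expected_daily_loss = (NUM_SYSTEMS * P_FAIL_INDEPENDENT
                       * HOURS_TO_RECOVER_ONE * COST_PER_SYSTEM_HOUR)

# Loss from a single correlated event that takes out every system at once.
mass_failure_loss = NUM_SYSTEMS * HOURS_TO_RECOVER_ALL * COST_PER_SYSTEM_HOUR

print(f"Expected loss per day (independent failures): £{expected_daily_loss:,.0f}")
print(f"Loss from one mass-failure day:               £{mass_failure_loss:,.0f}")
print(f"Ratio: {mass_failure_loss / expected_daily_loss:,.0f}x")
```

A resiliency investment that looks unjustifiable against the first number can look very different against the second.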

Most IT infrastructure today is more critical to ongoing operations than businesses think.  Companies need to review their design principles and place more importance on systems directly used to run the business.  This is disaster recovery thinking that has been around for half a century but seems to have been discarded in the push to reduce operating costs.

What Next

How do we move past last week’s incident?  Here’s a blog post I wrote in 2011, highlighting the need to review applications running in the public cloud.  Unfortunately, IT isn’t just about cloud-based systems; technology is used to run pretty much everything a business operates.  So, here is some practical advice.

  1. A “root and branch” review of all IT systems supporting all aspects of the business.  Everything means everything, including systems running basic tasks such as signage.  Each platform and system needs to be evaluated for criticality; can the business run without it for minutes, hours, or days?  What is the impact and cost during that period (both financially and reputationally)?
  2. Re-architecting – what can be done to make systems more resilient?  Resiliency is a big topic and should cover both software and hardware failures, MTTR (mean time to repair/recover), and the cost of repair/recovery.  The resiliency discussion is challenging because it involves a cost/benefit trade-off that isn’t always obvious.  Nobody expected both of the Twin Towers in New York to collapse on 11 September 2001, for example.  More “out of the box” thinking is required.
  3. Remediation – systems need to be updated to be more resilient.  I specifically use the term “systems” here, as IT is a collection of devices operating as a system.  In the CrowdStrike incident, for example, the update appeared to be capable of hitting entire systems with no staged or phased rollout process.  This strategy is unacceptable when an update carries the risk of taking down the operating system and requiring manual intervention (see the sketch after this list).
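On the third point, the alternative to a simultaneous push is well understood: release in rings, watch a health signal from each ring, and halt automatically before the blast radius grows.  The Python sketch below illustrates the idea; the ring sizes, threshold, and health check are assumptions for illustration, not CrowdStrike’s actual (or any vendor’s) process.

```python
import random

# Illustrative ring sizes as a share of the estate (assumptions, not vendor policy).
RINGS = [
    ("canary", 0.001),    # internal / opt-in machines first
    ("early", 0.01),
    ("broad-1", 0.10),
    ("broad-2", 0.889),   # remainder of the estate
]
FAILURE_THRESHOLD = 0.001  # halt if more than 0.1% of a ring becomes unhealthy

def ring_health(ring_name: str, devices: int) -> float:
    """Stand-in for real telemetry: the fraction of devices in the ring that
    crashed or stopped reporting after the update.  Simulated here."""
    return random.uniform(0.0, 0.002)

def staged_rollout(total_devices: int) -> None:
    for name, share in RINGS:
        devices = int(total_devices * share)
        failure_rate = ring_health(name, devices)
        print(f"Ring {name:8s}: {devices:>9,} devices, failure rate {failure_rate:.3%}")
        if failure_rate > FAILURE_THRESHOLD:
            print(f"HALT: rollout stopped at ring '{name}'; "
                  "remaining rings never receive the update.")
            return
    print("Rollout completed across all rings.")

if __name__ == "__main__":
    staged_rollout(8_500_000)
```

Even a crude gate like this limits a bad update to the first ring rather than the whole estate.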

The Architect’s View®

The CrowdStrike incident is a wake-up call for many businesses that never assumed this type of scenario could happen.  The Internet is being described as “fragile”, but in reality, this outage occurred due to poor operational practices and a dependency on legacy technology (the Windows operating system).

Some businesses are happy to take that risk, rightly judging that the cost of maintenance is outweighed by the impact of an incident (and by the need to protect against ransomware).  However, for many others (including, in the UK, hospitals, schools, and GP surgeries), there simply aren’t the funds to build IT up and implement higher levels of resiliency.  This is where the fragility will continue, not because the underlying technology is a problem, but due to a lack of appropriate investment.

However, we all need to take collective responsibility and improve the design and implementation of IT systems to ensure this type of incident can never occur again.  We need better solutions than Windows, especially for remote systems where a physical “visit” is so expensive in terms of time and effort.  Hardware vendors need to help too, with resilient hardware that works well with software, including remote management.  Operational staff need to think through every failure scenario.  Senior managers need to reflect on the value of having good, well-paid operational staff rather than pushing responsibility off to third-party vendors.

Perhaps that last comment is the most important to bear in mind.  Businesses want to deploy the latest and greatest technology, currently generative AI.  But we need to build resilient systems, too, because our dependency on technology is so great. 

I don’t expect much will change in the next decade.  The next big outage will probably be due to the failure of an LLM that has been used to underpin the business practices of thousands of companies.  We can’t say we haven’t been warned.  IT history continues to repeat itself, and the next major incident is just a matter of when, not if.


Post #dc3e. Copyright (c) 2024 Brookend Ltd. No reproduction in whole or part without permission.