What the Blue Screen of Death that froze the world taught us
Summary
- We need throttled rollouts of new software, with resilience built into computer systems and rigorous testing of both. This matters even more in the AI era, though the episode also suggests that an end-of-the-world threat is likelier to arise from a little IT bug than from any AI superintelligence.
An Atlantic article (bit.ly/3YcLNV7) by Samuel Arbesman reminded me of a joke he recounts there: "A software engineer walks into a bar. He orders a beer. Orders zero beers. Orders 99,999,999,999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd. The bar is still standing. A real customer walks in and asks where the bathroom is. The bar bursts into flames."
For us non-software types, it means that every piece of software needs to be stress-tested by flinging at it a wide variety of inputs that might induce errors. Often, though, it is the input nobody anticipated or even thought about that brings the system down.
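In code terms, that kind of stress test can be sketched very simply. The function below is hypothetical (no real bar runs on it), but it shows the idea of hurling odd inputs at a program and checking that it rejects them gracefully rather than bursting into flames:

```python
# Hypothetical "bar" function: it should refuse bad orders instead of crashing.
def order_beers(quantity):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        raise ValueError("quantity must be a whole number")
    if quantity < 0 or quantity > 1_000:
        raise ValueError("quantity out of range")
    return f"served {quantity} beer(s)"

# Fling the joke's inputs at it: normal, zero, absurd, wrong type, gibberish.
weird_orders = [1, 0, 99_999_999_999, "a lizard", -1, "ueicbksjdhd", None, 3.7]
for order in weird_orders:
    try:
        print(order_beers(order))
    except ValueError as err:
        print(f"rejected {order!r}: {err}")
```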
So, the best way to make a system break-proof is to try and break it by every means possible and then fix the gaps. Netflix made this approach popular with Chaos Monkey, an internal tool that would randomly knock out systems and sub-systems to check how its infrastructure behaved. It helped engineer a robust system to serve movies on demand.
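Netflix's real tooling is far more elaborate, but the core idea of chaos engineering can be sketched in a few lines of Python; the service names and fleet below are invented purely for illustration:

```python
import random

# Toy fleet: each service runs on a handful of instances (all names invented).
fleet = {
    "recommendations": ["rec-1", "rec-2", "rec-3"],
    "playback": ["play-1", "play-2"],
    "billing": ["bill-1"],
}

def unleash_chaos(fleet):
    """Randomly terminate one instance, Chaos Monkey style."""
    service = random.choice(list(fleet))
    victim = random.choice(fleet[service])
    fleet[service].remove(victim)
    print(f"chaos: terminated {victim} from '{service}'")
    # A resilient design survives this; a fragile one reveals its weak spot.
    if not fleet[service]:
        print(f"warning: '{service}' has no instances left - a single point of failure")

unleash_chaos(fleet)
```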
Global cybersecurity major CrowdStrike seems not to have learnt from this. On 19 July, a blast from the past, the dreaded Blue Screen of Death (BSOD), reappeared across computer screens, disrupting airports, hospitals and businesses worldwide.
The culprit? A faulty update to CrowdStrike's Falcon software that contained a glitch, causing Windows systems globally to crash and exposing the fragility of our interconnected digital world.
This faulty update contained an error that caused a major malfunction in the core of the Windows operating system (known as the ‘kernel’). Such a problem can usually be fixed by simply undoing the update, but this time affected machines crashed before they could receive a remote fix, so each computer had to be repaired manually, leading to significant delays and disruptions.
It was the kind of disruption anticipated in 2000 with the Y2K bug, just that it happened 24 years later!
CrowdStrike scrambled to issue patches, but the fix took a while. The fallout included missed flights, acrimony and lawsuits. Even as businesses limped back from what was probably the largest IT outage ever, what can we learn from this?
This is all the more pertinent as we enter the age of AI, with powerful Large Language Models, autonomous AI agents and the looming inevitability of super-intelligence.
Importance of throttled rollouts: A glaring oversight in the BSOD incident seems to be the lack of a throttled rollout. In software deployment, especially for critical systems, updates are normally released to a small number of users first.
This allows issues to be detected before they can impact a larger user base. Take the rollout of GPT-4o: initially, it was made available to a limited number of free and Plus users with higher message limits. This careful, incremental approach ensured that any potential problems could be identified and resolved early, preventing widespread disruption.
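A throttled (or ‘canary’) rollout can be expressed as a simple gate in a deployment pipeline. The stage sizes, error threshold and function names below are assumptions made for this sketch, not any vendor's actual process:

```python
# Deploy to a sliver of the fleet first, check health, then widen the release.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]  # fraction of machines per stage
MAX_ERROR_RATE = 0.001                     # halt if more than 0.1% of hosts fail

def observed_error_rate():
    """Placeholder for real telemetry from the hosts updated so far."""
    return 0.0

def rollout(update_id):
    for stage in ROLLOUT_STAGES:
        print(f"deploying {update_id} to {stage:.0%} of the fleet")
        if observed_error_rate() > MAX_ERROR_RATE:
            print("error rate too high; halting rollout")
            return False
    print("rollout complete")
    return True

rollout("sensor-update-291")
```

The point of the gate is that a defect surfaces while it affects 1% of machines, not 100%.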
Redundancy and resilience: The incident also highlighted the fragility of our digital infrastructure. Organizations must invest in redundancies and resilience, which include having backup systems, a multilayered cybersecurity approach (using solutions from multiple reputable vendors), and robust incident response plans.
Relying heavily on a single vendor for cybersecurity, as the widespread deployment of CrowdStrike showed, can create a single point of failure.
Proactive monitoring and testing: Before rolling out major updates, conducting a ‘pre-mortem’ can be invaluable. This exercise involves gathering key stakeholders to brainstorm potential failure points before they happen. This includes rigorous and frequent testing of updates in staged environments that closely mirror production systems.
Additionally, continuous real-time monitoring powered by advanced analytics and machine learning can detect anomalies and potential threats early on. Implementing automated rollback mechanisms can also act as a safety valve, providing a rapid solution when unexpected issues arise.
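An automated rollback can be as simple as comparing post-deployment crash telemetry against a baseline and reverting when it spikes. The thresholds and telemetry hook here are illustrative assumptions:

```python
# Safety valve: if crash reports jump well above baseline after an update,
# revert to the previous version automatically rather than waiting for humans.
BASELINE_CRASHES_PER_MINUTE = 2
SPIKE_MULTIPLIER = 10  # treat a tenfold jump as an anomaly

def crashes_last_minute():
    """Placeholder for a real feed of crash dumps or missed heartbeats."""
    return 1

def watch_and_rollback(current_version, previous_version):
    if crashes_last_minute() > BASELINE_CRASHES_PER_MINUTE * SPIKE_MULTIPLIER:
        print(f"anomaly detected: reverting {current_version} to {previous_version}")
        return previous_version
    print(f"{current_version} looks healthy; keeping it")
    return current_version

active_version = watch_and_rollback("update-292", "update-291")
```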
All this is even more relevant in the AI era. The systems that run our world are not built by a single entity, but run on a smorgasbord of interconnected computers, servers and data centres that we loosely call the cloud.
The Windows outage also made it apparent that the ‘end of the world’ may be triggered by a mundane IT failure rather than the actions of any much-ballyhooed AI superintelligence. As an article (bit.ly/3SbDXan) in the New York Times put it, citing T.S. Eliot: "This is the way the world ends/Not with a bang but a whimper."
Anuj Magazine, co-founder of AI&Beyond, contributed to this article.