Microsoft Windows powers more than a billion PCs and millions of servers worldwide, many of them playing key roles in facilities that serve customers directly. So, what happens when a trusted software provider delivers an update that causes those PCs to immediately stop working?
As of July 19, 2024, we know the answer to that question: Chaos ensues.
In this case, the trusted software developer is a firm called CrowdStrike Holdings, whose previous claim to fame was being the security firm that analyzed the 2016 hack of servers owned by the Democratic National Committee. That's just a quaint memory now, as the firm will forever be known as The Company That Caused The Largest IT Outage In History. It grounded airplanes, cut off access to some banking systems, disrupted major healthcare networks, and threw at least one news network off the air.
Microsoft estimates that the CrowdStrike update affected 8.5 million Windows devices. That's a tiny percentage of the worldwide installed base, but as David Weston, Microsoft's Vice President for Enterprise and OS Security, notes, "the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services." According to a Reuters report, "Over half of Fortune 500 companies and many government bodies such as the top U.S. cybersecurity agency itself, the Cybersecurity and Infrastructure Security Agency, use the company's software."
What happened?
CrowdStrike, which sells security software designed to keep systems safe from external attacks, pushed a faulty "sensor configuration update" to the millions and millions of PCs worldwide running its Falcon Sensor software. That update was, according to CrowdStrike, a "Channel File" whose function was to identify newly observed, malicious activity by cyberattackers.
Although the update file had a .sys extension, it was not itself a kernel driver. It does, however, communicate with other components in the Falcon sensor that run in the same space as the Windows kernel, the most privileged level on a Windows PC, where they interact directly with memory and hardware. CrowdStrike says a "logic error" in that code caused Windows PCs and servers to crash within seconds after they booted up, displaying a STOP error, more colloquially known as the Blue Screen of Death.
Repairing the damage from a flaw like this is a painfully tedious process that requires manually rebooting every affected PC into the Windows Recovery Environment and then deleting the defective file from the PC using the old-school command line interface. And if the PC in question has its system drive protected by Microsoft's BitLocker encryption software, as virtually all business PCs do, the fix requires one extra step: entering a unique 48-digit BitLocker recovery key to gain access to the drive and allow removal of the faulty CrowdStrike file.
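To make the "find and delete the defective file" step concrete, here is a minimal Python sketch of the matching logic. It rests on assumptions that do not come from this article: the directory and the filename pattern are the ones reported in CrowdStrike's public remediation guidance, and the function names are hypothetical. In practice the fix was carried out by hand at the recovery command prompt, where Python is not available, so treat this as an illustration of the logic rather than a repair tool.

```python
from pathlib import Path
from typing import List

# Assumptions (not stated in the article): location and pattern as reported
# in CrowdStrike's public remediation guidance. Verify against the vendor's
# advisory before relying on either value.
CROWDSTRIKE_DIR = Path("Windows") / "System32" / "drivers" / "CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"


def find_faulty_channel_files(system_drive: Path) -> List[Path]:
    """Return channel files under system_drive that match the faulty pattern."""
    target_dir = system_drive / CROWDSTRIKE_DIR
    if not target_dir.is_dir():
        # Nothing to do if the sensor directory is absent on this drive.
        return []
    return sorted(target_dir.glob(FAULTY_PATTERN))


def remove_faulty_channel_files(system_drive: Path, dry_run: bool = True) -> List[Path]:
    """List matching files and delete them only when dry_run is False."""
    matches = find_faulty_channel_files(system_drive)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

The dry-run default is deliberate: an administrator would want to see exactly which files match before anything is deleted from a system directory.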
If you know anyone whose job involves administering Windows PCs in a corporate network that uses the CrowdStrike code, you can be confident they are very busy right now, and will be for days to come.
We've seen this movie before
When I first heard about this catastrophe (and I am not misusing that word, I assure you), I thought it sounded familiar. On Reddit's Sysadmin Subreddit, user u/externedguy reminded me why. Maybe you remember this story from 14 years ago:
"Defective McAfee update causes worldwide meltdown of XP PCs."
Oops, they did it again.
At 6AM today, McAfee released an update to its antivirus definitions for corporate customers that had a slight problem. And by "slight problem," I mean the kind that renders a PC useless until tech support shows up to repair the damage manually. As I commented on Twitter earlier today, I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today.
In that case, McAfee had delivered a faulty virus definition (DAT) file to PCs running Windows XP. That file falsely detected a crucial Windows system file, Svchost.exe, as a virus and deleted it. The result, according to a contemporary report, is that "affected systems will enter a reboot loop and [lose] all network access."
The parallels between that 2010 incident and this year's CrowdStrike outage are uncanny. At the core of each was a defective update, pushed to millions of PCs running a powerful software agent, causing the affected devices to stop working. Recovery required manual intervention on every single device. And the flawed code was pushed out by a public company desperately trying to grow in a brutally competitive marketplace.
The timing was awkward for McAfee. The defective DAT file was released on April 21, 2010. About four months later, on August 19, 2010, Intel announced its intention to acquire McAfee, Inc. for $7.68 billion.
That 2010 McAfee screw-up was a big deal, kneecapping Fortune 500 companies (including Intel!) as well as universities and government/military deployments worldwide. It knocked 10% of the cash registers at Australia's largest grocery chain offline, forcing the closure of 14-18 stores.
And in the You Can't Make This Up Department… CrowdStrike's founder and CEO, George Kurtz, was McAfee's Chief Technology Officer during that 2010 incident.
What makes the 2024 sequel so much worse is that it also affected Windows-based servers running in the cloud, on Microsoft's Azure and on Amazon's AWS. And just as with the many laptops and desktop PCs that were bricked by this faulty update, the cloud-based servers require time-consuming manual interventions to recover.
CrowdStrike's QA failed
Surprisingly, this isn't the first faulty Falcon Sensor update from CrowdStrike this year.
Less than a month earlier, according to a report from The Stack, CrowdStrike released a detection logic update for the Falcon sensor that exposed a bug in the sensor's Memory Scanning feature. "The result of the bug," CrowdStrike wrote in a customer advisory, "is a logic error in the CsFalconService that can cause the Falcon sensor for Windows to consume 100% of a single CPU core." The company rolled back the update, and customers were able to resume normal operations by rebooting.
At the time, computer security expert Will Thomas noted on X/Twitter, "[T]his just goes to show how important it is to download new updates to 1 machine to test it first before rolling out to the whole fleet!"
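The "one machine first" idea can be sketched as a ring-based rollout: apply the update to a small canary group, check its health, and widen the rollout only if nothing broke. The Python below is a minimal sketch under stated assumptions. The names `staged_rollout`, `RolloutHalted`, `apply_update`, and `is_healthy` are hypothetical, and nothing here describes how CrowdStrike or any other vendor actually ships updates.

```python
from typing import Callable, List, Sequence


class RolloutHalted(Exception):
    """Raised when a ring reports unhealthy hosts and the rollout stops."""

    def __init__(self, failed: Sequence[str], updated: Sequence[str]) -> None:
        super().__init__(f"rollout halted; unhealthy hosts: {list(failed)}")
        self.failed = list(failed)
        self.updated = list(updated)


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply an update ring by ring, stopping at the first unhealthy ring.

    rings is ordered from smallest (canary) to largest (whole fleet).
    Returns every host that received the update.
    """
    updated: List[str] = []
    for ring in rings:
        for host in ring:
            apply_update(host)
            updated.append(host)
        # Check the whole ring before widening the blast radius.
        failed = [host for host in ring if not is_healthy(host)]
        if failed:
            raise RolloutHalted(failed, updated)
    return updated
```

With rings such as `[["canary-1"], ["pilot-1", "pilot-2"], [...the fleet...]]`, a health check that fails on `canary-1` stops the rollout after a single machine instead of after millions.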
In the 2010 McAfee incident, the root cause turned out to be a complete breakdown of the QA process. It seems self-evident that a similar failure in QA is at work here. Were these two CrowdStrike updates not tested before they were pushed out to millions of devices?
Part of the problem might be a company culture that's long on tough talk. In the most recent CrowdStrike earnings call, CEO George Kurtz boasted about the company's ability to "ship game-changing products at rapid pace," taking special aim at Microsoft:
And more recently, following yet another major Microsoft breach in CISA's Cyber Safety Review Board's findings, we received an outpouring of requests from the market for help. We decided enough is enough, there's a widespread crisis of confidence among security and IT teams within the Microsoft security customer base.
[…]
Feedback has been overwhelmingly positive. CISOs now have the ability to reduce monoculture risk from only using Microsoft products and cloud services. Our innovation continues at breakneck pace multiplying the reasons for the market to consolidate on Falcon. Thousands of organizations are consolidating on the Falcon platform.
Given recent events, some of those customers might be wondering whether that "breakneck pace" is part of the problem.
How much fault should Microsoft shoulder?
It's impossible to let Microsoft completely off the hook. After all, the Falcon sensor problems were unique to Windows PCs, as admins in Linux and Mac-focused shops were quick to remind us.
Partly, that's an architectural issue. Developers of system-level apps for Windows, including security software, historically implement their features using kernel extensions and drivers. As this example illustrates, faulty code running in the kernel space can cause unrecoverable crashes, whereas code running in user space can't.
That used to be the case with macOS as well, but in 2020, with macOS 11, Apple changed the architecture of its flagship OS to strongly discourage the use of kernel extensions. Instead, developers are urged to write system extensions that run in user space rather than at the kernel level. On macOS, CrowdStrike uses Apple's Endpoint Security Framework and says that, using that design, "Falcon achieves the same levels of visibility, detection, and protection exclusively via a user space sensor."
Could Microsoft make the same sort of change for Windows? Perhaps, but doing so would certainly bring down the wrath of antitrust regulators, especially in Europe. The problem is especially acute because Microsoft has a lucrative enterprise security business, and any architectural change that makes life more difficult for competitors like CrowdStrike would be rightly seen as anticompetitive.
Microsoft currently offers APIs for Microsoft Defender for Endpoint, but competitors aren't likely to use them. They'd much rather argue that their software is superior, and using the "inferior" offering from Microsoft would be hard to explain to customers.
But this incident, which caused many billions of dollars' worth of damage, should be a wake-up call for the entire IT community. At a minimum, CrowdStrike needs to step up its testing game. And customers need to be more cautious about allowing this sort of code to deploy on their networks without testing it themselves.