The months and days before and after CrowdStrike's fatal Friday

Analysis The great irony of the CrowdStrike fiasco is that a cybersecurity company caused the exact sort of massive global outage it was supposed to prevent. And it all started with an effort to make life more difficult for criminals and their malware, with an update to its endpoint detection and response tool Falcon.

Earlier today, the embattled security biz published a preliminary post-incident review about the faulty file update that inadvertently led to what has been described, by some, as the largest IT outage in history.

CrowdStrike also pledged to take a series of actions to ensure that this doesn't happen again, including more rigorous software testing and gradually rolling out these types of automated updates in a staggered manner, instead of pushing everything, everywhere, all at once. We're promised a full root-cause analysis at some point.

Here's a closer look at what happened, when, and how.

CrowdStrike structures its behavioral-based Falcon software in two layers: so-called sensor content, which defines code templates used to detect malicious activity on systems, and rapid response content, configuration data that customizes those templates to pick up specific threats. The rapid response content tells the sensor content's code templates how to operate so that malware and intruders can be identified and stopped.

This content is pushed out to Falcon deployments in the form of channel files that you've heard all about.
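To make that split concrete, here's a minimal, purely conceptual Python sketch of the two-layer design. None of the names or fields below come from CrowdStrike; the point is only that the template code ships with the sensor, while the channel file supplies data the template interprets:

```python
import fnmatch

# Hypothetical stand-in for "sensor content": detection logic that ships
# with the agent and only changes with a full sensor release.
class NamedPipeTemplate:
    def evaluate(self, pipe_name: str, params: dict) -> bool:
        # "Rapid response content" supplies only data (patterns, thresholds),
        # never new code; the template decides how that data is applied.
        patterns = params.get("suspicious_pipe_patterns", [])
        return any(fnmatch.fnmatch(pipe_name.lower(), pat) for pat in patterns)

# Hypothetical stand-in for a parsed channel file: pure configuration data.
channel_file = {
    "template": "named_pipe",
    "suspicious_pipe_patterns": [r"\\.\pipe\evil_*", r"\\.\pipe\c2_*"],
}

template = NamedPipeTemplate()
print(template.evaluate(r"\\.\pipe\evil_backdoor", channel_file))  # True
print(template.evaluate(r"\\.\pipe\spooler", channel_file))        # False
```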

As far as we're aware – and let us know any other details you may have – the security snafu started way back on February 28, when CrowdStrike developed and distributed a sensor update for Falcon intended to detect an emerging attack technique that abuses named pipes on Windows. Identifying this activity is a good way to minimize damage done by intruders. The sensor update apparently went through the usual testing successfully prior to release.
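For a feel of what abusing named pipes looks like from the defender's side, here's a crude, user-mode Python sketch of spotting dodgy pipe names on a Windows box. It's an illustration only and assumes a Windows host: a real EDR sensor like Falcon watches pipe creation at kernel level rather than polling a directory listing, and the pattern list here is purely illustrative.

```python
import fnmatch
import os

# Illustrative patterns only; real detections key on behavior and known
# tooling, not a hardcoded list like this.
SUSPICIOUS_PIPE_PATTERNS = ["msagent_*", "postex_*"]

def list_named_pipes() -> list[str]:
    # Windows exposes every open named pipe under the special \\.\pipe\ namespace.
    return os.listdir(r"\\.\pipe\\")

def flag_suspicious_pipes() -> list[str]:
    return [pipe for pipe in list_named_pipes()
            if any(fnmatch.fnmatch(pipe.lower(), pat)
                   for pat in SUSPICIOUS_PIPE_PATTERNS)]

if __name__ == "__main__":
    for pipe in flag_suspicious_pipes():
        print("suspicious named pipe:", pipe)
```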

Days later, on March 5, the update was stress tested and validated for use. As a result, that same day, a rapid response update that made use of the new malicious named pipe detection was distributed to customers.

Three additional rapid response updates using this new code template were pushed between April 8 and April 24, and all of these "performed as expected in production," according to the vendor.

Then, nearly three months later, came the rapid response update heard around the world.

At 0409 UTC on Friday, July 19, CrowdStrike pushed the ill-fated update to its Falcon endpoint security product. The rapid response content – one of two updates released that day – was intended to pick up on miscreants using named pipes on Windows to remote-control malware on infected computers, relying on that March sensor template update to detect the activity. But the data being delivered this time was malformed.

It was simply incorrect, and worse: CrowdStrike's validation system to check that content updates will work as expected did not flag up the broken channel file that was about to be pushed out to everyone. The validator software was buggy, allowing the bad update to slip out when the release should have been stopped.
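The preliminary review doesn't describe the channel file format, so the following Python sketch is only a guess at the kind of pre-release check a content validator performs, with invented field names. The point is that a structural problem in the data should block the release before it reaches customers:

```python
# Invented schema: each template type declares how many parameters it expects.
EXPECTED_PARAM_COUNT = {"named_pipe_detection": 5}

def validate_channel_entry(entry: dict) -> list[str]:
    errors = []
    template = entry.get("template")
    params = entry.get("params")
    if template not in EXPECTED_PARAM_COUNT:
        errors.append(f"unknown template type: {template!r}")
    elif not isinstance(params, list):
        errors.append("params must be a list")
    elif len(params) != EXPECTED_PARAM_COUNT[template]:
        errors.append(f"{template}: expected {EXPECTED_PARAM_COUNT[template]} "
                      f"parameters, got {len(params)}")
    return errors

# A malformed entry like this should stop the release, not sail through.
broken = {"template": "named_pipe_detection", "params": ["a", "b", "c"]}
assert validate_channel_entry(broken), "validator failed to flag bad content"
print("validation errors:", validate_channel_entry(broken))
```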

For people calling for CrowdStrike to test its releases prior to distribution, it did try – and blundered. As the biz now acknowledges, it needs to perform better testing and sandboxing before customers get their updates.

And so when Falcon tried to interpret the new, broken configuration information in the rapid response content on July 19, it was confused into accessing memory it shouldn't touch. Because the security software runs within the context of the Windows operating system – giving it good visibility of the machine so it can scan and protect it – when it crashed from that bad memory access, it took out the whole OS and its applications.
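As a toy illustration of that failure class (and emphatically not CrowdStrike's actual code): a parser trusts a count declared in the content and reads past the end of the data it actually received. In user-mode Python that's a caught exception; in a component running with kernel-level access on Windows, the equivalent out-of-bounds read takes the whole machine down.

```python
def apply_content(values: list[str], declared_count: int) -> None:
    # Trusts the count baked into the content...
    for i in range(declared_count):
        # ...and walks off the end of the data when the two disagree.
        print("configuring detection with", values[i])

apply_content(["pipe-pattern-1", "pipe-pattern-2"], declared_count=2)  # fine

try:
    # Malformed content: claims three values but ships only two.
    apply_content(["pipe-pattern-1", "pipe-pattern-2"], declared_count=3)
except IndexError as exc:
    print("out-of-bounds read:", exc)
```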

Users would see a dreaded blue screen of death, and the computer would go into a boot loop: Upon restarting, it would crash all over again, and repeat.

CrowdStrike deployed a fix at 0527 UTC the same day, but in the 78 minutes it took its engineering team to remediate the issue, at least 8.5 million Windows devices were put out of action. That's more than one million machines every ten minutes on average over that span; imagine if the fix had taken hours rather than minutes to deploy.
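That per-ten-minutes figure is simple arithmetic on the numbers above, as this quick back-of-the-envelope check shows:

```python
# Quick sanity check of the rate quoted above.
devices = 8_500_000      # reported minimum count of affected Windows devices
minutes = 78             # time between the bad push and the fix
print(f"{devices / minutes * 10:,.0f} machines per ten minutes")  # ~1,089,744
```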

We're told that the channel file updates were not just to tackle use of named pipes to connect malware to remote command-and-control servers, but also to prevent the use of those pipes to hide malicious activity from security software like Falcon.

"It was actually a push for a behavioral analysis of the input itself," said Heath Renfrow, co-founder of disaster recovery firm Fenix24. "The cybercriminals, threat actors, they found a new little trick that has been able to get past EDR solutions and CrowdStrike was trying to address that. Obviously it caused a lot of issues."

By the time CrowdStrike pushed a fix to correct the error, millions of Windows machines weren't able to escape the boot loop. "So the fix really only helped the systems that had not turned into the blue screen of death yet," Renfrow told The Register. For systems that were already screwed, the broken channel file needed to be deleted or replaced, typically by hand, which is bad news for anyone with thousands of PCs to repair.
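The widely circulated manual workaround boiled down to booting the stricken machine into Safe Mode or WinPE and deleting the offending channel file. Here's a sketch of that deletion step in Python, assuming the publicly reported C-00000291*.sys file naming and the default install path, and defaulting to a dry run:

```python
import glob
import os

# Assumes the default Falcon driver directory and the publicly reported
# naming of the broken channel file; meant for Safe Mode or WinPE, not prod.
CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

def remove_bad_channel_files(dry_run: bool = True) -> None:
    for path in glob.glob(os.path.join(CROWDSTRIKE_DIR, "C-00000291*.sys")):
        print("would delete" if dry_run else "deleting", path)
        if not dry_run:
            os.remove(path)

if __name__ == "__main__":
    remove_bad_channel_files(dry_run=True)
```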

That Friday, airlines, banks, emergency communications, hospitals, and other critical orgs including (gasp!) Starbucks ground to a halt. And criminals, seizing the opportunity to make money amid the chaos, quickly got to work phishing those who got hit and spinning up domains purporting to host fixes that were in fact malicious.

Microsoft, in turn, provided sage advice for Falcon customers whose Azure VMs remained in a BSOD boot loop: reboot. A lot. "Several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage," Redmond said on Friday. 

America's CISA weighed in with its initial alert at 1530 UTC on July 19. "CISA is aware of the widespread outage affecting Microsoft Windows hosts due to an issue with a recent CrowdStrike update and is working closely with CrowdStrike and federal, state, local, tribal and territorial (SLTT) partners, as well as critical infrastructure and international partners to assess impacts and support remediation efforts," the government agency said.

Later that day, at 1930 UTC, after an earlier non-apology on Xitter, CrowdStrike CEO and founder George Kurtz did "sincerely apologize" to his company's customers and partners.

We doubt that made IT administrators feel any better about the fiasco: they spent their entire weekend trying to remediate the problem and recover broken clients and servers, upwards of hundreds of thousands of machines in some reported cases.

The next day, at 0111 UTC on July 20, CrowdStrike published some technical details about the crash.

Weekend warriors

Microsoft, that Saturday, issued a recovery tool, which has since been updated with two repair options for Windows endpoints: one recovers machines by booting into WinPE (the Windows Preinstallation Environment), and the other recovers impacted devices from safe mode.

By Sunday, July 21, the embattled endpoint vendor began issuing recovery instructions in a centralized remediation and guidance hub, beginning with help for impacted hosts, followed by how to recover BitLocker keys for rebooted machines, plus what to do with affected cloud environments. It also noted that a "significant number" of the 8.5 million borked Windows devices were "back online and operational."

As of 1137 UTC on July 22, CrowdStrike reported it had tested an update to the initial fix, and noted the update "has accelerated our ability to remediate hosts." It also pointed users to a YouTube video with steps on how to self-remediate impacted remote Windows laptops.

By Wednesday, July 24, Sevco Security CEO JJ Guy reckoned CrowdStrike service was about 95 percent restored, based on his firm's analysis of agent inventory data.

Even if the vast majority of endpoints have been restored, full recovery may take weeks for some systems.

"The issue is that it's going to require manpower to physically go to a lot of these devices," Renfrow said. His biz issued free recovery scripts for borked Windows machines.

"But even with our automation scripts, that only takes about 95 percent of the work away," Renfrow added. "So you still have the other 5 percent, where it's going to have to be physically on there."

As the recovery process continues, Renfrow said he expects to see CrowdStrike start sending support personnel to its customers' locations. 

"I know they were circling the wagons with partners that have physical IT bodies that can go to client sites to help them, whoever is struggling," he said. "I think that is a step they're going to take, and I think they're going to pay the bills for that."

Also on Wednesday, Malaysia's digital minister said he had asked CrowdStrike and Microsoft to cover any monetary losses that customers suffered because of the outage. 

CrowdStrike did not respond to The Register's questions about the incident, including whether it planned to compensate businesses for damages or pay for IT support to help recover borked machines. Cue the class-action lawsuits that are likely coming soon.

In addition to legal challenges, CrowdStrike faces a congressional investigation, and Kurtz has been called to testify in front of the US House Committee on Homeland Security about the vendor's role in the IT outage.

Can CrowdStrike recover?

The fiasco will likely cause reputational damage, but the extent of that damage, and any lasting impact, remains to be seen and depends largely on CrowdStrike's continued response, according to Gartner analyst Jon Amato.

"In the short term, they're going to have to do a lot of groveling," Amato told The Register

"Let's just be realistic about this: They're gonna have some very, very uncomfortable conversations with clients at every level, the largest of the largest enterprises and agencies that use it right now, all the way down to the small and more medium business," he continued. "I don't envy them. They're going to have some really uncomfortable, and frankly miserable, conversations."

However, he added, the tech disaster "is recoverable," and "I think they do have a way out of it if they have to continue to be transparent, and they continue to have this communication with some degree of humility."

IDC Group VP of Security and Trust Frank Dickson said CrowdStrike can save its reputation if it admits its mistakes and implements better practices to increase transparency in the software update process.

Over the next three-to-six months, the cybersecurity shop "clearly is going to have to change their process by which they roll out updates," Dickson told The Register. This includes improving its software testing and implementing a phased roll-out in which updates are gradually pushed to bigger segments of the sensor base — both of which CrowdStrike today committed to doing.

"The issue with the CrowdStrike detection platform is it scales wonderfully, massively, very quickly," Dickson said. "But you can also scale a logic error very quickly. So they're going to need to implement procedures to make sure that they start doing more phased rollout, they're going to need to formalize this, put it in policy, and they're going to have to publish it for transparency so that all CISOs now can review this."

CrowdStrike isn't the first tech company to cause a global disaster because of a botched update. A routine McAfee antivirus update in 2010 similarly bricked massive numbers of Windows machines. CrowdStrike boss Kurtz, at the time, was McAfee's CTO.

This most certainly won't be the last software snafu, according to Amato. 

"It shouldn't have happened," he said. "But the fact is that software testing, no matter where the source is and what vendor we're talking about, ultimately depends upon humans. And humans, as it turns out, are fragile."

Even best practices in software design can fail, and "CrowdStrike had a great track record of product quality up until this point," Amato noted. "That seems to be the takeaway for me: This could have happened to literally any organization that operates the way CrowdStrike does."

By this, he means any software product that hooks into the Windows kernel and has that deep level of access to the operating system. "It was just CrowdStrike's bad luck that it happened to them and their customers." ®
