Updates … My perspective on the recent Microsoft/Crowdstrike Outage
A perspective on the recent outage of Microsoft Windows systems running Crowdstrike's Falcon, and how a comprehensive configuration scanning solution is the key to preventing this and similar issues.
As outlined in a previous article, the most effective way of staying ahead of the CVE curve is to update regularly. Some software vendors handle this on your behalf; for others, it is your responsibility. In the former case, you should in theory always stay up to date, since the vendor treats it as their responsibility that you remain current, that nothing crashes and that all edge cases are considered. It is also the strongest assurance that you are running the latest version of any given piece of software. The “good” news: all of the banks, airlines, hospitals, etc. that were affected by the recent outage caused by Crowdstrike on Microsoft systems applied their updates promptly… unfortunately, all at the same time ;-)
The Bad News
Everyone who was affected updated their Crowdstrike agents at the same time, and because the update was faulty, all of those systems went down simultaneously. While we are certainly in favor of timely updates, this particular series of events turned into an absolute disaster.
The Core Issue
Based on the recommended remediation steps, the culprit is clear: an ill-formatted update file caused an over-privileged module on the system to crash, sending the machine into a blue screen of death. It has been many, many years since software was delivered to, or even operated on, a local machine via hard drives, CDs, DVDs or, going *way* back, floppy disks. There is now always an external piece of infrastructure involved, and it needs to be configured. Additionally, the update/patch file that is distributed needs to be judiciously double-checked.
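As a rough illustration of the kind of defensive check that helps here, the sketch below validates a purely hypothetical update-file layout before a privileged component would parse it. The magic bytes, record size and file name are assumptions made for the example, not Crowdstrike's actual format.

```python
import struct

# Hypothetical update-file layout used only for illustration: a 4-byte magic
# marker, a 4-byte record count, then fixed-size 64-byte records.
MAGIC = b"UPD1"
RECORD_SIZE = 64

def validate_update_file(path: str) -> bool:
    """Reject malformed update files before a privileged component loads them."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        return False

    # Basic structural checks: magic bytes, minimum size, and a declared record
    # count that matches the actual payload length. A privileged component
    # should refuse to parse anything that fails these checks instead of
    # crashing on it.
    if len(data) < 8 or data[:4] != MAGIC:
        return False
    (record_count,) = struct.unpack_from("<I", data, 4)
    expected_len = 8 + record_count * RECORD_SIZE
    return len(data) == expected_len

if __name__ == "__main__":
    print(validate_update_file("channel_update.bin"))  # hypothetical file name
```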
The current blind-spots that can cause these issues
Companies like Crowdstrike and Microsoft are clearly doing everything in their power to ensure that their systems run and remain stable. The current best practices are code scanning, testing in all its flavors (unit, integration and behavior), basic cloud security scanning on the infrastructure layer, and some CVE discovery. But there are many more configurations at every layer, and most of the time these are not considered.
In the update process for this driver, the code was very likely dutifully scanned, passed all the tests on the testing systems, and was deemed ready for release. After that, a (hopefully automated) process was started to distribute the update to customers. That last piece, together with the applications and their configurations involved, is currently a blind-spot in the scanning landscape for most companies.
Improving configuration-scanning methodologies is a crucial aspect of prevention - and *certainly* in this case: “An ounce of prevention is worth a pound of cure” - or, in this case, a metric ton of cure.
Let’s start with the faulty update files. From the discussions, it appears that these files were customized for each customer. Verifying those files, and having custom policies in place (see the CoGuard Documentation), is absolutely essential.
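As a minimal sketch of what such a verification step could look like, the snippet below compares each customer-specific update file against a manifest of checksums recorded when the files passed testing. The manifest format, file names and paths are assumptions made for illustration; this is not CoGuard's actual policy syntax.

```python
import hashlib
import json
from pathlib import Path

def verify_against_manifest(build_dir: str, manifest_path: str) -> list[str]:
    """Compare each customer-specific update file to its expected SHA-256 hash.

    The manifest is assumed to be a JSON map of file name to the hash recorded
    when the file passed testing. Any missing file or mismatch is returned as a
    finding instead of being shipped.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    findings = []
    for name, expected_sha256 in manifest.items():
        file_path = Path(build_dir) / name
        if not file_path.is_file():
            findings.append(f"{name}: missing from build output")
            continue
        actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
        if actual != expected_sha256:
            findings.append(f"{name}: checksum mismatch, file changed after testing")
    return findings

if __name__ == "__main__":
    # Hypothetical paths for the example.
    for finding in verify_against_manifest("dist/", "release_manifest.json"):
        print("BLOCK RELEASE:", finding)
```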
Beyond that, the update process exhibited something that is a known and unfortunately common misconfiguration in such an update system: staged updates were not enabled.
In layman’s terms: instead of pushing the update out to everyone at once, you segment the systems and begin with a small percentage of them. If the update succeeds there, you push it out to a larger percentage, and repeat until you reach 100%.
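A minimal sketch of that wave-by-wave logic, assuming placeholder `push_update` and `is_healthy` hooks and illustrative wave sizes and soak time, could look like this:

```python
import time

def staged_rollout(systems, push_update, is_healthy,
                   wave_fractions=(0.01, 0.10, 0.50, 1.0), soak_seconds=3600):
    """Push an update in growing waves, halting as soon as a wave looks unhealthy.

    `push_update` and `is_healthy` stand in for whatever mechanism the update
    system actually uses; the wave sizes and soak time are illustrative only.
    """
    already_updated = 0
    for fraction in wave_fractions:
        wave_end = int(len(systems) * fraction)
        # Update only the systems added by this wave.
        for system in systems[already_updated:wave_end]:
            push_update(system)
        time.sleep(soak_seconds)  # let the wave run before judging it
        # If anything updated so far is unhealthy, stop before reaching 100%.
        if not all(is_healthy(system) for system in systems[:wave_end]):
            raise RuntimeError(f"Halting rollout: failures in the {fraction:.0%} wave")
        already_updated = wave_end
```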
The fact that their system was apparently not configured to perform staged/rolling updates to the connected systems should have been flagged by an automated scanning tool that takes configurations into account. But this level of configuration checking on the application layer does not seem to have been implemented for them. And that is the main problem here: wherever there is a configuration, it should certainly be scanned. Had the update system’s setup and its settings been scanned, this issue would have caught the developers’ attention. There are best practices and security considerations at every layer, and only automated tools are able to consider all of them.
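To make the idea concrete, here is a hypothetical configuration check that flags an update-distribution config with no staged rollout. The key names (`rollout.strategy`, `rollout.initial_percentage`) and the config file name are invented for the example and would need to be mapped to whatever the real update system exposes.

```python
import json

def check_rollout_config(config: dict) -> list[str]:
    """Flag an update-distribution config that pushes to 100% of systems at once."""
    findings = []
    rollout = config.get("rollout", {})
    # Assumed keys for illustration; a real scanner would map these to the
    # actual settings of the update system being checked.
    if rollout.get("strategy") != "staged":
        findings.append("rollout.strategy is not 'staged': all systems receive the update at once")
    if rollout.get("initial_percentage", 100) >= 100:
        findings.append("rollout.initial_percentage is 100: no small first wave to catch a bad update")
    return findings

if __name__ == "__main__":
    with open("update_system_config.json") as f:  # hypothetical config file
        for finding in check_rollout_config(json.load(f)):
            print("FINDING:", finding)
```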
Conclusion: Scan all configs - putting an end to the Great Unknown
Systems and code are massively complicated. That is why automated checkers are absolutely critical, and this is exactly why CoGuard exists. This is the BIG problem we solve in an elegant and simple way. Most companies have mastered code quality by using tools like SAST, DAST and fuzzing, but the same automated quality standards are not applied to the other critical piece of modern software: the attached infrastructure. We enable you to apply the same attention that is given to code quality to deployment and application-dependency configuration management. Even if you’re running highly customized software, we can help you create bespoke policies that enforce best practices with every commit. Scanning configurations is as paramount to the stability and security of your software as scanning your code, and it is crucial that you do it with the same enthusiasm and vigor.
Check out and explore a test environment to run infra audits on sample repositories of web applications and view select reports on CoGuard's interactive dashboard today.