Should you quit CrowdStrike?

The three weeks since the July 19 CrowdStrike outage, now known as the ‘Channel File 291 Incident’, have likely been some of the longest ever for IT teams. Just as when WannaCry ricocheted around the world in 2017, people collectively freaked out when blue screens of death (BSODs) began showing up in airports, hospitals, and offices – over 8.5 million endpoints were ultimately affected. The outage caused major travel and business interruptions, and the recovery process has been arduous.

Our integrated team of threat hunters, security experts, data scientists, and claims professionals in the Resilience Risk Operations Center worked to pinpoint and notify potentially affected clients, understand what was happening with their CrowdStrike instances and what the financial implications to their businesses might be, and guide them toward resources for remediation and recovery.

As we scoured intelligence sources for new information, we observed a range of discussions emerging from both insurance and cybersecurity circles. Media and experts called it a catastrophe, likening it to a category-level natural disaster. Discussions were rife about ripping and replacing CrowdStrike’s flagship EDR (endpoint detection and response) product. Others advocated for the overhaul of update practices to prevent the spread of bad update files.

We take a look at some of those reactions here with Rob Mealey, Director of Data Science, and Tyler Boire, Senior Intelligence Analyst, in the Resilience Risk Operations Center. Looking at these arguments through a lens of cyber risk helps clarify both our position on the problem (or catastrophe) and the best course of action going forward.

Don’t rip and replace a vendor over one incident

During our investigation into the CrowdStrike outage, the Resilience CTI team identified many posts on LinkedIn, Twitter, and Mastodon about organizations moving away from the CrowdStrike EDR platform over the disruption. While outages are a legitimate concern and reason to move away from a vendor, we urge organizations to avoid knee-jerk decisions about their security infrastructure.

With any vendor selection, an organization should also look at the pros and cons of switching to a new system and align that with their threat models. There is no promise that a new vendor would not fall victim to similar outages and place an organization in a worse position, having given up a vendor they knew well for a new system. On the flip side, if a vendor shows a history of outages, poor communication, and long delays in remediation, it may be time to consider the benefits of moving away from an existing vendor that is not meeting business needs.

With new vendors, there are always costs to consider beyond the tool’s price, including implementation time, costs in training staff, and fine-tuning your workflows to match business needs. These costs should be weighed against the potential of lost productivity and business interruption if another large-scale outage occurs with your existing vendor.
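One way to make this weighing concrete is a simple expected-cost comparison. The sketch below is illustrative only: the `VendorOption` fields, the three-year horizon, and every dollar figure and probability are hypothetical placeholders, not Resilience data or a recommended model. An organization would substitute its own estimates.

```python
from dataclasses import dataclass


@dataclass
class VendorOption:
    """Hypothetical cost inputs for one EDR vendor choice, all in dollars."""
    name: str
    annual_license: float
    one_time_switching_cost: float    # implementation, training, workflow tuning
    annual_outage_probability: float  # estimated chance of a large-scale outage per year
    outage_loss: float                # estimated lost productivity and business interruption


def expected_cost(option: VendorOption, years: int) -> float:
    """Expected total cost over the horizon: switching + license + expected outage loss."""
    recurring = option.annual_license + option.annual_outage_probability * option.outage_loss
    return option.one_time_switching_cost + years * recurring


# All values below are made up for illustration.
stay = VendorOption("current vendor", 200_000, 0, 0.05, 1_500_000)
switch = VendorOption("new vendor", 180_000, 400_000, 0.04, 1_500_000)

for option in (stay, switch):
    print(f"{option.name}: ${expected_cost(option, years=3):,.0f}")
```

With these placeholder inputs, the one-time switching cost outweighs the modest reduction in expected outage loss over three years. A longer horizon or a larger gap in outage probability could reverse that result, which is why the inputs matter more than the formula.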

Our data shows that CrowdStrike Falcon is very effective in preventing cyber attacks. Of the EDR vendors we track across our portfolio clients at Resilience, CrowdStrike has the lowest percentage of material claims, with fewer than 3% of clients with Falcon EDR experiencing an incurred claim. We cannot say with certainty that any incident is isolated, but it’s important to put any vendor’s track record in perspective.

As the technical leadership of an organization works through these cost-benefit analyses, business leaders will likely be pushing for or against change, either out of fear of another business interruption or not wanting to disrupt organizational inertia. These business leaders will need to be met in a language they understand, and the problem quantified in dollars and cents to help them make informed decisions with data rather than operate out of fear, uncertainty, and doubt.

Don’t radically change your update processes

Throughout the CrowdStrike issue, the topic of update cadence and testing has also been raised repeatedly. Pundits have claimed that organizations should have waited to see if anything broke in tests before rolling out updates, or that they should have rolled out in stages to identify any bad updates, as is common with operating system updates. In these cases, updates become a risk/reward calculation. While some organizations may opt to delay updates for days or even weeks to test thoroughly, anti-virus and EDR signatures are pushed to help defend against emerging threats and are often quite time-sensitive. As threat actors continue to shrink the gap between vulnerability disclosure and weaponization, timely security updates are becoming even more critical. Some software suites like Chrome Browser have even taken to updating a user’s browser automatically to ensure they are protected, even if the user is unaware.

At an organizational level, breaking changes could lead to business interruption, such as the CrowdStrike outage, and must be weighed against the risks and losses associated with attacks like ransomware throughout the organization. Updates may also affect productivity through downtime and the incompatibility of file formats in newer versions. Tools like databases, architectural design, photo manipulation, graphic design, and CAD software could introduce incompatibilities with older work, impacting a user’s ability to do their job.
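The trade-off between delaying an update to catch a bad release and staying exposed to the threat it addresses can be sketched as an expected-loss calculation. The model below is a deliberate simplification: the function, its parameters, and all probabilities and dollar amounts are hypothetical assumptions for illustration, and real exposure rarely follows a constant daily rate.

```python
def expected_loss(delay_days: int,
                  p_bad_update: float,
                  outage_loss: float,
                  daily_catch_rate: float,
                  daily_exploit_probability: float,
                  breach_loss: float) -> float:
    """Expected loss from holding an update back for `delay_days` of testing."""
    # Chance a bad update still slips through after the testing window.
    p_missed = (1 - daily_catch_rate) ** delay_days
    outage_risk = p_bad_update * p_missed * outage_loss
    # Chance the unaddressed threat is exploited at least once during the delay.
    p_exploited = 1 - (1 - daily_exploit_probability) ** delay_days
    exposure_risk = p_exploited * breach_loss
    return outage_risk + exposure_risk


# Hypothetical inputs; replace with your organization's own estimates.
for days in (0, 1, 3, 7):
    loss = expected_loss(days,
                         p_bad_update=0.001,
                         outage_loss=1_500_000,
                         daily_catch_rate=0.5,
                         daily_exploit_probability=0.002,
                         breach_loss=2_000_000)
    print(f"delay {days} days: expected loss ${loss:,.0f}")
```

Under these made-up numbers, immediate deployment has the lowest expected loss because the exposure risk grows faster than the testing benefit. An organization with a higher bad-update probability or a lower exploit rate would see a different answer, which is the point of running the calculation against its own risk tolerance.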

So, while some “Best Practices” may truly be the best course of action, modern system administrators and security teams need to consider the totality of the risk they are introducing or mitigating with their processes and procedures. What works for some organizations may be crippling to others, and the risk tolerance must be understood and agreed upon throughout the upper leadership at your organization. By understanding the business implications, costs, and risks associated with the processes and procedures you’re implementing, you can make better, safer choices that weigh these negative outcomes against the institutional “Best Practices” parroted by outsiders on social media.

Don’t compare it to a natural disaster

Many in the cyber insurance industry rely on the analogy of a natural disaster when discussing potential catastrophic scenarios, and this makes sense. It is the frame that most of the traditional insurers moving into cyber feel most comfortable with, as it is the frame they operate from in large parts of their other business. Unfortunately, it is not a very useful or accurate analogy when it comes to cyber events. In fact, it is fairly harmful, as it doesn’t accurately reflect the way IT and security teams operate during a crisis. This recent CrowdStrike incident goes a long way towards showing why.

When a powerful earthquake hits – especially in a densely populated area – the impact is immediate, widespread and not mitigatable in real time in any meaningful way. There is no way to start rebuilding the house while it’s collapsing around you. Mitigation and management are after-the-fact undertakings, where the focus is on regulation before the fact, and repair, rescue and restoration afterwards.

This is not true of cybersecurity, in fundamental ways. The only things that can truly be thought of as natural disasters in cyber are actual natural disasters. Most cyber policies don’t cover those. The analogy of hurricanes and earthquakes to actual covered events in cybersecurity is largely useless, and actively harmful to the understanding of people without deep cybersecurity knowledge.

It causes many to become overly focused on dramatic, simplistic scenarios that can be visualized as category hurricanes or earthquakes. Not only are truly catastrophic cyber scenarios far more unlikely than even the large cat modeling firms predict, but they are often so extreme that, if they did occur, the effect on the business of cyber insurance would be the least of our problems. More importantly, it causes many to spend their mental energy in imagined lands of solar flares and alien invasions instead of actual reality, where things like this CrowdStrike event can and do happen and can and must be anticipated and defended against.

Another important difference between cyber incidents and natural disasters is that there are at least two sides to every incident in cybersecurity: attackers and defenders, or vendors suffering severe accidental outages and all the clients of those vendors. It is possible, indeed essential, for an organization to start to rebuild the house while it is falling down, so to speak, during a cybersecurity incident. It is not possible to impede or rebuff an impending hurricane. It is possible, indeed a best practice, to actively defend against and redirect a cyber attack or incident while experiencing it.

There is ample reason to be skeptical of the predicted “precarity” of big picture systems, and to believe that the complex systems everything runs on are far more resilient than people give them credit for. This is not to say that there haven’t been and won’t be very real and scary failures, but considering how much bluster and catastrophizing is published in newspapers and blogs about potential failures, the actual rate of failure is pretty low. This was as apparent in this recent system-failure-driven event as it has been in the past with attacker-driven events like WannaCry, where massive collective responses were activated in the immediate aftermath of the event that almost certainly dramatically limited the eventual scope and magnitude of the impact.

To be clear, this is not to say we should be complacent. But nor should we operate from a place of pure anxiety. Our industry must build tools with clear, interrogatable assumptions based on data and collective comprehensive expert risk assessments. And we must validate these tools and iterate over them, ensuring that they continue to reflect the reality of our business in a useful and actionable way.

Don’t panic

As the dust settles, we are coming to understand more about what caused the CrowdStrike outage and what the downstream impact has been on our clients and the industry at large. We are forced to see that this kind of event may be more common in years to come as the interconnected nature of our tech ecosystem makes things fragile and volatile.

To take accurate stock of what we should do, we need to move past the analogies that are chaining us to overly simplistic views of our risks and further develop and deepen our understanding of what is actually taking place, how it is taking place, and why. This approach will allow us to both protect ourselves and our clients, and at the same time, allow us to support the innovation and take advantage of the opportunities that are essential for us to actually deliver on the promise of cyber resilience.

We also need to resist the urge to make reactionary changes to our practices and instead adopt a mindset of continuous improvement and increased resilience to the most financially damaging threats.

__________

Learn more about why managing third-party risks is crucial in our webinar, Strengthening your Security: A Guide to Vulnerability Management, on Thursday, September 5, 2024, 11 AM-12 PM ET.