Is Our Future a Blue Screen of Death?

James B. Meigs, September 2024, Commentary Magazine

Some say the world will end in fire, some say in ice.
Personally, my money’s on the blue screen of death.

On the morning of July 19, millions of users across Europe went to boot up their computers and encountered that dreaded blue screen known to IT experts by the highly technical acronym, BSOD. The software glitch disabled servers and business computers running Microsoft’s Windows operating system. (Mac and Linux systems, which have much smaller shares of the market, were spared.) The failures crippled businesses large and small, including supermarkets, banks, and airlines. Britain’s Sky News TV network was knocked off the air. As dawn broke in North America, the problem spread to our shores. Soon United, American, and other airlines were cancelling flights, hospitals began turning away patients, and 911 call centers went down. According to one estimate, the outage will cost Fortune 500 companies alone more than $5 billion in immediate losses.

IT specialists flooded Reddit message boards to commiserate with colleagues and share tips on rebooting broken networks. “This is what Y2K wishes it was,” wrote one. That widely anticipated turn-of-the-century digital apocalypse never happened (thanks mostly to years of careful preparation on the part of IT workers). But the widespread collapse of computer networks this past July was a reminder that vital aspects of our modern life depend on delicate digital systems that even experts don’t fully understand. The incident should have sparked global conversation. But since the breakdown took place during a busy news week—between the Trump shooting and Biden quitting the presidential race—it soon fell from the headlines. It is worth a closer look.

“We haven’t seen a cascading failure like this—maybe ever,” one digital-security expert told the Washington Post. This massive disruption of digital infrastructure didn’t come from Russian or Chinese hackers. It came from CrowdStrike, one of the world’s top firms devoted to defending computer networks from malicious attacks. The global security firm investigates and rebuffs hacking attempts on its customers. (It investigated the 2016 hack of Democratic National Committee emails, for example.) It also constantly rolls out software to protect its clients’ networks from the latest viruses and hacking techniques.

On July 19, CrowdStrike published an update to its Falcon cybersecurity program just after midnight (Eastern time). Realizing the release contained corrupted code, the company rolled it back 90 minutes later. Not fast enough. By then the update had been automatically downloaded by millions of computers. To make matters worse, CrowdStrike’s Falcon program runs in the operating system’s core, or “kernel,” which controls all other software and hardware functions. Once the faulty code was installed, the device would not even boot up. In many cases, restoring the crippled computers required an IT worker to manually delete the offending bit of code from every individual computer—a laborious process.

Microsoft estimates the CrowdStrike outage affected 8.5 million Windows devices. That’s an unprecedented breakdown. But it is also a reminder that the problem could have been so much worse: The afflicted computers represent less than 1 percent of Windows devices around the world. Still, the CrowdStrike crash, or BSOD24 as I’m calling it, should be a global wake-up call.

Four years ago, I wrote a column for COMMENTARY I called “Our Over-Connected World.” It discussed what disaster researchers call tightly coupled systems. Many vital forms of infrastructure (think railways, pipelines, and chemical plants) combine complex technologies in closely coordinated networks. When these tightly coupled systems work, which is virtually all the time, they make everything faster and more efficient. For example, the modern electric power grid links together many separate utilities in vast networks. This helps utilities quickly shuttle power to where it is needed. But it also means that problems in any part of the system can rapidly propagate across the entire grid. In 2003, for example, a short-circuit event on a single Ohio power line caused a cascading blackout that knocked out power to much of the eastern U.S. and Canada.

Tightly coupled systems are making our power grid, our supply chains, and our digital networks more prone to hair-trigger breakdowns. The problem is compounded when many different organizations rely on the same software or hardware. Most of the world’s businesses use the Windows operating system. And a large share of those rely on CrowdStrike software to protect them from hackers. That’s convenient for everyone. But the ubiquity of the Windows operating system is precisely what made so many networks vulnerable to hidden flaws in a routine software update. If our software platforms were more varied, no single problem would be as likely to take out a huge chunk of the world’s computers at once.

Fortunately, BSOD24 was fairly limited and quickly, if painfully, corrected. But what if a similar code error—or a deliberate cyberattack—was more widespread and lasted longer? If a massive failure took down, say, 10 percent of the world’s computers, the impact could be catastrophic. Within minutes or hours, the collapse would cascade into other forms of infrastructure: The Internet would start to crumble, power grids and pipelines would falter, banks and ATMs would shut down. If the outage lasted more than a few days, store shelves and gas stations would go empty. After that…well, choose your dystopia.

What can we do to ensure that our wonderful, high-tech world doesn’t turn into Cormac McCarthy’s The Road? A friend of mine who runs a car-repair business keeps his customer files on an ancient IBM computer running a long-forgotten program. It’s not cutting-edge, but it gets the job done. And since it isn’t connected to the Internet, it is unhackable and keeps his customers’ data safe. That solution won’t work for most applications, but it is always worth asking: Is the high-tech solution really better? Does everything have to be networked? (I recently wrote about going to some lengths to avoid Wi-Fi-connected appliances when I remodeled my kitchen. I don’t want my microwave talking to the cloud behind my back.) Those are small-scale examples, but all organizations should be exploring how to lower their exposure to tight-coupling risks.

Will companies really want to sacrifice a bit of profitable efficiency today in exchange for unquantifiable protection from rare black-swan events like the CrowdStrike outage? Just ask Delta Air Lines, which had to cancel 5,000 flights, wrestled with software gremlins for almost a week, and lost $500 million in the debacle.

As for regulations, I am leery about putting the government in charge of these decisions. The Fitch credit-rating agency noted that BSOD24 “highlights a growing risk of single points of failure,” in this case, the worldwide overreliance on a single operating system. It would be great if more software companies vied to enter this business. But more regulation would make that outcome less, rather than more, likely. Complex regulations usually favor entrenched competitors, which have the resources to follow complex rules and the political clout to shape those rules in their favor. Start-ups are rare in heavily regulated industries.

And regulations have unintended consequences. Why did Microsoft allow CrowdStrike software to operate at the kernel level where it could wreak such havoc? A company spokesperson told the Wall Street Journal that Microsoft is required to allow kernel access under a 2009 agreement with the European Commission. That deal was supposed to limit Microsoft’s supposed monopoly power by making it easier for outside companies like CrowdStrike to build security products for Windows users. In effect, the rule makes it hard for Microsoft to control what janky code might get installed in Windows systems around the world. Thanks for the help, EU!

Tight coupling isn’t going away anytime soon. I hope business leaders are looking at BSOD24 as a warning and are working to make their systems more resilient. And for everyone following along at home: I’d recommend stocking up on canned food. The networks that sustain our comfortable lives are shakier than they look.