Just before 1 a.m. local time Friday, a systems administrator at a West Coast funeral and mortuary services company woke up suddenly to find his computer screen glowing. When he checked his work phone, it was full of messages about what his colleagues called a network problem. Their entire infrastructure was down, threatening to disrupt funerals and burials.
It soon became clear that the massive disruption was caused by the CrowdStrike outage. The security firm inadvertently caused chaos around the world on Friday and through the weekend after distributing flawed software to its Falcon monitoring platform, crippling airlines, hospitals, and other businesses both small and large.
The administrator, who asked not to be identified because he is not authorized to speak publicly about the outage, immediately got to work. He ended up working nearly 20 hours a day, driving from morgue to morgue and personally resetting dozens of computers to fix the problem. The situation was urgent, the administrator explained, because the computers had to be back online to avoid disruptions in funeral service schedules and morgue communications with hospitals.
“With a problem as widespread as the CrowdStrike outage, it was logical to make sure our company was ready to go so we could accommodate these families so they could get services and be with their family members,” the systems administrator says. “People are grieving.”
The faulty CrowdStrike update bricked roughly 8.5 million Windows computers worldwide, sending them into a spiral of blue screens of death (BSODs). “The trust we had built up over the years in drips was lost in buckets in a matter of hours, and that was a punch in the gut,” Shawn Henry, Chief Security Officer at CrowdStrike, wrote on LinkedIn early Monday morning. “But that was nothing compared to the pain we caused our customers and partners. We let down the very people we swore to protect.”
Cloud platform outages and other software issues, including malicious cyberattacks, have caused major IT outages and global disruptions before. But last week’s incident was particularly notable for two reasons. First, it stemmed from a bug in software designed to support and defend networks, not harm them. Second, fixing the problem required direct access to each affected machine: a person had to manually boot each computer into Windows Safe Mode and apply a patch.
IT is often a thankless, unglamorous job, but the CrowdStrike disaster was a next-level test. Some IT pros had to coordinate with remote workers or multiple overseas locations, guiding them through manual device resets. One junior systems administrator for a fashion brand, based in Indonesia, had to figure out how to overcome language barriers to do so. “It was daunting,” he says.
“We don’t get noticed until something bad happens,” a systems administrator at a Maryland health care facility told WIRED.
This person was woken shortly before 1:00 a.m. ET. Screens at the organization’s physical locations turned blue and unresponsive. Their team spent several early morning hours getting servers back online, then had to manually repair more than 5,000 other devices across the company. The outage blocked phone calls to the hospital and upended the medication dispensing system—everything had to be written down manually and run to the pharmacy on foot.
