OpenAI blames one of the longest outages in its history on a “new telemetry service” gone wrong.
On Wednesday, OpenAI’s AI-powered chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced major disruptions beginning around 3 p.m. Pacific time. OpenAI acknowledged the problem shortly thereafter and began working on a fix, but it would take the company about three hours to restore all services.
In a postmortem published late Thursday, OpenAI wrote that the outage was not caused by a security incident or a recent product launch, but by a telemetry service it deployed on Wednesday to collect Kubernetes metrics. Kubernetes is an open source system that manages containers (application packages and their associated files) used to run software in isolated environments.
“Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused… resource-intensive Kubernetes API operations,” OpenAI wrote in the postmortem. “[Our] Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large [Kubernetes] clusters.”
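The postmortem doesn’t spell out which API calls were involved, but the failure mode it describes is a familiar one. Here is a minimal, hedged sketch in Python, using the official kubernetes client, of the difference between an unbounded cluster-wide LIST (the kind of resource-intensive operation that can overwhelm API servers when issued from thousands of nodes at once) and a paginated variant; the code is illustrative, not OpenAI’s.

```python
# A hedged sketch (not OpenAI's actual service) of how a telemetry
# collector talks to the Kubernetes API, and why a careless query
# pattern gets expensive. Requires the official client:
#   pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Expensive pattern: an unbounded LIST of every pod in every namespace.
# The API server must serialize the entire object set into one response;
# issued from thousands of nodes at once, requests like this can
# overwhelm the control plane.
all_pods = v1.list_pod_for_all_namespaces(watch=False)
print(f"saw {len(all_pods.items)} pods in a single call")

# Gentler pattern: paginate with limit/continue so each request stays small.
total, token = 0, None
while True:
    page = v1.list_pod_for_all_namespaces(limit=500, _continue=token)
    total += len(page.items)
    token = page.metadata._continue
    if not token:
        break
print(f"saw {total} pods across paginated calls")
```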
Strip away the jargon and the upshot is this: the new telemetry service swamped OpenAI’s Kubernetes operations, including a component that many of the company’s services rely on for DNS resolution. DNS resolution converts domain names into IP addresses; it’s the reason you can type “google.com” instead of “142.250.191.78”.
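Here’s what that looks like in practice, a minimal Python example using only the standard library. When in-cluster DNS breaks, lookups like this one start failing even though the servers behind the names are still up.

```python
# DNS resolution in miniature: turning a name into the addresses machines
# actually connect to. Standard library only.
import socket

# getaddrinfo is the lookup most clients use under the hood; it's why you
# can type "google.com" instead of an address like 142.250.191.78.
for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
    "google.com", 443, proto=socket.IPPROTO_TCP
):
    print(sockaddr[0])
```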
OpenAI’s use of DNS caching, which stores information about previously looked-up domain names (such as website addresses) and their corresponding IP addresses, complicated matters by “delay[ing] visibility,” OpenAI wrote, and “allow[ing] the rollout [of the telemetry service] to continue until the full scope of the problem was understood.”
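To see why caching delays visibility, consider a toy TTL cache, a hedged sketch rather than anything OpenAI has described: cached answers keep being served until they expire, so a dead resolver can look healthy for minutes.

```python
# A toy TTL-based DNS cache, illustrating how caching can mask a resolver
# outage: answers keep being served until they expire. Illustrative only;
# OpenAI hasn't published its caching setup.
import time

class TTLDNSCache:
    def __init__(self, resolve, ttl_seconds=300):
        self._resolve = resolve   # the real lookup function
        self._ttl = ttl_seconds
        self._cache = {}          # name -> (ip, expiry timestamp)

    def lookup(self, name):
        ip, expiry = self._cache.get(name, (None, 0.0))
        if time.monotonic() < expiry:
            return ip             # served from cache: no real DNS traffic
        ip = self._resolve(name)  # only here does a dead resolver surface
        self._cache[name] = (ip, time.monotonic() + self._ttl)
        return ip

answers = {"api.internal": "10.0.0.7"}
def fake_resolve(name):
    if name not in answers:
        raise RuntimeError("resolver unreachable")
    return answers[name]

cache = TTLDNSCache(fake_resolve)
print(cache.lookup("api.internal"))  # 10.0.0.7, fetched from the resolver
answers.clear()                      # simulate DNS breaking cluster-wide
print(cache.lookup("api.internal"))  # still 10.0.0.7, served from cache
```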
OpenAI says it was able to detect the issue “a few minutes” before customers ultimately started seeing the impact, but it was unable to quickly roll out a fix because it had to work around the overwhelmed Kubernetes servers.
“This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote. “Our tests didn’t catch the impact the change was having on the Kubernetes control plane [and] remediation was very slow because of the locked-out effect.”
OpenAI says it will adopt several measures to prevent similar incidents in the future, including improved phased rollouts with better monitoring of infrastructure changes, and new mechanisms to ensure that OpenAI engineers can access the company’s Kubernetes API servers under any circumstances.
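OpenAI hasn’t said what those phased rollouts will look like in practice, but the general shape of such a guardrail is well understood. Below is a hypothetical sketch; every name in it (deploy_to, control_plane_healthy, the stage lists) is invented for illustration and shouldn’t be read as OpenAI’s actual tooling.

```python
# A hypothetical staged-rollout gate: deploy to a few clusters, let the
# change soak, verify control-plane health, and halt before the blast
# radius grows. All names here (deploy_to, control_plane_healthy, the
# stage lists) are invented for illustration.
import time

STAGES = [["canary-1"], ["prod-1", "prod-2"], ["prod-3", "prod-4", "prod-5"]]

def staged_rollout(deploy_to, control_plane_healthy, soak_seconds=600):
    for stage in STAGES:
        for cluster in stage:
            deploy_to(cluster)
        time.sleep(soak_seconds)  # give problems time to surface
        unhealthy = [c for c in stage if not control_plane_healthy(c)]
        if unhealthy:
            raise RuntimeError(f"rollout halted; unhealthy: {unhealthy}")
```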
“We apologize for the impact that this incident caused to all of our customers – from ChatGPT users, to developers, to businesses that rely on OpenAI products,” OpenAI wrote. “We fell short of our own expectations.”