On Saturday, an Associated Press investigation revealed that OpenAI’s Whisper transcription tool produces fabricated text in medical and business settings despite warnings against such use. The AP interviewed more than a dozen software engineers, programmers, and researchers who found that the model regularly invents text that speakers never said, a phenomenon often called “confabulation” or “hallucination” in the AI field.
Upon its release in 2022, OpenAI stated that Whisper achieved “human-level robustness” in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper created fake text in 80 percent of the public meeting transcripts he examined. Another developer, unnamed in the AP report, said he found made-up content in almost all of the 26,000 test transcripts he reviewed.
Fabrications pose particular risks in health care settings. Despite OpenAI’s warnings against using Whisper for “high-risk domains,” the AP report found that more than 30,000 medical workers currently use Whisper-based tools to transcribe patient visits. Minnesota’s Mankato Clinic and Children’s Hospital Los Angeles are among 40 health systems using a Whisper-powered AI copilot service from health technology company Nabla that is fine-tuned on medical terminology.
Nabla acknowledges that Whisper may confabulate, but it also reportedly deletes the original audio recordings “for data security reasons.” This can cause additional problems, because doctors cannot verify a transcript’s accuracy against the source material. Deaf patients may be especially affected by incorrect transcriptions, since they would have no way of knowing whether a medical transcript matches the audio.
The potential problems with Whisper extend beyond health care. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found that Whisper added nonexistent violent content and racist commentary to neutral speech. They found that 1 percent of the samples contained “entire hallucinatory phrases or sentences that did not exist in any form in the underlying audio material,” and 38 percent of those included “clear harms such as perpetuating violence, creating inaccurate associations, or suggesting false authority.”
In one instance from the study cited by the AP, when a speaker described “two other girls and one woman,” Whisper added fictitious text specifying that “they were black.” In another, a recording said: “He, the boy, was going, I’m not sure exactly, to take the umbrella.” Whisper transcribed it as: “He took a big piece of the cross, a tiny little piece… I’m sure he didn’t have a terrorist knife, so he killed a lot of people.”
An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings, is actively studying how to reduce fabrications, and incorporates feedback into model updates.
Why Whisper confabulates
The key to Whisper’s unsuitability for high-risk domains is its tendency to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Scientists aren’t sure why Whisper and tools like it cause hallucinations,” but that’s not true. We know exactly why Transformer-based AI models like Whisper behave this way.
Whisper is based on technology that is designed to predict the next most likely token (piece of data) that should appear after a sequence of tokens provided by the user. In the case of ChatGPT, input tokens are in the form of a text prompt. For Whisper, the input is tokenized audio data.
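To make that concrete, here is a minimal sketch of how the open-source `openai-whisper` Python package is typically invoked; the audio file name and the confidence thresholds below are illustrative assumptions, not values from the AP report or the studies it cites. Because the model generates the transcript token by token, a caller can only flag segments the model itself scored as unlikely for human review; it cannot prevent fabricated text outright.

```python
# Minimal sketch using the open-source openai-whisper package
# (pip install openai-whisper). File name and thresholds are hypothetical.
import whisper

model = whisper.load_model("base")  # downloads the "base" checkpoint on first use

result = model.transcribe(
    "patient_visit.wav",               # hypothetical audio file
    condition_on_previous_text=False,  # decode each window without carrying over prior text,
                                       # which reduces runaway repetition between segments
)

# The output is still generated text, so flag low-confidence segments
# for human review rather than trusting the transcript blindly.
for seg in result["segments"]:
    suspicious = seg["avg_logprob"] < -1.0 or seg["no_speech_prob"] > 0.6
    marker = " [REVIEW]" if suspicious else ""
    print(f'[{seg["start"]:7.2f} - {seg["end"]:7.2f}] {seg["text"].strip()}{marker}')
```

The per-segment `avg_logprob` and `no_speech_prob` values are the model’s own confidence scores; thresholds like those above only surface likely problem spots and do not stop the underlying token-prediction process from inventing plausible-sounding text.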