Whether you’re describing the sound of a faulty car engine or meowing like your neighbor’s cat, imitating sounds with your voice can be a helpful way to convey a concept when words don’t work.
Voice imitation is the audio equivalent of drawing a quick picture to convey something you’ve seen – except instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. It may seem challenging, but it’s something we all do intuitively: to experience it for yourself, try using your voice to mirror the sound of an ambulance siren, a crow or the ring of a bell.
Inspired by cognitive science research on how we communicate, researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an artificial intelligence system that can produce human-like vocal imitations without being trained on, or ever having “heard,” a human vocal impression.
To achieve this, the researchers designed their system to produce and interpret sounds much like we do. They started by building a model of the human vocal system that simulates how vibrations from the voice box are shaped by the throat, tongue and mouth. They then used a cognitively inspired artificial intelligence algorithm to control the vocal tract model and make it generate imitations, taking into account the context-specific ways in which humans communicate sound.
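The paper’s vocal tract model isn’t reproduced here, but the classic source-filter idea it builds on (a voice-box excitation shaped by resonances standing in for the throat, tongue and mouth) can be sketched in a few lines of Python. The snippet below is a minimal illustration with approximate formant values, not the authors’ implementation; a control algorithm like the one described above would then search over parameters such as pitch and resonance frequencies to make a synthesizer of this kind match a target sound.

```python
# Minimal source-filter sketch (not the authors' model): a glottal impulse train
# stands in for the voice box, and a cascade of resonant filters stands in for
# the shaping done by the throat, tongue and mouth.
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz

def glottal_source(f0, duration):
    """Rough glottal excitation: an impulse train at pitch f0 (Hz)."""
    n = int(SR * duration)
    source = np.zeros(n)
    source[::int(SR / f0)] = 1.0
    return source

def formant_filter(signal, freq, bandwidth):
    """Second-order resonator approximating one vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth / SR)
    theta = 2 * np.pi * freq / SR
    # Two-pole filter: y[n] = x[n] + 2*r*cos(theta)*y[n-1] - r^2*y[n-2]
    return lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], signal)

def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170))):
    """Shape the source with a few formants (values roughly approximate an 'ah')."""
    out = glottal_source(f0, duration=0.5)
    for freq, bw in formants:
        out = formant_filter(out, freq, bw)
    return out / np.max(np.abs(out))  # normalize to [-1, 1]

audio = synthesize_vowel()  # an imitation algorithm would tune f0 and the formants
                            # over time to match the target sound's features
```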
The model can take many sounds from the world and generate a human-like imitation of them – including sounds such as rustling leaves, the hiss of a snake, and an approaching ambulance siren. The model can also be run in reverse to guess real-world sounds from human vocal imitations, much as some computer vision systems can recover high-quality images from sketches. For example, the model can correctly distinguish the sound of a human imitating a cat’s “meow” from its “hiss.”
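One way to picture this reverse direction (a hypothetical sketch, not the paper’s algorithm) is to score each candidate real-world sound by how closely its acoustic features match those of the imitation, then pick the best match:

```python
# Hypothetical sketch of the "reverse" direction: given an imitation, score each
# candidate real-world sound by feature similarity and return the closest one.
# The feature vectors below are made up purely for illustration.
import numpy as np

# Toy features: (pitch in Hz, noisiness 0-1, duration in s).
CANDIDATES = {
    "cat meow":        np.array([600.0, 0.2, 0.8]),
    "cat hiss":        np.array([0.0,   0.9, 1.2]),
    "ambulance siren": np.array([900.0, 0.1, 3.0]),
}

def infer_source(imitation, candidates=CANDIDATES,
                 scale=np.array([1000.0, 1.0, 5.0])):
    """Return the candidate whose (rescaled) features are closest to the imitation."""
    return min(candidates,
               key=lambda name: np.linalg.norm((candidates[name] - imitation) / scale))

# A pitched, tonal, short imitation is classified as a meow rather than a hiss.
print(infer_source(np.array([550.0, 0.25, 0.7])))  # -> "cat meow"
```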
In the future, this model could potentially lead to more intuitive, “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.
The co-authors — MIT CSAIL graduate students Kartik Chandra SM ’23 and Karima Ma and undergraduate researcher Matthew Caren — note that computer graphics researchers have long recognized that realism is rarely the ultimate goal of visual expression. For example, an abstract painting or a child’s scribble made in crayon can be as expressive as a photograph.
“Over the past few decades, advances in sketching algorithms have resulted in new tools for artists, advances in artificial intelligence and computer vision, and even a deeper understanding of human cognition,” notes Chandra. “In the same way that a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-photorealistic way that people express the sounds they hear. This teaches us about the process of auditory abstraction.”
“The goal of this project was to understand and computationally model voice imitation, which we consider the auditory equivalent of sketching in the visual domain,” says Caren.
The art of imitation in three parts
The team developed three increasingly refined versions of the model to compare to human vocal imitations. First, they created a baseline model that simply aimed to generate imitations as acoustically close to real-world sounds as possible, but this model did not match human behavior very well.
The researchers then designed a second, “communicative” model. According to Caren, this model considers what is distinctive about a sound to a listener. For example, you’d likely imitate the sound of a motorboat by mimicking the rumble of its engine, since that is its most distinctive auditory feature, even if it is not the loudest aspect of the sound (compared to, say, the splashing of water). This second model produced imitations that were better than the baseline, but the team wanted to improve it even further.
To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different depending on the amount of effort you put into them. Producing perfectly accurate sounds takes time and energy,” says Chandra. The researchers’ full model accounts for this by trying to avoid very rapid, loud, or high- or low-pitched utterances that people are less likely to use in conversation. The result: human-like imitations that closely match many of the decisions humans make when imitating the same sounds.
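In outline, this three-step progression can be read as a series of scoring functions that an imitation is chosen to minimize: pure acoustic distance, then distance weighted by what is distinctive to a listener, then that plus a cost for effortful utterances. The sketch below is a schematic paraphrase with hypothetical feature vectors and weights, not the authors’ code.

```python
# Schematic paraphrase of the three model variants described above, using
# hypothetical feature vectors and weights (not the authors' implementation).
import numpy as np

def acoustic_distance(imitation, target):
    """Baseline model: how far the imitation's features are from the target's."""
    return np.linalg.norm(imitation - target)

def communicative_distance(imitation, target, distinctiveness):
    """Communicative model: weight each feature by how distinctive it is to a
    listener, so the motorboat's rumble matters more than the splashing."""
    return np.linalg.norm(distinctiveness * (imitation - target))

def effort_cost(imitation, comfort_low, comfort_high):
    """Full model's extra term: penalize utterances outside a comfortable range
    (extremes of pitch, loudness or speed)."""
    return np.sum(np.maximum(0, comfort_low - imitation) +
                  np.maximum(0, imitation - comfort_high))

def full_model_score(imitation, target, distinctiveness,
                     comfort_low, comfort_high, effort_weight=0.5):
    """Lower is better: convey the distinctive features, but don't strain."""
    return (communicative_distance(imitation, target, distinctiveness)
            + effort_weight * effort_cost(imitation, comfort_low, comfort_high))

def choose_imitation(candidates, *score_args):
    """Pick the candidate imitation with the best (lowest) full-model score."""
    return min(candidates, key=lambda imit: full_model_score(imit, *score_args))

# Example with normalized features (pitch, loudness, duration), each in [0, 1]:
# an exact copy of an extreme target loses to a slightly-off but comfortable one.
target = np.array([0.05, 0.95, 0.9])          # very low, very loud, very long
distinct = np.array([1.0, 0.2, 0.2])          # pitch is the distinctive cue
low, high = np.array([0.2, 0.2, 0.1]), np.array([0.8, 0.8, 0.6])
candidates = [target.copy(), np.array([0.2, 0.8, 0.6])]
print(choose_imitation(candidates, target, distinct, low, high))  # -> [0.2 0.8 0.6]
```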
After building this model, the team conducted a behavioral experiment to see whether human judges rated the AI-generated or human-generated vocal imitations as better. Notably, participants favored the AI model 25 percent of the time overall, and as much as 75 percent of the time for an imitation of a motorboat and 50 percent for an imitation of a gunshot.
Towards more expressive sound technology
Caren, who is passionate about technology for music and art, predicts that this model could help artists better communicate sounds to computational systems, and help filmmakers and other content creators generate AI sounds that are more context-specific. It could also let a musician quickly search a database of sounds by imitating a noise that is hard to describe in, say, a text prompt.
Meanwhile, Caren, Chandra and Ma are looking at the implications of their model in other areas, including language development, how infants learn to speak, and even imitative behavior in birds such as parrots and songbirds.
The current version of the model still has issues to work out: it struggles with certain consonants, such as “z,” which led to inaccurate imitations of some sounds, such as the buzzing of bees. The researchers also can’t yet replicate the way humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Stanford University linguistics professor Robert Hawkins says language is full of onomatopoeia: words that imitate, but don’t fully reproduce, the things they describe, such as the sound “meow,” which only loosely approximates the sound cats make. “The processes that take us from the sound of a real cat to a word like ‘meow’ reveal much about the complex interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who was not involved in the CSAIL study. “This model represents an exciting step toward formalizing and testing theories of these processes, demonstrating that both physical constraints from the human vocal system and social pressures from communication are needed to explain the distribution of vocal imitations.”
Caren, Chandra and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, an associate professor in MIT’s Department of Electrical Engineering and Computer Science, and Joshua Tenenbaum, a professor in MIT’s Department of Brain and Cognitive Sciences and a member of the Center for Brains, Minds and Machines. Their work was supported in part by the Hertz Foundation and the National Science Foundation. It was presented at SIGGRAPH Asia in early December.