Google DeepMind has unveiled a new AI tool for generating video soundtracks. In addition to using a text prompt to generate audio, DeepMind's tool also takes the contents of the video into account.
According to DeepMind, by combining the two, users can use the tool to create scenes with “dramatic scores, realistic sound effects, or dialogue that match the characters and tone of the film.” You can see some examples on the DeepMind website, and they sound pretty good.
For a video depicting a car driving through a cyberpunk cityscape, Google used the prompt “car skidding, car engine throttling, angelic electronic music” to generate the audio. You can see how the skidding sounds match up with the car’s movement. Another example creates an underwater soundscape using the prompt “jellyfish pulsating underwater, sea life, ocean.”
Although users can include a text prompt, DeepMind says this is optional. Users also do not have to carefully match the generated sound to the appropriate scenes. According to DeepMind, the tool can also generate an “unlimited” number of soundtracks for a video, giving users an endless supply of audio options.
This could help it stand out from other AI tools, such as ElevenLabs’ sound effects generator, which uses text prompts to generate sound. It could also make it easier to pair audio with AI-generated video from tools like DeepMind’s Veo and OpenAI’s Sora (the latter of which plans to eventually add audio).
DeepMind says it trained its AI tool on video, audio, and annotations containing “detailed descriptions of sounds and transcriptions of spoken dialogue.” This allows the video-to-audio generator to match audio events to visual scenes.
The tool still has some limitations. For example, DeepMind is trying to improve its ability to synchronize lip movements with dialogue, as seen in a video of a claymation family. DeepMind also notes that its video-to-audio system depends on image quality, so anything that is grainy or distorted “can lead to a noticeable drop in audio quality.”
