Google DeepMind's new artificial intelligence tool uses video pixels and text prompts to generate audio tracks

Google DeepMind has taken the wraps off a new AI tool for generating video soundtracks. In addition to using text prompts to generate audio, DeepMind's tool also takes the video content into account.

According to DeepMind, by combining the two, users can use the tool to create scenes with “dramatic scores, realistic sound effects, or dialogue that match the characters and tone of the film.” You can see some examples on the DeepMind website, and they sound pretty good.

For a video depicting a car driving through a cyberpunk cityscape, Google used the prompt “car skidding, car engine throttling, angelic electronic music” to generate the audio. You can see how the skidding sounds line up with the car’s movement. Another example creates an underwater soundscape using the prompt “jellyfish pulsating underwater, sea life, ocean”.

Although users can include a text prompt, DeepMind says this is optional. Users also do not have to carefully match the generated sound to the appropriate scenes. According to DeepMind, the tool can also generate an “unlimited” number of soundtracks for a video, so users can keep producing audio options until one fits.

This could help it stand out from other AI tools, such as ElevenLabs’ sound effects generator, which uses text prompts to generate sound. It could also make it easier to pair audio with AI-generated video from tools like DeepMind’s Veo and OpenAI’s Sora (the latter of which plans to eventually enable audio).

DeepMind says it trained its AI tool on video, audio, and annotations containing “detailed descriptions of sounds and transcriptions of spoken dialogue.” This allows the video-to-audio generator to match audio events to visual scenes.

The tool still has some limitations. For example, DeepMind is trying to improve its ability to synchronize lip movement with dialogue, as seen in a demo video of a claymation family. DeepMind also notes that its video-to-audio system depends on video quality, so anything that is grainy or distorted “can lead to a noticeable drop in audio quality.”
