Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and build mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we're unlocking richer and more engaging digital experiences.
Over the past few years, we've been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, tempo controls and particular voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube auto dubbing, and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Working with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue to make complex content more accessible:
- NotebookLM Audio Overviews turns uploaded documents into an engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics, and banter back and forth.
- Illuminate creates formal AI-generated discussions about research articles to make knowledge more accessible and digestible.
Here, we provide an overview of our latest speech generation research, which underpins all of these products and experimental tools.
Pioneering techniques for audio generation
For years, we've been investing in audio generation research and exploring new ways to generate more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.
This extended our earlier work on SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.
SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a sequence of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
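To make the codec's role concrete, here is a minimal, illustrative sketch of the interface such a neural audio codec exposes: a waveform goes in, a sequence of discrete acoustic tokens comes out, and those tokens can be decoded back into audio. The `ToyAudioCodec` class, its random codebook and the 320-sample frame size are hypothetical stand-ins, not SoundStream's actual architecture, which learns its encoder, quantizer and decoder end to end.

```python
import numpy as np

class ToyAudioCodec:
    """Toy stand-in for a SoundStream-style codec: waveform in, acoustic tokens out, waveform back."""

    def __init__(self, frame_size=320, codebook_size=1024, seed=0):
        self.frame_size = frame_size
        # A real neural codec learns its codebook end to end; here it is random.
        rng = np.random.default_rng(seed)
        self.codebook = rng.standard_normal((codebook_size, frame_size)).astype(np.float32)

    def encode(self, audio: np.ndarray) -> np.ndarray:
        """Map a waveform to one discrete acoustic token (a codebook index) per frame."""
        n_frames = len(audio) // self.frame_size
        frames = audio[: n_frames * self.frame_size].reshape(n_frames, self.frame_size)
        # Nearest-neighbour quantization: pick the codebook entry closest to each frame.
        dists = ((frames[:, None, :] - self.codebook[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        """Reconstruct a waveform by concatenating the codebook entries the tokens point to."""
        return self.codebook[tokens].reshape(-1)

codec = ToyAudioCodec()
one_second = np.sin(np.linspace(0, 200 * np.pi, 16000)).astype(np.float32)  # 1 s of a tone at 16 kHz
tokens = codec.encode(one_second)       # 50 tokens for 1 second of audio at these settings
reconstruction = codec.decode(tokens)   # same length as the framed input
print(tokens.shape, reconstruction.shape)
```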
AudioLM treats audio generation as a language modeling task, producing the acoustic tokens of codecs such as SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments, making it a good candidate for modeling multi-speaker dialogues.
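The language-modeling framing can likewise be sketched in a few lines: generation is simply sampling the next acoustic token given everything generated so far, exactly as a text model samples the next word. In the sketch below, the next-token distribution is random rather than coming from a trained network, and the vocabulary size of 1,024 is an arbitrary placeholder.

```python
import numpy as np

def next_token_distribution(context: np.ndarray, vocab_size: int, rng) -> np.ndarray:
    """Stand-in for a trained model's p(next acoustic token | context); here it is random."""
    logits = rng.standard_normal(vocab_size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate_acoustic_tokens(prompt, n_new, vocab_size=1024, seed=0):
    """Autoregressive generation: each new acoustic token is conditioned on all
    previous ones, exactly as a text language model conditions each new word."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        probs = next_token_distribution(np.array(tokens), vocab_size, rng)
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return np.array(tokens)

# The resulting token sequence would be handed to a codec decoder to become audio.
tokens = generate_acoustic_tokens(prompt=[3, 17, 512], n_new=100)
print(tokens.shape)  # (103,)
```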
Building on this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of the dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
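The speed claim follows directly from the quoted numbers; a quick back-of-the-envelope check:

```python
audio_seconds = 2 * 60        # 2 minutes of generated dialogue
compute_seconds = 3           # upper bound on the single-pass TPU v5e inference time
real_time_factor = audio_seconds / compute_seconds
print(real_time_factor)       # 40.0, i.e. at least 40 times faster than real time
```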
Scaling our audio generation models
Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at as low as 600 bits per second, without compromising output quality.
The tokens produced by our codec have a hierarchical structure and are grouped by time frame. The first tokens in the group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
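One way to picture this coarse-to-fine grouping is as a small data structure, sketched below with hypothetical names and token counts per frame; the codec's exact layout is not specified here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FrameTokens:
    """Acoustic tokens for one time frame, ordered coarse to fine."""
    coarse: List[int]  # first tokens: phonetic and prosodic content
    fine: List[int]    # last tokens: fine acoustic detail

def flatten(frames: List[FrameTokens]) -> List[int]:
    """Serialize the frame groups into the single token sequence the model sees,
    preserving the coarse-to-fine order within each frame."""
    sequence: List[int] = []
    for frame in frames:
        sequence.extend(frame.coarse)
        sequence.extend(frame.fine)
    return sequence

# Two frames with (hypothetically) 2 coarse and 4 fine tokens each.
frames = [FrameTokens([12, 803], [5, 990, 41, 7]),
          FrameTokens([12, 640], [88, 3, 512, 256])]
print(flatten(frames))  # 12 tokens covering 2 time frames
```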
Even with our new speech codec, producing a 2-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
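For a rough sense of how these figures fit together, here is a back-of-the-envelope calculation combining the 600 bits-per-second rate with the 5,000-token count; the implied bits-per-token and codebook size are our own inference, not a stated specification.

```python
bitrate_bps = 600                 # codec bitrate quoted above
dialogue_seconds = 2 * 60         # a 2-minute dialogue
total_bits = bitrate_bps * dialogue_seconds
n_tokens = 5000                   # "over 5,000 tokens"

print(total_bits)                 # 72,000 bits for the whole dialogue
print(total_bits / n_tokens)      # ~14.4 bits per token, roughly a 2**14-entry codebook
```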
With this technique, we can efficiently generate the acoustic tokens corresponding to a dialogue within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
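Putting the pieces together, a dialogue-generation pipeline of the kind described here could be sketched as below. Every function is a hypothetical placeholder (a byte-level text front end, a random token sampler, a noise-emitting decoder) meant only to show how the script with speaker turn markers, the single autoregressive pass and the codec decode step connect; none of it reflects the actual model.

```python
import numpy as np

def script_to_prompt_tokens(script: str) -> np.ndarray:
    """Hypothetical text front end: turn the dialogue script, including the
    speaker turn markers, into a conditioning sequence (here, raw bytes)."""
    return np.frombuffer(script.encode("utf-8"), dtype=np.uint8).astype(np.int64)

def generate_acoustic_tokens(prompt: np.ndarray, n_new: int, vocab_size: int = 1024) -> np.ndarray:
    """Placeholder for the single autoregressive inference pass: emits random token IDs,
    seeded by the prompt so the 'conditioning' at least changes the output."""
    rng = np.random.default_rng(int(prompt.sum()) % (2 ** 32))
    return rng.integers(0, vocab_size, size=n_new)

def decode_to_waveform(tokens: np.ndarray, samples_per_token: int = 320) -> np.ndarray:
    """Placeholder for the speech codec's decoder: tokens back to a waveform (noise here)."""
    rng = np.random.default_rng(1)
    return rng.standard_normal(len(tokens) * samples_per_token).astype(np.float32)

script = "[Speaker A] Welcome back to the show. [Speaker B] Thanks, great to be here."
prompt = script_to_prompt_tokens(script)
acoustic_tokens = generate_acoustic_tokens(prompt, n_new=250)  # one autoregressive pass
waveform = decode_to_waveform(acoustic_tokens)                 # codec decode step
print(len(waveform) / 16000, "seconds of placeholder audio")   # assuming 16 kHz output
```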
To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. We then fine-tuned it on a much smaller dialogue dataset with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors and realistic disfluencies: the “umms” and “aahs” of real conversation. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.
In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we're using our SynthID technology to watermark non-transient AI-generated audio content from these models, to help safeguard against potential misuse of this technology.
New speech experiences ahead
We're now focused on improving our model's fluency and acoustic quality, and on adding more fine-grained controls for features such as prosody, while exploring how best to combine these advances with other modalities, such as video.
The potential applications of advanced speech generation are enormous, especially when combined with our Gemini family of models. From improving learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice technologies.