
Photo by the author | Canva
# Introduction
Open-source AI is having a significant moment. Thanks to progress in large language models, general machine learning, and now speech technology, open-source models are narrowing the gap with closed systems. One of the most exciting entrants in this space is Microsoft's open-source release, VibeVoice. This model family is designed for natural, expressive, and interactive conversation, competing with the quality of top-tier commercial offerings.
In this article, we will examine VibeVoice, download the model, and run inference in Google Colab using a GPU runtime. We will also troubleshoot common problems that may arise when running the model's demo application.
# Introduction to VibeVoice
VibeVoice is a next-generation text-to-speech (TTS) framework for creating expressive, long-form, multi-speaker audio, such as podcasts and dialogues. Unlike conventional TTS systems, it stands out for its scalability, speaker consistency, and natural turn-taking.
Its core innovation is a pair of continuous acoustic and semantic tokenizers operating at 7.5 Hz, combined with a large language model (Qwen2.5-1.5B) and a diffusion head for generating high-fidelity audio. This design enables up to 90 minutes of speech with up to 4 distinct speakers, surpassing earlier systems.
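To see why the low frame rate matters, here is a quick back-of-the-envelope sketch. The 7.5 Hz rate and the 90-minute limit come from the description above; the 50 Hz comparison rate is an assumed typical value for other neural audio codecs, not a figure from the VibeVoice release:

```python
# Rough token-budget arithmetic for VibeVoice's 7.5 Hz acoustic tokenizer.
# 7.5 Hz and the 90-minute cap come from the model description;
# 50 Hz is an assumed typical codec frame rate for comparison.
TOKENS_PER_SECOND = 7.5
MAX_MINUTES = 90

tokens_for_max_audio = int(MAX_MINUTES * 60 * TOKENS_PER_SECOND)
print(tokens_for_max_audio)  # 40500 tokens for 90 minutes of audio

# At an assumed 50 Hz codec rate, the same audio would need far more tokens:
tokens_at_50hz = int(MAX_MINUTES * 60 * 50)
print(tokens_at_50hz)  # 270000
```

Keeping the whole 90-minute sequence around 40k tokens is what makes it feasible for the 1.5B language model to attend over the entire conversation.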
VibeVoice is available as an open-source model on Hugging Face, with community-maintained code for easy experimentation and use.

Image from VibeVoice
# Getting Started with VibeVoice-1.5B
In this guide, we will learn how to clone the VibeVoice repository and run the demo, providing it with a text file to generate natural-sounding speech. Audio generation takes only about 5 minutes from setup.
// 1. Clone the community repository and install dependencies
First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice), install the required Python packages, and also install the Hugging Face Hub library so we can download the model with its Python API.
Note: Before starting the Colab session, make sure the runtime type is set to T4 GPU.
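Before installing anything, you can confirm that Colab actually gave you a GPU with a small check. This is a minimal sketch: `pick_device` is a hypothetical helper name, and it falls back to CPU if PyTorch is not importable (PyTorch comes preinstalled on Colab):

```python
def pick_device():
    """Return 'cuda' if a CUDA-capable GPU is visible, else 'cpu'."""
    try:
        import torch  # preinstalled on Google Colab
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(pick_device())  # expect 'cuda' on a T4 runtime
```

If this prints `cpu` on Colab, switch the runtime type to a GPU before continuing, since CPU-only inference will be far slower.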
!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice.git /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
// 2. Download the model snapshot from Hugging Face
Download the model repository using the `snapshot_download` API from `huggingface_hub`. It downloads all files from the microsoft/VibeVoice-1.5B repository.
from huggingface_hub import snapshot_download

snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
    local_dir_use_symlinks=False,
)
// 3. Create a transcript with speaker labels
We will create a text file in Google Colab using the `%%writefile` magic command to provide the content. Below is an example conversation between two speakers about KDnuggets.
%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
// 4. Run inference (multiple runs)
Now we will run the Python demo script from the VibeVoice repository. The script requires a model path, a text file, and speaker names.
Run #1: Map Speaker 1 → Alice, Speaker 2 → Frank
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Alice Frank
As a result, you will see the following output. The model will use CUDA to generate the audio, with Frank and Alice as the two speakers. It also prints a generation summary that can be used for analysis.
Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Frank
Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ['Alice', 'Frank']
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
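In the summary above, the Real Time Factor appears to be the generation time divided by the audio duration, so values above 1.0 mean generation runs slower than real time. A quick check against the reported numbers:

```python
# RTF as it appears in the generation summary:
# generation time / audio duration (values > 1.0 are slower than real time).
generation_time = 28.27   # seconds, from the summary above
audio_duration = 15.47    # seconds, from the summary above

rtf = generation_time / audio_duration
print(round(rtf, 2))  # 1.83, matching "RTF (Real Time Factor): 1.83x"
```

So on a T4, this run produced roughly one second of audio for every 1.83 seconds of compute.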
Play the audio in the notebook:
We will now use IPython's `Audio` display function to listen to the generated audio in Colab.
from IPython.display import Audio, display
out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))


It took 28 seconds to generate the audio, and it sounds brilliant, natural, and smooth. I love it!
Let's try again with different voice actors.
Run #2: Try different voices (Mary for Speaker 1, Carter for Speaker 2)
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Mary Carter
The generated audio was even better, with background music at the beginning and a smooth transition between the speakers.
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Carter
Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
Tip: If you are not sure which names are available, the script prints "Available voices:" on startup.
Typical voices include:
en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
# Troubleshooting
// 1. Repository has no demo scripts?
The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos have been removed or are no longer available in the original location. If the official repository lacks the inference examples, check the community mirror or an archive that has preserved the original demos and instructions: https://github.com/vibevoice-community/vibevoice
// 2. Slow generation or CUDA errors in Colab
Check that you are in a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).
// 3. Out-of-memory or overload issues
To minimize the load, you can take a few steps. Start by shortening the input text and reducing the generation length. Consider lowering the audio sampling rate and/or adjusting internal sizes if the script allows it. Set the batch size to 1 and select a smaller model variant.
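One practical way to shorten the input is to split a long transcript into smaller chunks of speaker segments and run the demo script on each chunk separately, then concatenate the resulting WAV files. A minimal sketch, where `split_transcript` is a hypothetical helper (not part of the VibeVoice repo) and the chunk size is an arbitrary choice:

```python
def split_transcript(lines, segments_per_chunk=4):
    """Group 'Speaker N: ...' lines into chunks of at most
    segments_per_chunk segments, so each generation run stays short."""
    chunks, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        if len(current) == segments_per_chunk:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

lines = [
    "Speaker 1: Have you read the latest article on KDnuggets?",
    "Speaker 2: Yes, it's one of the best resources for data science and AI.",
    "Speaker 1: I like how KDnuggets always keeps up with the latest trends.",
    "Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.",
    "Speaker 1: Did you see their tutorials section too?",
]
for i, chunk in enumerate(split_transcript(lines, segments_per_chunk=2)):
    print(f"chunk {i}: {len(chunk)} segments")
```

Each chunk can then be written to its own text file (with `%%writefile` or Python's `open()`) and passed to `inference_from_file.py` as a separate, shorter run.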
// 4. No output folder or missing output file
The script usually prints the final output path in the console; scroll up to find the exact location, or search for the file directly:
!find /content -name "*generated.wav"
// 5. Voice name not found?
Copy the exact names listed under "Available voices". Use the short nicknames (Alice, Frank, Mary, Carter) shown in the demo; the script matches them to the corresponding .wav assets.
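The voice files in `demo/voices` follow a `<lang>-<Name>_<gender>[_bgm].wav` pattern, and the nickname you pass is matched against the `<Name>` part. A hedged sketch of that mapping, where `nickname_from_file` is a hypothetical helper for illustration, not the repo's actual code:

```python
def nickname_from_file(filename):
    """Extract the nickname from a voice file name,
    e.g. 'en-Alice_woman.wav' -> 'Alice'."""
    stem = filename.rsplit(".", 1)[0]   # drop the .wav extension
    after_lang = stem.split("-", 1)[1]  # drop the 'en-' / 'zh-' prefix
    return after_lang.split("_", 1)[0]  # keep the name before '_woman' etc.

voices = ["en-Alice_woman.wav", "en-Frank_man.wav", "en-Mary_woman_bgm.wav"]
print([nickname_from_file(v) for v in voices])  # ['Alice', 'Frank', 'Mary']
```

If a name fails to resolve, list the actual files in `/content/VibeVoice/demo/voices` and compare them against what you passed to `--speaker_names`.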
# Final thoughts
For many projects, I would choose an open-source stack such as VibeVoice over paid APIs for several compelling reasons. First, it is easy to integrate and offers flexibility for customization, making it suitable for a wide range of applications. In addition, it is surprisingly lightweight in its GPU requirements, which can be a significant advantage in resource-constrained environments.
VibeVoice is open source, which means that in the future you can expect improved frameworks that enable faster generation, even on CPUs.
Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he focuses on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a master's degree in technology management and a bachelor's degree in telecommunications engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
