
Photo by the author | Canva
# Introduction
Open-source AI is having a significant moment. Thanks to progress in large language models, general machine learning, and now speech technology, open-source models are narrowing the gap with closed systems. One of the most exciting entrants in this space is Microsoft's open-source release, VibeVoice. This model family is designed for natural, expressive, and interactive conversation, competing with the quality of top-tier commercial offerings.
In this article, we will examine VibeVoice, download the model, and run inference in Google Colab using a GPU runtime. We will also troubleshoot common problems that may arise when running the model's demo application.
# Introduction to VibeVoice
VibeVoice is a next-generation text-to-speech (TTS) framework for creating expressive, long-form, multi-speaker audio, such as podcasts and dialogues. Unlike conventional TTS systems, it stands out for its scalability, speaker consistency, and natural turn-taking.
Its core innovation is a pair of continuous acoustic and semantic tokenizers operating at 7.5 Hz, combined with a large language model (Qwen2.5-1.5B) and a diffusion head for generating high-fidelity audio. This design enables up to 90 minutes of speech with up to 4 distinct speakers, surpassing earlier systems.
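To see why the low frame rate matters, here is a quick back-of-the-envelope sketch. The 7.5 Hz rate and the 90-minute limit come from the description above; the 50 Hz comparison rate is an assumed typical value for other neural audio codecs, not a figure from the VibeVoice release:

```python
# Rough token-budget arithmetic for VibeVoice's 7.5 Hz acoustic tokenizer.
# 7.5 Hz and the 90-minute cap come from the model description;
# 50 Hz is an assumed typical codec frame rate for comparison.
TOKENS_PER_SECOND = 7.5
MAX_MINUTES = 90

tokens_for_max_audio = int(MAX_MINUTES * 60 * TOKENS_PER_SECOND)
print(tokens_for_max_audio)  # 40500 tokens for 90 minutes of audio

# At an assumed 50 Hz codec rate, the same audio would need far more tokens:
tokens_at_50hz = int(MAX_MINUTES * 60 * 50)
print(tokens_at_50hz)  # 270000
```

Keeping the whole 90-minute sequence around 40k tokens is what makes it feasible for the 1.5B language model to attend over the entire conversation.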
VibeVoice is available as an open-source model on Hugging Face, with community-maintained code for easy experimentation and use.

Image from VibeVoice
# Getting Started with VibeVoice-1.5B
In this guide, we will learn how to clone the VibeVoice repository and run the demo, providing it with a text file to generate natural-sounding speech. Audio generation takes only about 5 minutes from setup.
// 1. Clone the community repository and install dependencies
First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice), install the required Python packages, and also install the Hugging Face Hub library so we can download the model with its Python API.
Note: Before starting the Colab session, make sure the runtime type is set to T4 GPU.
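Before installing anything, you can confirm that Colab actually gave you a GPU with a small check. This is a minimal sketch: `pick_device` is a hypothetical helper name, and it falls back to CPU if PyTorch is not importable (PyTorch comes preinstalled on Colab):

```python
def pick_device():
    """Return 'cuda' if a CUDA-capable GPU is visible, else 'cpu'."""
    try:
        import torch  # preinstalled on Google Colab
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(pick_device())  # expect 'cuda' on a T4 runtime
```

If this prints `cpu` on Colab, switch the runtime type to a GPU before continuing, since CPU-only inference will be far slower.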
!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice.git /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
// 2. Download the model snapshot from Hugging Face
Download the model repository using the `snapshot_download` API from `huggingface_hub`. It downloads all files from the microsoft/VibeVoice-1.5B repository.
from huggingface_hub import snapshot_download

snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
    local_dir_use_symlinks=False,
)
// 3. Create a transcript with speaker labels
We will create a text file in Google Colab using the `%%writefile` magic command to provide the content. Below is an example conversation between two speakers about KDnuggets.
%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
// 4. Run inference (multiple runs)
Now we will run the Python demo script from the VibeVoice repository. The script requires a model path, a text file, and speaker names.
Run #1: Map Speaker 1 → Alice, Speaker 2 → Frank
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Alice Frank
As a result, you will see the following output. The model will use CUDA to generate the audio, with Frank and Alice as the two speakers. It also prints a generation summary that can be used for analysis.
Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Frank
Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ['Alice', 'Frank']
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
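In the summary above, the Real Time Factor appears to be the generation time divided by the audio duration, so values above 1.0 mean generation runs slower than real time. A quick check against the reported numbers:

```python
# RTF as it appears in the generation summary:
# generation time / audio duration (values > 1.0 are slower than real time).
generation_time = 28.27   # seconds, from the summary above
audio_duration = 15.47    # seconds, from the summary above

rtf = generation_time / audio_duration
print(round(rtf, 2))  # 1.83, matching "RTF (Real Time Factor): 1.83x"
```

So on a T4, this run produced roughly one second of audio for every 1.83 seconds of compute.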
Play the audio in the notebook:
We will now use IPython's `Audio` display function to listen to the generated audio in Colab.
from IPython.display import Audio, display
out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))


It took 28 seconds to generate the audio, and it sounds brilliant, natural, and smooth. I love it!
Let's try again with different voice actors.
Run #2: Try different voices (Mary for Speaker 1, Carter for Speaker 2)
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Mary Carter
The generated audio was even better, with background music at the beginning and a smooth transition between the speakers.
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Carter
Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
Tip: If you are not sure which names are available, the script prints "Available voices:" on startup.
Typical voices include:
en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
# Troubleshooting
// 1. Repository has no demo scripts?
The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos have been removed or are no longer available in the original location. If the official repository lacks the inference examples, check the community mirror or an archive that has preserved the original demos and instructions: https://github.com/vibevoice-community/vibevoice
// 2. Slow generation or CUDA errors in Colab
Check that you are in a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).
// 3. Out-of-memory or overload issues
To minimize the load, you can take a few steps. Start by shortening the input text and reducing the generation length. Consider lowering the audio sampling rate and/or adjusting internal sizes if the script allows it. Set the batch size to 1 and select a smaller model variant.
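One practical way to shorten the input is to split a long transcript into smaller chunks of speaker segments and run the demo script on each chunk separately, then concatenate the resulting WAV files. A minimal sketch, where `split_transcript` is a hypothetical helper (not part of the VibeVoice repo) and the chunk size is an arbitrary choice:

```python
def split_transcript(lines, segments_per_chunk=4):
    """Group 'Speaker N: ...' lines into chunks of at most
    segments_per_chunk segments, so each generation run stays short."""
    chunks, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        if len(current) == segments_per_chunk:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

lines = [
    "Speaker 1: Have you read the latest article on KDnuggets?",
    "Speaker 2: Yes, it's one of the best resources for data science and AI.",
    "Speaker 1: I like how KDnuggets always keeps up with the latest trends.",
    "Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.",
    "Speaker 1: Did you see their tutorials section too?",
]
for i, chunk in enumerate(split_transcript(lines, segments_per_chunk=2)):
    print(f"chunk {i}: {len(chunk)} segments")
```

Each chunk can then be written to its own text file (with `%%writefile` or Python's `open()`) and passed to `inference_from_file.py` as a separate, shorter run.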
// 4. No output folder or missing output file
The script usually prints the final output path in the console; scroll up to find the exact location, or search for the file directly:
!find /content -name "*generated.wav"
// 5. Voice name not found?
Copy the exact names listed under "Available voices". Use the short nicknames (Alice, Frank, Mary, Carter) shown in the demo; the script matches them to the corresponding .wav assets.
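The voice files in `demo/voices` follow a `<lang>-<Name>_<gender>[_bgm].wav` pattern, and the nickname you pass is matched against the `<Name>` part. A hedged sketch of that mapping, where `nickname_from_file` is a hypothetical helper for illustration, not the repo's actual code:

```python
def nickname_from_file(filename):
    """Extract the nickname from a voice file name,
    e.g. 'en-Alice_woman.wav' -> 'Alice'."""
    stem = filename.rsplit(".", 1)[0]   # drop the .wav extension
    after_lang = stem.split("-", 1)[1]  # drop the 'en-' / 'zh-' prefix
    return after_lang.split("_", 1)[0]  # keep the name before '_woman' etc.

voices = ["en-Alice_woman.wav", "en-Frank_man.wav", "en-Mary_woman_bgm.wav"]
print([nickname_from_file(v) for v in voices])  # ['Alice', 'Frank', 'Mary']
```

If a name fails to resolve, list the actual files in `/content/VibeVoice/demo/voices` and compare them against what you passed to `--speaker_names`.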
# Final thoughts
For many projects, I would choose an open-source stack such as VibeVoice over paid APIs for several compelling reasons. First, it is easy to integrate and offers flexibility for customization, making it suitable for a wide range of applications. In addition, it is surprisingly lightweight in its GPU requirements, which can be a significant advantage in resource-constrained environments.
VibeVoice is open source, which means that in the future you can expect improved frameworks that enable faster generation, even on CPUs.
Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he focuses on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a master's degree in technology management and a bachelor's degree in telecommunications engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
