Tuesday, April 28, 2026

Local Transcription with Whisper



Photo by the author

# Introduction

Audio-to-text transcription is a common need for developers, whether you’re building a voice-to-text application, analyzing meeting recordings, or adding subtitles to videos. Doing it locally (on your own computer) protects your privacy and avoids recurring cloud costs.

In this article, you’ll learn how to set up a fast, local transcription system using Whisper and its optimized variant, Faster-Whisper. We’ll cover audio pre-processing such as MP3-to-WAV conversion, write a Python script, and look at running on both CPU and GPU.

# What Is Whisper, and Why Run It Locally?

OpenAI Whisper is an automatic speech recognition (ASR) model. It is trained on a large amount of multilingual audio and works well even with background noise or varied accents.
However, the original Whisper can be slow on the CPU and consume a significant amount of memory. This is where optimized variants come in handy.

  • whisper.cpp is written in C++ without heavy dependencies. It is very fast on the CPU, but requires compilation and is less Python-friendly.
  • Faster-Whisper is a re-implementation built on CTranslate2. It runs up to 4 times faster than the original Whisper, uses less RAM, and works seamlessly with Python. In this tutorial we will be using Faster-Whisper.

Both variants run 100% locally; no data leaves your computer.

# Setting up your environment (cross-platform)

This configuration works on Windows, macOS, and Linux with Python 3.8 or later. Create and activate a virtual environment (optional but recommended):

python -m venv whisper_env

Activate the virtual environment on macOS and Linux:

source whisper_env/bin/activate

On Windows:

whisper_env\Scripts\activate

Install Faster-Whisper:

pip install faster-whisper

# Installing audio pre-processing tools

Whisper expects audio in 16 kHz mono WAV format. To convert popular formats (MP3, M4A, OGG, etc.) we need FFmpeg and the Python library pydub.

Install FFmpeg:

  • Windows: download from FFmpeg.org and add it to PATH, or use winget install ffmpeg.
  • macOS: brew install ffmpeg
  • Linux (Ubuntu/Debian): sudo apt install ffmpeg
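Since pydub fails with a cryptic error when FFmpeg is missing, a quick sanity check helps. Here is a minimal sketch using only the standard library (the helper name is my own):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if the ffmpeg executable is on PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found - audio conversion will work.")
    else:
        print("FFmpeg not found - install it and check your PATH.")
```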

Then install pydub:

pip install pydub

# Optional GPU support

If you have an NVIDIA GPU and want faster transcription, install cuBLAS and cuDNN by following the instructions in the Faster-Whisper GPU guide. Without them, the code automatically falls back to the CPU.
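If you want the script to pick the best available device on its own, one approach is to probe CUDA through CTranslate2, the library Faster-Whisper is built on. A sketch, assuming ctranslate2's get_cuda_device_count helper is available (the function name below is otherwise my own):

```python
def pick_device() -> str:
    """Return "cuda" if CTranslate2 can see a CUDA device, else "cpu"."""
    try:
        import ctranslate2
        if ctranslate2.get_cuda_device_count() > 0:
            return "cuda"
    except Exception:
        # ctranslate2 missing or the CUDA probe failed: stay on CPU
        pass
    return "cpu"

device = pick_device()
print(f"Using device: {device}")
```

You can then pass `device=pick_device()` wherever the later examples hard-code "cpu".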

# Audio Preprocessing: Converting non-WAV files

Most audio files you encounter are not in raw WAV format. They use compressed formats (MP3) or container formats (M4A). They must be converted to 16 kHz, mono, PCM WAV before being fed to Whisper.

Below is a Python function that uses pydub (which calls FFmpeg in the background) to perform this conversion.

from pydub import AudioSegment
import os

def convert_to_wav(input_path, output_path=None):
    """
    Convert any audio file (MP3, M4A, OGG, etc.) to WAV (16 kHz, mono).
    If output_path is None, replaces extension with .wav in the same folder.
    """
    if output_path is None:
        base, _ = os.path.splitext(input_path)
        output_path = base + ".wav"

    # Load audio (pydub uses ffmpeg)
    audio = AudioSegment.from_file(input_path)

    # Convert to mono and set sample rate to 16000 Hz
    audio = audio.set_channels(1).set_frame_rate(16000)

    # Export as WAV
    audio.export(output_path, format="wav")
    return output_path

Usage example:

wav_file = convert_to_wav("meeting.mp3")
print(f"Converted to: {wav_file}")
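To double-check that a converted file really is 16 kHz mono, the standard-library wave module can inspect it without any third-party dependencies. A minimal sketch (works for PCM WAV only; the helper name is my own):

```python
import wave

def check_wav_format(path):
    """Return (channels, sample_rate) of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate()
```

After a successful conversion, `check_wav_format(wav_file)` should return (1, 16000).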

# Basic transcription script with Faster-Whisper

Let’s now write a complete Python script that loads the Whisper model, transcribes the WAV file, and prints the result.

from faster_whisper import WhisperModel

def transcribe_audio(wav_path, model_size="base", device="cpu"):
    """
    Transcribe a WAV file (16 kHz mono) using Faster-Whisper.
    model_size: "tiny", "base", "small", "medium", "large-v2", "large-v3"
    device: "cpu" or "cuda" (if GPU is available)
    """
    # Initialize model (downloads automatically on first use)
    model = WhisperModel(model_size, device=device, compute_type="int8")

    # Run transcription
    segments, info = model.transcribe(wav_path, beam_size=5, language="en")

    print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
    print("\nTranscription:")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

    # Return full text if needed
    full_text = " ".join([seg.text for seg in segments])
    return full_text

# Example usage
if __name__ == "__main__":
    text = transcribe_audio("my_recording.wav", model_size="small", device="cpu")

What is happening in the above code?

  • WhisperModel downloads the selected model (e.g. small) to ~/.cache/huggingface/hub on the first run.
  • beam_size=5 balances accuracy and speed. Higher values (e.g. 10) are slower but more accurate.
  • compute_type="int8" uses 8-bit integer math for faster inference. For GPU you can try "float16".
| Device | Speed | Setup complexity | Recommended for |
| --- | --- | --- | --- |
| CPU | Slower (but fine for files under 10 minutes) | None (just install) | Beginners, laptops, small projects |
| GPU (CUDA) | 3-5x faster | Requires NVIDIA drivers, cuBLAS, cuDNN | Long files, batch transcription |

To use the GPU, set device="cuda" in the code. Faster-Whisper automatically detects CUDA if it is installed correctly.

Tip: Even on the CPU, Faster-Whisper is much faster than the original Whisper. For a 10-minute MP3 file, the base model takes about 2 minutes on a modern processor.
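Because each segment carries start and end times, turning the output into subtitles (one of the use cases mentioned at the start) is straightforward. A sketch of an SRT formatter; the helper names are my own, not part of Faster-Whisper:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Faster-Whisper segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

For example, srt_timestamp(75.5) returns "00:01:15,500". Pass the segments from model.transcribe() to segments_to_srt() and write the result to a .srt file.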

# Convert MP3 to Transcript: Complete Example

Here is the full script that will convert any audio file to WAV format and then transcribe it.

import os
from pydub import AudioSegment
from faster_whisper import WhisperModel

def convert_to_wav(input_path):
    """Convert any audio to 16kHz mono WAV."""
    audio = AudioSegment.from_file(input_path)
    audio = audio.set_channels(1).set_frame_rate(16000)
    wav_path = os.path.splitext(input_path)[0] + ".wav"
    audio.export(wav_path, format="wav")
    return wav_path

def transcribe_file(audio_path, model_size="base", device="cpu"):
    # Step 1: Convert if not already WAV
    if not audio_path.lower().endswith(".wav"):
        print(f"Converting {audio_path} to WAV...")
        audio_path = convert_to_wav(audio_path)

    # Step 2: Transcribe
    print(f"Loading model '{model_size}' on {device.upper()}...")
    model = WhisperModel(model_size, device=device, compute_type="int8")
    segments, info = model.transcribe(audio_path, beam_size=5)

    print(f"\nLanguage: {info.language} (prob: {info.language_probability:.2f})")
    print("\nTranscript:")
    for seg in segments:
        print(seg.text, end=" ", flush=True)
    print()  # final newline

if __name__ == "__main__":
    # Example: transcribe an MP3 file
    transcribe_file("interview.mp3", model_size="small", device="cpu")

Save this as transcribe.py and run:

python transcribe.py

The script will download the model once, convert the file and generate a transcript.
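For batch transcription of a whole folder (the use case the GPU row of the table above targets), you can build on transcribe_file. A sketch; the extensions set and the folder name "recordings" are assumptions:

```python
import os

AUDIO_EXTENSIONS = {".mp3", ".m4a", ".ogg", ".wav", ".flac"}

def find_audio_files(folder):
    """Return sorted paths of audio files directly inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS
    )

if __name__ == "__main__":
    for path in find_audio_files("recordings"):
        print(f"Transcribing {path}...")
        # transcribe_file(path, model_size="small", device="cpu")
```

Since the model is loaded inside transcribe_file, for large batches it is worth hoisting the WhisperModel construction out of the loop so it is created only once.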

# Conclusion

You now have a local, fast, and private audio transcription system. Some key takeaways:

  • Faster-Whisper enables near real-time transcription on the CPU and excellent speed on the GPU.
  • Always pre-process audio to 16 kHz mono WAV using pydub and FFmpeg.
  • The model_size parameter trades accuracy for speed – start with "base" or "small".
  • Running locally means no API keys, no data sharing, and no monthly fees.

Try different Whisper model sizes for better accuracy. Add speaker diarization (identifying who spoke when) using libraries such as pyannote.audio. Build a simple web interface with a framework such as Streamlit.

Shittu Olumide is a software engineer and technical writer with a passion for using cutting-edge technology to create compelling narratives, with an eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.
