
Speaker Diarization with Pyannote on Vast.ai

Speaker diarization partitions an audio stream into segments according to speaker identity, identifying “who spoke when” in multi-speaker recordings such as meetings, podcasts, or interviews. This guide walks through running speaker diarization with pyannote.audio on Vast.ai.

Prerequisites

When to Use Diarization

Speaker diarization answers “who spoke when”; it does not transcribe what was said. Use diarization when you need to:
  • Attribute transcribed text to specific speakers
  • Analyze speaking patterns (talk time, interruptions, turn-taking)
  • Split multi-speaker audio into per-speaker segments for downstream processing
For full transcription with speaker labels, combine diarization with a speech-to-text model like Whisper: run diarization first to get speaker timestamps, then transcribe each segment, as sketched below.
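A rough sketch of that pattern, assuming openai-whisper is installed separately (it is not part of this guide's setup) and that pipeline is the diarization pipeline loaded later in this guide; meeting.wav is a placeholder file name:
import whisper
import librosa

asr = whisper.load_model("base")
diarization = pipeline("meeting.wav")  # placeholder file; see "Initialize the Pipeline" below

# Whisper accepts a 16 kHz float32 waveform, so slice each speaker turn out of one
audio, sr = librosa.load("meeting.wav", sr=16000)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    clip = audio[int(turn.start * sr):int(turn.end * sr)]
    text = asr.transcribe(clip)["text"].strip()
    print(f"[{speaker}] {text}")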

Hardware Requirements

Pyannote’s speaker diarization model is efficient and runs on modest hardware:
  • GPU: RTX 3060, 4060, or similar
  • VRAM: 6-8GB
  • System RAM: 8-16GB
  • Storage: 10GB minimum
  • CUDA: 11.0+
  • Python: 3.8+

Setting Up the Instance

  1. Go to Vast.ai Templates
  2. Select the PyTorch (CuDNN Runtime) template
  3. Filter for an instance with:
    • 1 GPU
    • 6-8GB VRAM
    • 8-16GB system RAM
    • 10GB storage
  4. Rent the instance
  5. Install the Vast TLS certificate in your browser
  6. Open Jupyter from your instances

Creating a Notebook

  1. In JupyterLab, click File → New → Notebook
  2. Select the Python 3 kernel
  3. Run the following cells in your notebook

Installing Dependencies

Install the required Python packages:
!pip install pyannote.audio pydub librosa soundfile datasets
Install FFmpeg for audio processing:
!apt-get update && apt-get install -y ffmpeg
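Optionally, verify that PyTorch can see the GPU before loading any models:
import torch
print(torch.__version__, torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))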

Downloading Test Data

This example uses a sample from the AMI Meeting Corpus dataset:
from datasets import load_dataset
import os
import soundfile as sf

os.makedirs("ami_samples", exist_ok=True)

# Stream the dataset so only the requested samples are downloaded
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)

# Take a single meeting recording for this walkthrough
samples = list(dataset.take(1))

for i, sample in enumerate(samples):
    audio = sample["audio"]
    audio_array = audio["array"]
    sampling_rate = audio["sampling_rate"]
    duration = len(audio_array) / sampling_rate

    output_path = f"ami_samples/sample_{i}.wav"
    sf.write(output_path, audio_array, sampling_rate)

    print(f"Saved {output_path} - Duration: {duration:.2f} seconds")

Running Speaker Diarization

Initialize the Pipeline

Load the pretrained diarization pipeline and move it to the GPU for faster processing. The pyannote/speaker-diarization-3.1 model is gated on Hugging Face, so accept its user conditions on the model page and supply an access token:
import torch
from pyannote.audio import Pipeline

HF_TOKEN = "your-huggingface-token"  # create one at huggingface.co/settings/tokens

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)
pipeline = pipeline.to(device)

Process Audio

Run diarization on an audio file. The pipeline returns timestamped speaker segments:
audio_file = "./ami_samples/sample_0.wav"
print(f"Processing {audio_file} on {device}")
output = pipeline(audio_file)

print("Voice activity segments:")
for segment, _, speaker in output.itertracks(yield_label=True):
    print(f"{segment.start:.2f} --> {segment.end:.2f} ({segment.duration:.2f}s) Speaker: {speaker}")
Example output:
Processing ./ami_samples/sample_0.wav on cuda
Speaker segments:
18.36 --> 18.42 (0.07s) Speaker: SPEAKER_03
23.01 --> 25.63 (2.62s) Speaker: SPEAKER_03
27.08 --> 27.64 (0.56s) Speaker: SPEAKER_05
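If you already know how many people are in the recording, you can pass that as a hint; the pipeline also accepts lower and upper bounds:
# Exact speaker count, if known
output = pipeline(audio_file, num_speakers=4)

# Or bound the search instead
output = pipeline(audio_file, min_speakers=2, max_speakers=6)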

Analyzing Results

Calculate Speaking Time per Speaker

Sum the total time attributed to each speaker label:
for speaker in output.labels():
    speaking_time = output.label_duration(speaker)
    print(f"Speaker {speaker}: {speaking_time:.2f}s")
Example output:
Speaker SPEAKER_00: 558.98s
Speaker SPEAKER_01: 18.98s
Speaker SPEAKER_03: 469.68s
Speaker SPEAKER_04: 698.02s
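To express these as shares of total speech, divide by the summed attributed time (a small sketch building on the same Annotation API):
total = sum(output.label_duration(s) for s in output.labels())
for speaker in sorted(output.labels(), key=output.label_duration, reverse=True):
    share = 100 * output.label_duration(speaker) / total
    print(f"{speaker}: {share:.1f}% of attributed speech")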

Detect Overlapping Speech

Pyannote can also report regions where more than one speaker is talking at once:
overlap = output.get_overlap()
print(f"Overlapping speech regions: {overlap}")

Filter by Speaker

speaker = "SPEAKER_00"
speaker_turns = output.label_timeline(speaker)
print(f"Speaker {speaker} speaks at:")
for turn in speaker_turns:
    print(turn)
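To hand the results to other tools, you can also save them in the standard RTTM format using pyannote's built-in writer:
# RTTM is the common interchange format for diarization results
with open("sample_0.rttm", "w") as rttm:
    output.write_rttm(rttm)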

Extracting Speaker Segments

This utility function splits the original audio into separate files for each speaker segment, useful for downstream processing like transcription:
import os
import shutil
from pydub import AudioSegment

def split_audio_by_segments(audio_path, diarization_output, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on diarization output.

    Parameters:
        audio_path: Path to the input audio file
        diarization_output: Pyannote diarization Annotation object
        output_dir: Directory to save the output segments
    """
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir, exist_ok=True)

    audio = AudioSegment.from_file(audio_path)

    for i, (segment, _, speaker) in enumerate(diarization_output.itertracks(yield_label=True)):
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)
        segment_audio = audio[start_ms:end_ms]

        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(
            output_dir,
            f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms_{speaker}{ext}"
        )

        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved: {output_path}")

split_audio_by_segments(audio_file, output)
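A variation on the same idea: concatenate all of one speaker's turns into a single file (a sketch using the same pydub primitives; the speaker label and output name are examples):
from pydub import AudioSegment

def extract_speaker_audio(audio_path, diarization_output, speaker, output_path):
    """Concatenate every turn of one speaker into a single WAV file."""
    audio = AudioSegment.from_file(audio_path)
    combined = AudioSegment.empty()
    for segment in diarization_output.label_timeline(speaker):
        combined += audio[int(segment.start * 1000):int(segment.end * 1000)]
    combined.export(output_path, format="wav")
    print(f"Saved {output_path} ({len(combined) / 1000:.2f}s)")

extract_speaker_audio(audio_file, output, "SPEAKER_00", "speaker_00_all.wav")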

Playing Audio Segments

Listen to extracted segments directly in Jupyter to spot-check the results:
import librosa
from IPython.display import Audio, display

def play_audio(file_path, sr=None):
    y, sr = librosa.load(file_path, sr=sr)
    display(Audio(data=y, rate=sr))

# Play a segment
play_audio("output_segments/sample_0_segment_0001_00018360ms-00018420ms_SPEAKER_03.wav")

Additional Resources