Speaker Diarization with Pyannote on Vast.ai
Speaker diarization partitions an audio stream into segments according to speaker identity, identifying "who spoke when" in multi-speaker recordings such as meetings, podcasts, or interviews.
This guide walks through using pyannote.audio for speaker diarization on Vast.ai.
Prerequisites
- A Vast.ai account with credits
- A Hugging Face account with access tokens
- Accepted model terms for pyannote/speaker-diarization-3.1 on its Hugging Face model page
When to Use Diarization
Speaker diarization answers "who spoke when"; it doesn't transcribe what was said. Use diarization when you need to:
- Attribute transcribed text to specific speakers
- Analyze speaking patterns (talk time, interruptions, turn-taking)
- Split multi-speaker audio into per-speaker segments for downstream processing
For full transcription with speaker labels, combine diarization with a speech-to-text model like Whisper: run diarization first to get speaker timestamps, then transcribe each segment.
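As a sketch of that merge step (the segment lists here are hypothetical stand-ins for Whisper and pyannote output), each transcribed chunk can be assigned to the speaker whose diarization segment overlaps it the most:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute_speakers(transcript_segments, diarization_segments):
    """Label each transcribed segment with the speaker who overlaps it most.

    transcript_segments: list of (start, end, text)
    diarization_segments: list of (start, end, speaker)
    """
    labeled = []
    for t_start, t_end, text in transcript_segments:
        best = max(
            diarization_segments,
            key=lambda d: overlap(t_start, t_end, d[0], d[1]),
            default=(None, None, "UNKNOWN"),
        )
        labeled.append((best[2], text))
    return labeled

# Hypothetical transcription and diarization results
transcript = [(0.0, 2.5, "Hello everyone."), (2.6, 5.0, "Thanks, let's begin.")]
diarization = [(0.0, 2.4, "SPEAKER_00"), (2.5, 5.1, "SPEAKER_01")]
print(attribute_speakers(transcript, diarization))
# [('SPEAKER_00', 'Hello everyone.'), ('SPEAKER_01', "Thanks, let's begin.")]
```

Maximum-overlap attribution is a simple heuristic; segments that straddle a speaker change may need to be split at the diarization boundary instead.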
Hardware Requirements
Pyannote’s speaker diarization model is efficient and runs on modest hardware:
- GPU: RTX 3060, 4060, or similar
- VRAM: 6-8GB
- System RAM: 8-16GB
- Storage: 10GB minimum
- CUDA: 11.0+
- Python: 3.8+
Setting Up the Instance
- Go to Vast.ai Templates
- Select the PyTorch (CuDNN Runtime) template
- Filter for an instance with:
- 1 GPU
- 6-8GB VRAM
- 8-16GB system RAM
- 10GB storage
- Rent the instance
- Install the Vast TLS certificate in your browser
- Open Jupyter from the Instances page
Creating a Notebook
- In JupyterLab, click File → New → Notebook
- Select the Python 3 kernel
- Run the following cells in your notebook
Installing Dependencies
Install the required Python packages (prefix shell commands with `!` when running them in a notebook cell):
```bash
pip install pyannote.audio pydub librosa datasets
```
Install FFmpeg for audio processing:
```bash
apt-get update && apt-get install -y ffmpeg
```
Downloading Test Data
This example uses a sample from the AMI Meeting Corpus dataset:
```python
from datasets import load_dataset
import os
import soundfile as sf

os.makedirs("ami_samples", exist_ok=True)

# Stream the dataset so only the requested sample is downloaded
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)
samples = list(dataset.take(1))

for i, sample in enumerate(samples):
    audio = sample["audio"]
    audio_array = audio["array"]
    sampling_rate = audio["sampling_rate"]
    duration = len(audio_array) / sampling_rate
    output_path = f"ami_samples/sample_{i}.wav"
    sf.write(output_path, audio_array, sampling_rate)
    print(f"Saved {output_path} - Duration: {duration:.2f} seconds")
```
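If you'd rather not stream the dataset, a short test clip can be synthesized with the standard library (a sketch; the two tones only stand in for two "speakers", and real speech is needed for meaningful diarization results):

```python
import math
import os
import struct
import wave

def write_tone_wav(path, freqs, seconds_each=2.0, rate=16000):
    """Write a mono 16-bit WAV made of consecutive sine tones."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        for freq in freqs:
            n = int(seconds_each * rate)
            frames = b"".join(
                struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * i / rate)))
                for i in range(n)
            )
            wav.writeframes(frames)

# Two 2-second tones -> a 4-second clip at 16 kHz
write_tone_wav("ami_samples/tone_test.wav", freqs=[220, 440])
print("wrote ami_samples/tone_test.wav")
```

16 kHz mono matches the sample rate pyannote models are trained on, so no resampling step is needed.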
Running Speaker Diarization
Initialize the Pipeline
Load the pretrained diarization model and move it to GPU for faster processing:
```python
import torch
from pyannote.audio import Pipeline

HF_TOKEN = "your-huggingface-token"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)
pipeline = pipeline.to(device)
```
Process Audio
Run diarization on an audio file. The pipeline returns timestamped speaker segments:
```python
audio_file = "./ami_samples/sample_0.wav"
print(f"Processing {audio_file} on {device}")

output = pipeline(audio_file)

print("Voice activity segments:")
for segment, _, speaker in output.itertracks(yield_label=True):
    print(f"{segment.start:.2f} --> {segment.end:.2f} ({segment.duration:.2f}s) Speaker: {speaker}")
```
Example output:
```
Processing ./ami_samples/sample_0.wav on cuda
Voice activity segments:
18.36 --> 18.42 (0.07s) Speaker: SPEAKER_03
23.01 --> 25.63 (2.62s) Speaker: SPEAKER_03
27.08 --> 27.64 (0.56s) Speaker: SPEAKER_05
```
Analyzing Results
Calculate Speaking Time per Speaker
```python
for speaker in output.labels():
    speaking_time = output.label_duration(speaker)
    print(f"Speaker {speaker}: {speaking_time:.2f}s")
```
Example output:
```
Speaker SPEAKER_00: 558.98s
Speaker SPEAKER_01: 18.98s
Speaker SPEAKER_03: 469.68s
Speaker SPEAKER_04: 698.02s
```
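`label_duration` is just the sum of segment durations for a given label. On plain (start, end, speaker) tuples, the same aggregation looks like this (a stdlib-only sketch, independent of the pyannote API):

```python
from collections import defaultdict

def speaking_time(segments):
    """Total seconds per speaker from (start, end, speaker) tuples."""
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    # Round to avoid floating-point noise in the reported totals
    return {speaker: round(total, 2) for speaker, total in totals.items()}

# Segments from the example output above
segments = [
    (18.36, 18.42, "SPEAKER_03"),
    (23.01, 25.63, "SPEAKER_03"),
    (27.08, 27.64, "SPEAKER_05"),
]
print(speaking_time(segments))
# {'SPEAKER_03': 2.68, 'SPEAKER_05': 0.56}
```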
Detect Overlapping Speech
```python
overlap = output.get_overlap()
print(f"Overlapping speech regions: {overlap}")
```
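`get_overlap()` returns the regions where two or more speakers are active at once. The underlying idea, sketched on plain (start, end, speaker) tuples without the pyannote API, is a pairwise interval intersection:

```python
def overlapping_regions(segments):
    """Return (start, end) intervals where segments from different speakers intersect."""
    regions = []
    for i, (s1, e1, sp1) in enumerate(segments):
        for s2, e2, sp2 in segments[i + 1:]:
            if sp1 != sp2:
                start, end = max(s1, s2), min(e1, e2)
                if start < end:  # non-empty intersection
                    regions.append((start, end))
    return regions

# Hypothetical segments: SPEAKER_01 starts before SPEAKER_00 finishes
segments = [
    (0.0, 4.0, "SPEAKER_00"),
    (3.5, 6.0, "SPEAKER_01"),
    (7.0, 8.0, "SPEAKER_00"),
]
print(overlapping_regions(segments))
# [(3.5, 4.0)]
```

Overlap regions are a useful proxy for interruptions when analyzing turn-taking behaviour.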
Filter by Speaker
```python
speaker = "SPEAKER_00"
speaker_turns = output.label_timeline(speaker)

print(f"Speaker {speaker} speaks at:")
for turn in speaker_turns:
    print(turn)
```
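From a speaker's timeline it is straightforward to derive turn statistics such as turn count and average turn length. A stdlib-only sketch on plain (start, end) tuples:

```python
def turn_stats(turns):
    """Turn count, total speaking time, and mean turn length from (start, end) tuples."""
    durations = [end - start for start, end in turns]
    count = len(durations)
    total = sum(durations)
    mean = total / count if count else 0.0
    return {"turns": count, "total_s": round(total, 2), "mean_s": round(mean, 2)}

# Turns taken from the example output above
print(turn_stats([(18.36, 18.42), (23.01, 25.63)]))
# {'turns': 2, 'total_s': 2.68, 'mean_s': 1.34}
```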
Splitting Audio by Speaker Segments
This utility function splits the original audio into separate files for each speaker segment, useful for downstream processing like transcription:
```python
import os
import shutil

from pydub import AudioSegment

def split_audio_by_segments(audio_path, diarization_output, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on diarization output.

    Parameters:
        audio_path: Path to the input audio file
        diarization_output: Pyannote diarization Annotation object
        output_dir: Directory to save the output segments
    """
    # Start from a clean output directory
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir, exist_ok=True)

    audio = AudioSegment.from_file(audio_path)

    for i, (segment, _, speaker) in enumerate(diarization_output.itertracks(yield_label=True)):
        # pydub slices audio in milliseconds
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)
        segment_audio = audio[start_ms:end_ms]

        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(
            output_dir,
            f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms_{speaker}{ext}"
        )
        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved: {output_path}")

split_audio_by_segments(audio_file, output)
```
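The segment filenames encode the index, time range, and speaker, so downstream tools can recover that metadata without re-running diarization. A sketch of parsing it back out (the regex assumes the naming scheme produced by the function above):

```python
import re

# Matches names like "sample_0_segment_0001_00018360ms-00018420ms_SPEAKER_03.wav"
SEGMENT_RE = re.compile(
    r"_segment_(?P<idx>\d{4})_(?P<start>\d{8})ms-(?P<end>\d{8})ms_(?P<speaker>SPEAKER_\d+)\.\w+$"
)

def parse_segment_filename(filename):
    """Extract (index, start_seconds, end_seconds, speaker) from a segment filename."""
    m = SEGMENT_RE.search(filename)
    if m is None:
        raise ValueError(f"not a segment filename: {filename}")
    return (
        int(m.group("idx")),
        int(m.group("start")) / 1000.0,
        int(m.group("end")) / 1000.0,
        m.group("speaker"),
    )

print(parse_segment_filename("sample_0_segment_0001_00018360ms-00018420ms_SPEAKER_03.wav"))
# (1, 18.36, 18.42, 'SPEAKER_03')
```

This makes it easy, for example, to group segment files by speaker before sending them to a transcription model.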
Playing Audio Segments
Verify results in Jupyter:
```python
import librosa
from IPython.display import Audio, display

def play_audio(file_path, sr=None):
    y, sr = librosa.load(file_path, sr=sr)
    display(Audio(data=y, rate=sr))

# Play a segment
play_audio("output_segments/sample_0_segment_0001_00018360ms-00018420ms_SPEAKER_03.wav")
```
Additional Resources