Speaker Diarization with Pyannote on Vast.ai
Speaker diarization partitions an audio stream into segments according to speaker identity, identifying “who spoke when” in multi-speaker recordings like meetings, podcasts, or interviews. This guide walks through using pyannote.audio for speaker diarization on Vast.ai.
Prerequisites
- A Vast.ai account with credits
- A Hugging Face account with access tokens
- Accepted terms for the gated pyannote models on Hugging Face (the pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 model pages)
When to Use Diarization
Speaker diarization answers “who spoke when”; it does not transcribe what was said. Use diarization when you need to:
- Attribute transcribed text to specific speakers (a sketch of this case follows the list)
- Analyze speaking patterns (talk time, interruptions, turn-taking)
- Split multi-speaker audio into per-speaker segments for downstream processing
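As an illustration of the first use case, the sketch below assigns each transcript segment to the speaker whose diarized turns overlap it most. The `turns` and `segments` lists here are hypothetical stand-ins for diarization and transcription output, not actual pyannote structures:

```python
# Hypothetical diarization output: (start_sec, end_sec, speaker_label)
turns = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.8, "SPEAKER_01"), (9.8, 12.0, "SPEAKER_00")]

# Hypothetical transcription output: (start_sec, end_sec, text)
segments = [(0.5, 3.9, "Welcome to the show."), (4.5, 9.0, "Thanks for having me.")]

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

# Assign each transcript segment to the speaker whose turns overlap it most.
for seg_start, seg_end, text in segments:
    speaker = max(turns, key=lambda t: overlap(seg_start, seg_end, t[0], t[1]))[2]
    print(f"{speaker}: {text}")
```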
Hardware Requirements
Pyannote’s speaker diarization model is efficient and runs on modest hardware:
- GPU: RTX 3060, 4060, or similar
- VRAM: 6-8GB
- System RAM: 8-16GB
- Storage: 10GB minimum
- CUDA: 11.0+
- Python: 3.8+
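Once Jupyter is running on your instance (see below), you can verify the GPU against these numbers with a quick PyTorch check. A minimal sketch; the printed values will vary by instance:

```python
import torch

# Confirm a CUDA-capable GPU is visible to PyTorch.
assert torch.cuda.is_available(), "No CUDA GPU detected"

props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")  # expect ~6-8 GB or more
print(f"CUDA: {torch.version.cuda}")                   # expect 11.0 or newer
```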
Setting Up the Instance
- Go to Vast.ai Templates
- Select the PyTorch (CuDNN Runtime) template
- Filter for an instance with:
- 1 GPU
- 6-8GB VRAM
- 8-16GB system RAM
- 10GB storage
- Rent the instance
- Install the Vast TLS certificate in your browser
- Open Jupyter from the Instances page
Creating a Notebook
- In JupyterLab, click File → New → Notebook
- Select the Python 3 kernel
- Run the following cells in your notebook
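The cells below are a minimal sketch of the workflow, under a few assumptions: the gated pyannote/speaker-diarization-3.1 pipeline is used, HF_TOKEN holds your Hugging Face access token, and audio.wav is a placeholder for your own recording.

```python
# Cell 1: install pyannote.audio and its dependencies
%pip install pyannote.audio
```

```python
# Cell 2: load the pretrained diarization pipeline onto the GPU
import torch
from pyannote.audio import Pipeline

HF_TOKEN = "hf_..."  # placeholder: your Hugging Face access token

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,
)
pipeline.to(torch.device("cuda"))  # run inference on the GPU
```

```python
# Cell 3: diarize an audio file and print the speaker turns
diarization = pipeline("audio.wav")  # placeholder: path to your recording

# Each track is a (segment, track_id, speaker_label) triple.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s  {turn.end:6.1f}s  {speaker}")
```

If loading the pipeline fails with an authorization error, double-check that the model terms have been accepted for both gated repositories listed in the prerequisites.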