Deploy LLMs with dstack and vLLM on Vast.ai
dstack is an open-source GPU orchestration platform that simplifies deploying AI workloads across cloud providers. This guide shows you how to use dstack with Vast.ai as the backend to deploy language models using vLLM, with automated provisioning and cost controls.
Why Use dstack with Vast.ai?
- Simplified Deployment: Define your model configuration in YAML, and dstack handles instance provisioning
- Cost Controls: Set maximum hourly price limits and dstack finds the best available instances
- OpenAI-Compatible API: vLLM provides a standard API that works with existing tools and SDKs
- Automatic Proxy: dstack proxies requests to your service, handling authentication automatically
Prerequisites
- A Vast.ai account with credits (Sign up here)
- Your Vast.ai API key (from Account Settings)
- Python 3.11 (dstack has compatibility issues with Python 3.14)
Hardware Requirements
This guide uses Qwen3-30B-A3B as an example. It’s a Mixture-of-Experts model with 30.5B total parameters.
- VRAM Required: ~57GB for model weights + KV cache
- Recommended GPU: H100 80GB or A100 80GB
Always check the model card on Hugging Face for VRAM requirements before deploying. A rough estimate: model parameters × 2 bytes for BF16 precision.
Setup
Step 1: Create Virtual Environment and Install dstack
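A minimal sketch, assuming `python3.11` is available on your PATH; the package name and extras follow dstack's published pip instructions:

```bash
# Create and activate a Python 3.11 virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install the dstack server and CLI
pip install -U "dstack[all]"
```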
Step 2: Configure dstack Server
Create the server configuration directory and the file `~/.dstack/server/config.yml`:
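A minimal configuration sketch following dstack's backend-config layout; the project name `main` is dstack's default, and you should verify the exact keys against the dstack docs for your installed version:

```yaml
projects:
  - name: main
    backends:
      - type: vastai
        creds:
          type: api_key
          api_key: YOUR_VASTAI_API_KEY
```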
Replace YOUR_VASTAI_API_KEY with your actual Vast.ai API key.
Step 3: Start dstack Server
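Run the server from the activated virtual environment; by default it listens on http://127.0.0.1:3000:

```bash
# Starts the dstack server and prints the server URL and an admin token
dstack server
```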
Save the admin token from the output. You’ll need it for CLI access and API authentication.
Step 4: Configure CLI Access
In a new terminal, configure the CLI to connect to your dstack server:
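For example, using the URL and admin token printed by the server; the flags below follow dstack's `dstack config` command, so check them against your installed version:

```bash
dstack config --url http://127.0.0.1:3000 \
  --project main \
  --token YOUR_ADMIN_TOKEN
```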
Deploy a Model Service
Step 1: Create Service Configuration
Create `serve-qwen.dstack.yml`:
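A sketch of the configuration explained below; the run name `serve-qwen` and the vLLM flags (such as `--max-model-len`) are assumptions you may need to adjust for your GPU:

```yaml
type: service
name: serve-qwen

python: "3.11"

commands:
  - pip install vllm
  - vllm serve Qwen/Qwen3-30B-A3B --max-model-len 8192

port: 8000

resources:
  gpu: 80GB

max_price: 2.50
```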
- `type: service`: creates a long-running service with an HTTP endpoint
- `python: "3.11"`: uses Python 3.11 for compatibility
- `commands`: installs vLLM and starts the model server
- `port: 8000`: the port vLLM serves on
- `resources.gpu: 80GB`: minimum GPU memory required
- `max_price: 2.50`: maximum hourly cost in USD
Step 2: Deploy the Service
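Apply the configuration; dstack lists the matching offers within your resource and price constraints and asks for confirmation before provisioning:

```bash
dstack apply -f serve-qwen.dstack.yml
```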
Step 3: Monitor Deployment
Check deployment status:
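A quick sketch, assuming the run is named `serve-qwen` as in the configuration above:

```bash
# List runs and their status (provisioning, running, etc.)
dstack ps

# Stream logs from the service, including vLLM startup output
dstack logs serve-qwen
```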
Using the API
dstack automatically proxies requests to your service through the dstack server.
Chat Completions with cURL
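A sketch assuming the dstack server runs locally on port 3000, the project is `main`, and the run is named `serve-qwen`; the `/proxy/services/<project>/<run>/` path follows dstack's service-proxy convention, so confirm it for your version. The admin token is used for authentication:

```bash
curl http://127.0.0.1:3000/proxy/services/main/serve-qwen/v1/chat/completions \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128
  }'
```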
Python Integration
Using the OpenAI Python SDK:
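A sketch using the `openai` package pointed at the dstack proxy URL from the cURL example, with the same assumptions about host, project, and run name:

```python
from openai import OpenAI

# Point the SDK at the dstack proxy; the dstack admin token serves as the API key
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/services/main/serve-qwen/v1",
    api_key="YOUR_ADMIN_TOKEN",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
)
print(response.choices[0].message.content)
```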
Streaming Responses
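Streaming works the same way; pass `stream=True` and iterate over the chunks, reusing the client from the previous example:

```python
stream = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the reply
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```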
Cost Management
The `max_price` setting in your configuration caps your hourly cost. dstack will only provision instances at or below this price.
Managing Services
Stop a Service
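Stop the run by name; this shuts down the service and releases the instance:

```bash
# The run name matches the "name" field in the service configuration
dstack stop serve-qwen
```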
Useful Commands
| Command | Description |
|---|---|
| `dstack ps` | List running services |
| `dstack logs <name>` | View service logs |
| `dstack stop <name>` | Stop a service |
| `dstack apply -f <file> -y` | Deploy without confirmation |
Deploying Other Models
To deploy a different model, modify the configuration file (see the sketch after this list):
- Check the model’s VRAM requirements on Hugging Face
- Set the appropriate GPU memory in `resources.gpu`
- Adjust `max_price` based on the GPU tier needed
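For example, a smaller model such as Qwen/Qwen2.5-7B-Instruct fits on a 24GB GPU; the figures below are rough assumptions to adapt, not tested values:

```yaml
type: service
name: serve-qwen-7b

python: "3.11"

commands:
  - pip install vllm
  - vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192

port: 8000

resources:
  gpu: 24GB

max_price: 0.50
```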