Prerequisites
Before you begin, make sure you have:
- Vast.ai Account: Sign up at cloud.vast.ai and add credits to your account
- API Key: Generate an API key from your account settings
- HuggingFace Token: Create a HuggingFace account and generate a read-access token for gated models
Configuration
Install the Vast SDK
Install the SDK that you’ll use to interact with your serverless endpoints. The SDK provides an async Python interface for making requests to your endpoints; you’ll use it after setting up your infrastructure.
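For example, assuming the SDK is published on PyPI under the name below (the package name is an assumption; check the official docs for the exact name):

```bash
# Package name assumed for illustration; verify against the official Vast.ai docs.
pip install vastai-sdk
```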
API Key Setup
Set your Vast.ai API key as an environment variable:
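For example, in your shell (the variable name VAST_API_KEY is assumed here; confirm the exact name the SDK reads in its documentation):

```bash
# VAST_API_KEY is an assumed variable name; verify against the SDK docs.
export VAST_API_KEY="your-api-key-here"
```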
HuggingFace Token Setup
Many popular models like Llama and Mistral require authentication to download. Configure your HuggingFace token once at the account level:
- Navigate to your Account Settings
- Expand the “Environment Variables” section
- Add a new variable:
  - Key: HF_TOKEN
  - Value: Your HuggingFace read-access token
- Click the “+” button, then “Save Edits”
This token will be securely available to all your serverless workers. You only need to set it once for your account.
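If you want to confirm the token itself is valid before deploying, one optional local check uses the huggingface_hub Python package (a sanity check on your own machine, separate from the account-level variable above):

```python
# Optional local sanity check that your HuggingFace token is valid.
# Requires: pip install huggingface_hub
from huggingface_hub import whoami

info = whoami(token="hf_your_read_token_here")  # use your read-access token
print(info["name"])  # prints your HuggingFace username if the token is valid
```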
Deploy Your First Endpoint
1. Create an Endpoint
Navigate to the Serverless Dashboard and click “Create Endpoint”. Use these recommended settings for your first deployment:

| Setting | Value | Description |
|---|---|---|
| Endpoint Name | vLLM-Qwen3-8B | Choose a descriptive name for your endpoint |
| Cold Multiplier | 3 | Scales capacity based on predicted load |
| Cold Workers | 5 | Pre-loaded instances for instant scaling |
| Max Workers | 16 | Maximum number of GPU instances |
| Minimum Load | 3000 | Baseline capacity in tokens per second |
| Target Utilization | 0.9 | Resource usage target (90%) |

Click “Next” to proceed.
2. Create a Workergroup
From the Serverless page, click “+ Workergroup” under your endpoint. Select the vLLM (Serverless) template, which comes pre-configured with:
- Model: Qwen/Qwen3-8B (8-billion-parameter LLM)
- Framework: vLLM for high-performance inference
- API: OpenAI-compatible endpoints (see the request sketch below)

Click “Next” to proceed with the default settings.
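Because the template serves OpenAI-compatible routes, requests to your endpoint follow the standard chat-completions schema. A rough sketch of the request body (all field values here are illustrative):

```python
# Shape of an OpenAI-compatible chat-completions request body, as served by
# the vLLM template. All field values below are illustrative examples.
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what vLLM does."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
```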

3. Wait for Workers to Initialize
Your serverless infrastructure is now being provisioned. This process takes time because each worker needs to:
- Start up its GPU instance
- Download the model (roughly 16 GB of bf16 weights for Qwen3-8B)
- Load the model into GPU memory
- Complete health checks

Workers appear in the dashboard in one of these states:
- Stopped: Worker has the model loaded and is ready to activate on-demand (cold worker)
- Loading: Worker is starting up and loading the model into GPU memory
- Ready: Worker is active and handling requests

To monitor progress:
- Click on any worker in the dashboard to view its logs
- Logs show model download progress, loading status, and any startup errors
- Checking logs early helps identify issues before requests time out

The SDK automatically holds and retries requests until workers are ready. However, for best performance, wait for at least one worker to show “Ready” or “Stopped” status before making your first call.
Make Your First API Call
Basic Usage
With the SDK installed, you’re ready to make your first API call. The SDK handles all the routing, worker assignment, and authentication automatically; you just need to specify your endpoint name and make requests.
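A minimal sketch of what a first call might look like. The client class, constructor arguments, and method names below are assumptions for illustration, not the confirmed Vast SDK surface; check the SDK documentation for the exact names:

```python
# Illustrative sketch only: the class and method names below are assumed,
# not confirmed against the actual Vast SDK API.
import asyncio

from vastai import Serverless  # hypothetical import path


async def main():
    # "vLLM-Qwen3-8B" is the endpoint name chosen in step 1 above.
    client = Serverless(endpoint_name="vLLM-Qwen3-8B")

    # The vLLM template serves OpenAI-compatible chat completions,
    # so the request payload follows that schema.
    response = await client.chat_completion(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": "Hello! Introduce yourself."}],
        max_tokens=128,
    )
    print(response)


asyncio.run(main())
```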
Troubleshooting
Workers stuck in 'Loading' state
- Check if the GPU has enough VRAM for your model
- Verify your model name is correct
- Check worker logs in the dashboard by clicking on the worker
- Ensure your HF_TOKEN is properly configured for gated models
'No workers available' error
- The SDK automatically retries requests until workers are ready
- If this persists, check endpoint status in the Serverless Dashboard
- Verify workers are not stuck in “Loading” state (see troubleshooting above)
Slow response times
- First request may take longer as workers activate from a cold state
- Increase max_workers if all workers are full with requests
- Increase min_load if there aren’t enough workers immediately available when multiple requests are sent
- If there are large spikes of requests, increase cold_workers or decrease the target utilization
- Consider worker region placement relative to your users
Need help? Join our Discord community or check the detailed documentation for advanced configurations.