Running SGLang Router on Vast.ai
When serving LLMs in production, a single GPU instance quickly becomes a bottleneck. Requests queue up during traffic spikes, latency increases, and scaling requires expensive hardware upgrades. SGLang Router solves this by distributing requests across multiple workers running the same model on separate GPUs. Instead of vertical scaling (buying bigger GPUs), you scale horizontally by adding more workers.

What makes SGLang Router particularly effective is its cache-aware routing policy. Traditional load balancers distribute requests randomly or round-robin, which fragments the KV cache across workers. SGLang Router maintains a prefix tree of cached prompts and routes similar requests to the same worker, maximizing cache reuse and reducing latency. This means you get better performance from the same hardware compared to naive load balancing.

This guide walks through deploying Llama 3.1 8B on two Vast.ai GPU instances with SGLang, setting up the router to distribute requests between them, and testing inference through the OpenAI-compatible API. You'll see how to configure different routing policies, scale to additional workers, and monitor request distribution across the system.

What This Guide Covers
- Deploy SGLang workers on Vast.ai GPU instances
- Set up SGLang Router for load balancing
- Test the deployment with curl and Python
- Configure different models and routing policies
Why Vast.ai
This deployment requires multiple GPU instances with direct port access for the SGLang API endpoints. Vast.ai provides on-demand GPU rentals with per-minute billing and static IPs, allowing you to deploy workers as needed without long-term commitments. The marketplace model offers access to a variety of GPU types at competitive spot pricing.

Hardware Requirements
- GPU VRAM: 24GB minimum per worker (Llama 3.1 8B requires ~14GB for model weights in BF16 precision, plus overhead for KV cache and batch processing)
- Disk Space: 100GB per instance (SGLang Docker image is ~15GB, model weights are ~15GB, plus workspace)
- Compute Capability: 7.0+ (Volta architecture or newer for optimal performance)
- Direct Port Access: Required for exposing the SGLang API endpoint
This guide uses the following setup:
- GPUs: 2x RTX 4090 (24GB each)
- SGLang Router: v0.3.2
- Model: meta-llama/Llama-3.1-8B-Instruct
Prerequisites
- Vast.ai account and API key (Sign up here)
- HuggingFace token for model access
- Python 3.10+
Step 1: Find GPU Instances
Search for available GPUs that meet the model's requirements. The search filters are:
- gpu_ram >= 24: At least 24GB VRAM (required for Llama 8B)
- compute_cap >= 70: Volta architecture or newer (ensures compatibility)
- direct_port_count >= 1: At least one direct port for API access
- rentable=true: Instance is available to rent
- --order dph_total: Sort by price (dollars per hour, cheapest first)
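Putting those filters together, a search with the Vast.ai CLI might look like the sketch below (quoting and exact filter syntax may vary slightly between CLI versions):

```bash
# Search rentable offers with >=24GB VRAM, a modern compute capability,
# and at least one direct port, sorted by price (cheapest first)
vastai search offers 'gpu_ram>=24 compute_cap>=70 direct_port_count>=1 rentable=true' --order dph_total
```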
The output lists each offer with:
- GPU model (RTX 4090, A5000, etc.)
- dph_total: Price per hour in dollars
- gpu_ram: VRAM available
- Reliability scores

Choose two offers with a low dph_total (price) and high reliability scores. Write down the two offer IDs - you'll need them in the next step.
Step 2: Deploy SGLang Workers
Create two worker instances running the same model. This allows the router to distribute requests across multiple GPUs. Replace <OFFER_ID_1> and <OFFER_ID_2> with the IDs from Step 1:
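A sketch of the two create commands, using the flags explained below. The exact --onstart-cmd shown here is an assumption based on SGLang's standard launch command (python3 -m sglang.launch_server):

```bash
# Deploy worker 1
vastai create instance <OFFER_ID_1> \
  --image lmsysorg/sglang:latest \
  --env "-p 8000:8000 -e HF_TOKEN=$HF_TOKEN" \
  --disk 100 \
  --onstart-cmd "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000"

# Deploy worker 2 (identical except for the offer ID)
vastai create instance <OFFER_ID_2> \
  --image lmsysorg/sglang:latest \
  --env "-p 8000:8000 -e HF_TOKEN=$HF_TOKEN" \
  --disk 100 \
  --onstart-cmd "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000"
```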
- --image lmsysorg/sglang:latest: Official SGLang Docker image (~15GB)
- --env "-p 8000:8000 -e HF_TOKEN=$HF_TOKEN": Expose port 8000 and pass the HuggingFace token
- --disk 100: Allocate 100GB of disk space (needed for model weights ~15GB + image ~15GB)
- --onstart-cmd: Command to run when the instance starts - launches the SGLang server
"success": true and an instance ID number. Example:
Step 3: Wait for Instances to Start
The instances need 5-10 minutes to initialize. During this time:
- SGLang Docker image is downloaded (~15GB)
- Model weights are downloaded from HuggingFace (~15GB for Llama 8B)
- Model is loaded into GPU memory
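You can watch progress from the CLI; a quick sketch, assuming the standard vastai show instances listing command:

```bash
# List your instances and their current status
vastai show instances
```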
The status column will progress through:
- loading: Instance is initializing
- running: Instance is ready

Wait until both instances reach the running status.
Verify SGLang is ready by checking logs:
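A sketch, assuming the vastai CLI's logs command; replace <INSTANCE_ID> with one of the IDs returned in Step 2:

```bash
# Tail the container logs for one worker; the server is ready once it
# reports that it is listening on port 8000
vastai logs <INSTANCE_ID>
```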
Step 4: Get Worker Endpoints
Now you need to find the public URLs for your workers.

Option 1: Web Console (Easiest)
- Navigate to https://cloud.vast.ai/instances/
- Find your instances in the list
- Click the IP address button for each instance
- Note the public IP and port mapping for port 8000
The mapping will look something like 8000:45678. This means:
- 8000 is the container port (internal)
- 45678 is the host port (external - use this one!)

The worker endpoint is http://<PUBLIC_IP>:<HOST_PORT>
Option 2: CLI
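A sketch of the CLI route; the --raw flag for machine-readable JSON output is an assumption about the vastai CLI version you have installed:

```bash
# Print instance details as JSON, including the public IP and the
# host port mapped to container port 8000
vastai show instances --raw
```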
Step 5: Start SGLang Router
Install and start the router locally. The router will run on your machine and distribute requests to the remote Vast.ai workers. Replace <WORKER1_IP>:<WORKER1_PORT> and <WORKER2_IP>:<WORKER2_PORT> with the actual endpoints from Step 4.
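A sketch of the install and launch, assuming the sglang-router package and its --worker-urls, --policy, and --port flags (flag names may vary slightly by version; this guide used v0.3.2):

```bash
pip install sglang-router

# Start the router on port 30000 with round-robin load balancing
python -m sglang_router.launch_router \
  --worker-urls http://<WORKER1_IP>:<WORKER1_PORT> http://<WORKER2_IP>:<WORKER2_PORT> \
  --policy round_robin \
  --port 30000
```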
What you’ll see:
The router will start and display logs indicating it has detected the workers. Look for messages about:
- Router starting on port 30000
- Workers being registered
- Health checks passing
The router is now available at http://localhost:30000 and will distribute requests across your two workers using the round-robin policy.
Step 6: Test the Deployment
Test with curl
Send a test request to verify everything is working:
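A curl sketch against the router's OpenAI-compatible chat completions endpoint; the model name must match what the workers serve:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```

Test with Python

Equivalently, a sketch using the openai Python client pointed at the router (assumes pip install openai; any placeholder API key works unless you configured one on the server):

```python
from openai import OpenAI

# Point the client at the local router instead of api.openai.com
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```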
Configuration Options
Load Balancing Policies
Round Robin (--policy round_robin):
Distributes requests evenly across workers in circular order. Simple and predictable. Good for testing and uniform workloads.
Cache-Aware (--policy cache_aware):
Routes requests to workers likely to have relevant KV cache entries. Improves throughput by maximizing cache reuse. Recommended for production deployments with repeated or similar queries.
Power of Two (--policy power_of_two):
Selects two random workers and routes to the one with lower load. Better load distribution than round-robin under variable workload conditions.
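Whichever policy you choose, it is passed when the router is launched. For example, a sketch restarting the router from Step 5 with cache-aware routing (same assumed flags as before):

```bash
# Restart the router with the cache-aware policy
python -m sglang_router.launch_router \
  --worker-urls http://<WORKER1_IP>:<WORKER1_PORT> http://<WORKER2_IP>:<WORKER2_PORT> \
  --policy cache_aware
```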
Scaling to More Workers
Add more workers by deploying additional Vast.ai instances (repeat Step 2) and restarting the router with all worker URLs:
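A sketch with a third worker added - the same launch command as Step 5, with one more URL:

```bash
python -m sglang_router.launch_router \
  --worker-urls http://<WORKER1_IP>:<WORKER1_PORT> \
                http://<WORKER2_IP>:<WORKER2_PORT> \
                http://<WORKER3_IP>:<WORKER3_PORT> \
  --policy round_robin
```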
Cleanup

When finished, destroy your Vast.ai instances to stop charges:
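A teardown sketch, assuming the standard vastai destroy instance command; replace the IDs with the ones from Step 2:

```bash
# Destroy both workers so billing stops
vastai destroy instance <INSTANCE_ID_1>
vastai destroy instance <INSTANCE_ID_2>
```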
Next Steps
- Try different models: Deploy Qwen 2.5 or Mistral models by changing the --model-path parameter
- Scale horizontally: Add a third or fourth worker to increase throughput
- Use cache-aware policy: Switch to --policy cache_aware for production deployments with repeated queries
- Add monitoring: Track worker health and request distribution through the router's logs
Additional Resources
- SGLang GitHub Repository - Official SGLang project with documentation and examples
- Llama 3.1 Model Card - Meta’s Llama 3.1 8B Instruct on HuggingFace
- Vast.ai CLI Documentation - Complete reference for Vast.ai CLI commands