Deploying GLM-4.7-Flash on Vast.ai
GLM-4.7-Flash is a 30B-parameter Mixture of Experts model from Zhipu AI that activates 3B parameters per token. Despite having roughly the same active parameter count as Qwen3-30B-A3B, GLM-4.7-Flash has a fundamentally different attention architecture: it uses 20 key-value heads with a 256-dimensional value head (compared to Qwen3’s 4 KV heads with 128-dimensional values). This means its KV cache consumes approximately 10x more memory per token of context, making hardware selection critical for long-context deployments. This guide covers deploying GLM-4.7-Flash on Vast.ai using SGLang with 4x RTX 3090 GPUs.
Why the High Memory Requirement
GLM-4.7-Flash uses Multi-Head Attention (MHA) rather than the Grouped Query Attention (GQA) common in recent MoE models. This architectural choice affects KV cache size directly:
| Parameter | GLM-4.7-Flash | Qwen3-30B-A3B |
|---|---|---|
| num_hidden_layers | 47 | 48 |
| num_attention_heads | 20 | 32 |
| num_key_value_heads | 20 | 4 |
| hidden_size | 2048 | 2048 |
| v_head_dim | 256 | 128 |
Per token of context, the KV cache is roughly 2 (K and V) x num_hidden_layers x num_key_value_heads x head_dim x 2 bytes (float16):
- GLM-4.7-Flash: 2 x 47 x 20 x 256 x 2 bytes = ~962 KB/token
- Qwen3-30B-A3B: 2 x 48 x 4 x 128 x 2 bytes = ~96 KB/token
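As a quick sanity check on these figures, here is a short Python sketch that mirrors the simplified per-token formula above (it ignores KV cache quantization and per-implementation overhead):

```python
# Per-token KV cache size, mirroring the simplified formula above:
# 2 (K and V) * num_layers * num_kv_heads * head_dim * 2 bytes (float16).
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

glm = kv_cache_bytes_per_token(num_layers=47, num_kv_heads=20, head_dim=256)
qwen = kv_cache_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)

context_len = 8192  # context length used in this guide
print(f"GLM-4.7-Flash : {glm:,} bytes/token -> {glm * context_len / 1e9:.1f} GB at {context_len} tokens")
print(f"Qwen3-30B-A3B : {qwen:,} bytes/token -> {qwen * context_len / 1e9:.1f} GB at {context_len} tokens")
```

At the 8,192-token context used in this guide, that comes to roughly 7.9 GB of KV cache for GLM-4.7-Flash versus about 0.8 GB for Qwen3-30B-A3B, before accounting for batching.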
What We’re Deploying
This guide uses the following configuration:
- GPUs: 4x RTX 3090 (96 GB total VRAM)
- Context Length: 8,192 tokens
- Disk: 200 GB
- CUDA: 12.2–12.6
- Docker image: `lmsysorg/sglang:dev-pr-17247`; you must use this specific image, because the `latest` tag lacks MLA support for GLM-4.7-Flash
Prerequisites
- A Vast.ai account with credits
- Vast.ai CLI installed (`pip install vastai`)
- Your Vast.ai API key configured (`vastai set api-key YOUR_API_KEY`)
Step 1: Find an Instance
Search for 4x RTX 3090 instances using the following filters (a sample command is shown after the list):
- `gpu_name=RTX_3090`: Target GPU type
- `num_gpus=4`: Four GPUs for tensor parallelism
- `direct_port_count>=1`: At least one direct port for API access
- `cuda_vers>=12.2 cuda_vers<=12.6`: CUDA version range that avoids driver issues
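A minimal search along these lines should work (a sketch: the query string matches the filters above, while the `-o 'dph+'` ordering flag, which sorts by price per hour, is an assumption that may differ across CLI versions):

```bash
# List 4x RTX 3090 offers matching the filters, cheapest first.
# Note the offer ID of the machine you want to rent.
vastai search offers \
  'gpu_name=RTX_3090 num_gpus=4 direct_port_count>=1 cuda_vers>=12.2 cuda_vers<=12.6' \
  -o 'dph+'
```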
Step 2: Deploy the Model
Generate an API key to secure your endpoint, then create the instance, replacing <OFFER_ID> with the ID from Step 1.
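Putting this together, the deployment might look like the sketch below. Treat it as a template rather than a verbatim command: the Hugging Face model id (`zai-org/GLM-4.7-Flash`), port 30000, and the `--direct`, `--env`, and `--onstart-cmd` flags are assumptions that may need adjusting for your vastai CLI version and template.

```bash
# Generate a random bearer token to secure the endpoint (any random string works).
export SGLANG_API_KEY=$(openssl rand -hex 32)
echo "API key: $SGLANG_API_KEY"

# Rent the offer from Step 1 and launch the SGLang server on startup.
# Replace <OFFER_ID> with the offer ID from `vastai search offers`.
vastai create instance <OFFER_ID> \
  --image lmsysorg/sglang:dev-pr-17247 \
  --disk 200 \
  --direct \
  --env '-p 30000:30000' \
  --onstart-cmd "python3 -m sglang.launch_server \
      --model-path zai-org/GLM-4.7-Flash \
      --tp-size 4 \
      --context-length 8192 \
      --dtype float16 \
      --mem-fraction-static 0.85 \
      --trust-remote-code \
      --api-key $SGLANG_API_KEY \
      --host 0.0.0.0 --port 30000"
```

Keep the `--api-key` value handy; it is reused when sending requests in Step 4.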
- `--tp-size 4`: Distribute the model across all 4 GPUs using tensor parallelism
- `--context-length 8192`: Maximum sequence length (increase if you have more VRAM)
- `--dtype float16`: Required for the RTX 3090, which does not natively support bfloat16; use `--dtype bfloat16` on A100/H100
- `--mem-fraction-static 0.85`: Allocate 85% of GPU memory for model weights and KV cache
- `--trust-remote-code`: Required for the GLM-4.7-Flash architecture
- `--api-key`: Secures the endpoint with bearer token authentication
For longer context, use GPUs with more VRAM (such as an A100 or H100) and increase `--context-length`. A100/H100 also support `--dtype bfloat16`.
Step 3: Monitor Startup
The model is ~60 GB and takes 8–10 minutes to download on first deployment. Monitor progress in the instance logs and watch for these messages:
- `Loading model weights`: download and loading in progress
- `The server is fired up and ready to roll!`: server is ready to accept requests
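To follow the startup logs from your own machine, the CLI's `logs` subcommand can be used (a sketch; replace `<INSTANCE_ID>` with the ID printed by `vastai create instance` or shown by `vastai show instances`):

```bash
# Fetch recent log output from the instance; rerun until the server reports ready.
vastai logs <INSTANCE_ID>
```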
Step 4: Send Requests
Using curl
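The example below is a sketch rather than a copy-paste command: it assumes the server's OpenAI-compatible `/v1/chat/completions` route, that `<INSTANCE_IP>` and `<MAPPED_PORT>` come from `vastai show instance <INSTANCE_ID>` (Vast.ai maps container port 30000 to a different external port), and that the served model name matches the model path used at launch.

```bash
curl http://<INSTANCE_IP>:<MAPPED_PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $SGLANG_API_KEY" \
  -d '{
    "model": "zai-org/GLM-4.7-Flash",
    "messages": [
      {"role": "user", "content": "Write a Python function that checks if a number is prime."}
    ],
    "max_tokens": 512
  }'
```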
Using OpenAI SDK
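Since SGLang exposes an OpenAI-compatible API, the official `openai` Python package can talk to it directly. The host, port, and model name below are the same placeholders as in the curl example and need to be substituted:

```python
from openai import OpenAI

# Point the OpenAI client at the SGLang server's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://<INSTANCE_IP>:<MAPPED_PORT>/v1",
    api_key="YOUR_API_KEY",  # the key passed to --api-key at launch
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",  # assumed served model name; check /v1/models
    messages=[
        {"role": "user", "content": "Explain tensor parallelism in two sentences."}
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)
```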
Cleanup
Destroy your instance when finished to stop charges.
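A minimal cleanup command, using the CLI's `destroy instance` subcommand (replace `<INSTANCE_ID>` with your instance's ID):

```bash
# Permanently delete the instance and stop all billing for it.
vastai destroy instance <INSTANCE_ID>
```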
Conclusion
GLM-4.7-Flash offers strong reasoning and coding capabilities in a 3B active parameter footprint. The trade-off is its attention architecture: using 20 KV heads instead of 4 means you need more VRAM per token of context than similarly sized MoE models. For applications that need 8k context windows, 4x RTX 3090 provides a low-cost deployment option. For longer context requirements, scaling to A100 or H100 instances lets you increase `--context-length` in proportion to the available VRAM.
Next Steps
- Increase context: Use GPUs with more VRAM (like A100 or H100) to serve longer context windows
- Add load balancing: Use SGLang Router to distribute requests across multiple instances
Additional Resources
- GLM-4.7-Flash Model Card — Model weights and architecture details
- SGLang Documentation — SGLang server configuration and features
- SGLang Docker Images — Available Docker tags including dev builds
- Vast.ai CLI Guide — Complete CLI reference for managing instances