Deploy Qwen3-8B on a Vast.ai GPU with vLLM and connect OpenClaw to it for private, self-hosted AI conversations with tool use.
Overview
OpenClaw is an open-source AI assistant that runs locally on your machine. It supports multiple model providers through an OpenAI-compatible API, including self-hosted models via vLLM. In this guide, you will:
- Launch a vLLM inference server on a Vast.ai GPU serving Qwen3-8B
- Install and configure OpenClaw locally to connect to the remote vLLM server
- Send messages through OpenClaw and receive responses from Qwen3-8B
Requirements
- Vast.ai account with credits loaded (quickstart guide)
- SSH key added to your Vast.ai account (SSH setup guide)
- Node.js 22.12.0 or later (nodejs.org)
- A terminal with curl available
Step 1: Install the Vast.ai CLI
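The CLI is published on PyPI as vastai. Install it with pip, then authenticate with the API key from your Vast.ai account settings:

```bash
pip install --upgrade vastai
```

```bash
# Paste the API key from the Vast.ai console (Account > API Keys)
vastai set api-key YOUR_API_KEY
```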
Step 2: Install OpenClaw
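OpenClaw installs globally through npm. The package name here is an assumption; the version pin matches the compatible release described under Troubleshooting:

```bash
# Package name on npm is an assumption; pinning 2026.2.13 avoids the
# streaming bug covered in Troubleshooting
npm install -g openclaw@2026.2.13
```

```bash
# Confirm the binary is on your PATH
openclaw --version
```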
OpenClaw requires Node.js 22.12.0 or later. If you see a version error, update Node.js or use nvm to install a compatible version.
Step 3: Find a GPU Instance
Search for an RTX 3090 with direct port access:
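```bash
# Filters and ordering are one reasonable choice, not the only one:
# RTX 3090, at least one direct port, rentable, verified, cheapest first
vastai search offers 'gpu_name=RTX_3090 direct_port_count>=1 rentable=true verified=true' -o 'dph_total'
```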
Step 4: Deploy vLLM with Qwen3-8B
Create an instance using the offer ID from Step 3. The offer ID is in the first column (ID) of the search results.
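A command in this shape matches the setup described below; the image tag is an assumption, so substitute the image from Vast's current vLLM template if it differs, and the flag set follows Vast's typical SSH plus direct-port deployment:

```bash
vastai create instance YOUR_OFFER_ID \
  --image vastai/vllm:latest \
  --disk 40 \
  --ssh --direct \
  --env '-p 8000:8000 -e VLLM_MODEL=Qwen/Qwen3-8B -e VLLM_ARGS="--max-model-len 32000 --enable-auto-tool-choice --tool-call-parser hermes" -e OPEN_BUTTON_TOKEN=1'
```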
Replace YOUR_OFFER_ID with the ID from Step 3 (e.g., 12345678).
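The CLI prints a short confirmation (values illustrative):

```text
Started. {'success': True, 'new_contract': 98765432}
```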
Note the new_contract value in the output; this is your instance ID, which is different from the offer ID. You will use the instance ID in the remaining steps.
This command uses Vast’s vLLM image, which includes a reverse proxy that automatically generates an authentication token (OPEN_BUTTON_TOKEN) for your instance.
Key environment variables:
| Variable | Purpose |
|---|---|
| VLLM_MODEL | Hugging Face model to serve |
| VLLM_ARGS | Arguments passed to vllm serve |
| --max-model-len 32000 | Maximum context length for RTX 3090 |
| --enable-auto-tool-choice | Required for OpenClaw tool calling |
| --tool-call-parser hermes | Tool call format compatible with Qwen3 |
| OPEN_BUTTON_TOKEN=1 | Tells the image to generate an authentication token |
The --max-model-len value of 32000 is tuned for the RTX 3090. The model uses ~15 GiB of VRAM, leaving ~4.5 GiB for KV cache. Using 32768 (Qwen3-8B’s native context) will fail with an out-of-memory error.
If the create command returns success: False, the machine may be unavailable. Try a different offer ID from Step 3.
Step 5: Wait for Model Loading
Replace YOUR_INSTANCE_ID with the new_contract value from Step 4 (e.g., 98765432).
Wait for the status to show running:
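```bash
# Re-runs the status check every 10 seconds; press Ctrl+C to exit
watch -n 10 vastai show instance YOUR_INSTANCE_ID
```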
Then follow the vLLM logs until you see “Application startup complete.”:
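```bash
# Prints the SSH endpoint, e.g. ssh://root@ssh5.vast.ai:33426
vastai ssh-url YOUR_INSTANCE_ID
```

The log path below is an assumption about where Vast's vLLM image writes its service log; if the file is missing, look under /var/log on the instance:

```bash
ssh -p PORT root@HOST 'tail -f /var/log/portal/vllm.log'
```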
Replace PORT and HOST with the values from the ssh-url output (e.g., ssh://root@ssh5.vast.ai:33426 means HOST=ssh5.vast.ai and PORT=33426).
vLLM will download the model weights (~16 GB), then initialize the GPU and start the API server. This typically takes 3-8 minutes depending on download speed. Press Ctrl+C to stop watching once you see the startup message.
Step 6: Get Connection Details
Find your instance’s IP address and port:
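```bash
vastai show instance YOUR_INSTANCE_ID --raw
```

In the JSON output, note the public IP and the external port mapped to container port 8000 (abridged, illustrative values):

```json
{
  "public_ipaddr": "203.0.113.7",
  "ports": {
    "8000/tcp": [{ "HostIp": "0.0.0.0", "HostPort": "41022" }]
  }
}
```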
The reverse proxy protects the API with the OPEN_BUTTON_TOKEN generated at instance creation. SSH into the instance to get it:
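```bash
# HOST and PORT come from the ssh-url output in Step 5
ssh -p PORT root@HOST 'echo $OPEN_BUTTON_TOKEN'
```

If the variable is empty in a non-interactive shell, try grep OPEN_BUTTON_TOKEN /etc/environment instead. The token is an opaque string (illustrative):

```text
9f8e7d6c5b4a3f2e1d0c
```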
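With the IP, external port, and token in hand, you can exercise the OpenAI-compatible API directly with curl; the shell variables below are placeholders for your values:

```bash
export INSTANCE_IP=203.0.113.7     # public IP from above
export EXTERNAL_PORT=41022         # external port mapped to 8000
export TOKEN=YOUR_OPEN_BUTTON_TOKEN

# List the served models
curl -s http://$INSTANCE_IP:$EXTERNAL_PORT/v1/models \
  -H "Authorization: Bearer $TOKEN"
```

A healthy server returns the model list (abridged):

```json
{ "object": "list", "data": [{ "id": "Qwen/Qwen3-8B", "object": "model" }] }
```

Then send a test chat completion:

```bash
curl -s http://$INSTANCE_IP:$EXTERNAL_PORT/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 128
  }'
```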
Qwen3-8B includes a thinking mode by default. The response may contain <think>...</think> reasoning tokens before the final answer. This is expected behavior.
Step 7: Configure OpenClaw
Set the vLLM API key environment variable to the OPEN_BUTTON_TOKEN from Step 6:
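```bash
export VLLM_API_KEY=YOUR_OPEN_BUTTON_TOKEN
```

To keep it across shells, append the export to your shell profile (bash shown; adapt for your shell):

```bash
echo 'export VLLM_API_KEY=YOUR_OPEN_BUTTON_TOKEN' >> ~/.bashrc
```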
Then create the OpenClaw configuration file at ~/.openclaw/openclaw.json:
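The exact schema varies by OpenClaw version; this sketch uses the field names documented in the table below, but treat the surrounding nesting and the $VLLM_API_KEY indirection syntax as assumptions:

```json
{
  "models": {
    "vllm": {
      "baseUrl": "http://INSTANCE_IP:EXTERNAL_PORT/v1",
      "apiKey": "$VLLM_API_KEY",
      "api": "openai-completions",
      "model": "Qwen/Qwen3-8B",
      "reasoning": false,
      "contextWindow": 32000,
      "maxTokens": 8192
    }
  }
}
```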
Replace INSTANCE_IP:EXTERNAL_PORT with the values from Step 6.
Key configuration fields:
| Field | Purpose |
|---|---|
| baseUrl | Your vLLM API endpoint from Step 6 |
| apiKey | Reads the VLLM_API_KEY environment variable at runtime |
| api | Protocol to use; openai-completions for vLLM’s OpenAI-compatible API |
| reasoning | Set to false to disable structured reasoning (Qwen3’s thinking mode is separate) |
| contextWindow | Must match the --max-model-len value from Step 4 |
| maxTokens | Maximum tokens per response |
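To confirm the configuration parses and the provider is registered, you can ask OpenClaw to list its models; the subcommand name here is a guess, so fall back to openclaw --help if it isn’t recognized:

```bash
openclaw models list
```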
Step 8: Test OpenClaw
Send a message through OpenClaw to the vLLM backend:
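```bash
# One-shot message; apart from --thinking (explained below), the invocation
# shape is an assumption - check openclaw --help for your version
openclaw --thinking off "In one sentence, what model are you?"
```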
The --thinking off flag disables Qwen3’s reasoning mode. Without it, responses may include <think>...</think> tokens before the answer.
You now have a private AI assistant powered by your own GPU, no third-party API keys required. From here, you can start an interactive session, connect additional tools, or swap in a different model.
Troubleshooting
Instance stuck in “loading”
If the instance stays in loading for more than 15 minutes, it may have failed silently. Destroy it and try a different offer from Step 3:
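```bash
vastai destroy instance YOUR_INSTANCE_ID
```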
“Model context window too small” error
OpenClaw requires a minimum context window of 16,000 tokens. If you see this error, check that --max-model-len in the vLLM creation command is set to at least 32000. OpenClaw’s system prompt and tool schemas consume approximately 12,000-13,000 tokens, so the model needs enough remaining context for your messages and responses.
“auto tool choice requires --enable-auto-tool-choice” error
OpenClaw uses tool calling by default. Add --enable-auto-tool-choice --tool-call-parser hermes to the vLLM creation command.
“LLM request timed out” with newer OpenClaw versions
OpenClaw versions after 2026.2.13 have a known bug in the embedded agent’s streaming response path. The vLLM server generates tokens correctly, but OpenClaw’s client never commits the assistant payload, causing a timeout after ~30 seconds. Direct curl requests to the same endpoint work fine.
To fix this, downgrade to the compatible version:
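```bash
# Package name on npm is an assumption
npm install -g openclaw@2026.2.13
```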
Context overflow errors
If you see “Context overflow: prompt too large for the model”, the conversation has exceeded the model’s context window. Start a fresh session:
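```bash
# Flag name is an assumption; check openclaw --help for session controls
openclaw --new-session
```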
Cleanup
When you’re done, destroy the instance to stop billing:
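```bash
vastai destroy instance YOUR_INSTANCE_ID
```

Billing stops once the instance is destroyed; you can verify with vastai show instances that nothing is still running.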