Overview
OpenClaw is an open-source AI assistant that runs locally on your machine. It supports multiple model providers through an OpenAI-compatible API, including self-hosted models via vLLM. In this guide, you will:
- Launch a vLLM inference server on a Vast.ai GPU serving Qwen3-8B
- Install and configure OpenClaw locally to connect to the remote vLLM server
- Send messages through OpenClaw and receive responses from Qwen3-8B
Requirements
- Vast.ai account with credits loaded (quickstart guide)
- SSH key added to your Vast.ai account (SSH setup guide)
- Node.js 22.12.0 or later (nodejs.org)
- A terminal with curl available
Step 1: Install the Vast.ai CLI
Step 2: Install OpenClaw
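A sketch, assuming OpenClaw ships as a globally installed npm package (verify the package name against the project's README):

```shell
# Install OpenClaw globally via npm
npm install -g openclaw

# Confirm the CLI is on PATH and check the version
openclaw --version
```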
OpenClaw requires Node.js 22.12.0 or later. If you see a version error, update Node.js or use nvm to install a compatible version.
Step 3: Find a GPU Instance
Search for an RTX 3090 with direct port access:
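A sketch of the search, using the vastai CLI's offer-query syntax; the exact filters and ordering field are illustrative:

```shell
# List single-GPU RTX 3090 offers with direct port access, cheapest first
vastai search offers 'gpu_name=RTX_3090 num_gpus=1 direct_port_count>=2 rentable=true' -o 'dph'
```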
Step 4: Deploy vLLM with Qwen3-8B
Create an instance using the offer ID from Step 3. The offer ID is in the first column (ID) of the search results.
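A sketch of the creation command; the image tag is illustrative of Vast's vLLM template, and the environment string uses the variables described below:

```shell
vastai create instance YOUR_OFFER_ID \
  --image vastai/vllm:latest \
  --env '-p 8000:8000 -e VLLM_MODEL=Qwen/Qwen3-8B -e VLLM_ARGS="--max-model-len 32000 --enable-auto-tool-choice --tool-call-parser hermes" -e OPEN_BUTTON_TOKEN=1' \
  --disk 60 \
  --ssh --direct
```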
Replace YOUR_OFFER_ID with the ID from Step 3 (e.g., 12345678).
Note the new_contract value in the output — this is your instance ID, which is different from the offer ID. You will use the instance ID in the remaining steps.
This command uses Vast’s vLLM image, which includes a reverse proxy that automatically generates an authentication token (OPEN_BUTTON_TOKEN) for your instance.
Key environment variables:
| Variable | Purpose |
|---|---|
| VLLM_MODEL | Hugging Face model to serve |
| VLLM_ARGS | Arguments passed to vllm serve |
| --max-model-len 32000 | Maximum context length for RTX 3090 |
| --enable-auto-tool-choice | Required for OpenClaw tool calling |
| --tool-call-parser hermes | Tool call format compatible with Qwen3 |
| OPEN_BUTTON_TOKEN=1 | Tells the image to generate an authentication token |
The --max-model-len value of 32000 is tuned for the RTX 3090. The model uses ~15 GiB of VRAM, leaving ~4.5 GiB for the KV cache. Using 32768 (Qwen3-8B’s native context) will fail with an out-of-memory error.
If the create command returns success: False, the machine may be unavailable. Try a different offer ID from Step 3.
Step 5: Wait for Model Loading
Replace YOUR_INSTANCE_ID with the new_contract value from Step 4 (e.g., 98765432).
Wait for the status to show running:
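A sketch of polling the instance status with the vastai CLI:

```shell
# Poll until actual_status shows "running"
watch -n 10 'vastai show instance YOUR_INSTANCE_ID'
```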
Then follow the vLLM logs until you see the line Application startup complete.:
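One way to follow the startup logs: get the SSH connection string, then tail the serve log inside the instance. The log path is an assumption and may differ by image version:

```shell
# Get the SSH connection string (prints something like ssh://root@ssh5.vast.ai:33426)
vastai ssh-url YOUR_INSTANCE_ID

# Follow the vLLM serve log; the path below is an assumption
ssh -p PORT root@HOST 'tail -f /var/log/portal/vllm.log'
```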
Replace PORT and HOST with the values from the ssh-url output (e.g., ssh://root@ssh5.vast.ai:33426 means HOST=ssh5.vast.ai and PORT=33426).
vLLM will download the model weights (~16 GB), then initialize the GPU and start the API server. This typically takes 3–8 minutes depending on download speed. Press Ctrl+C to stop watching once you see the startup message.
Step 6: Get Connection Details
Find your instance’s IP address and port:
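A sketch using the vastai CLI; look for the public IP and the external port mapped to the container's port 8000:

```shell
# Show the instance's public IP and port mappings
vastai show instance YOUR_INSTANCE_ID
```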
The proxy also generates an OPEN_BUTTON_TOKEN that protects the API. SSH into the instance to get it:
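A sketch, assuming the token is exposed as an environment variable inside the instance (PORT and HOST come from the ssh-url output in Step 5):

```shell
# Print the generated token from inside the instance
ssh -p PORT root@HOST 'echo $OPEN_BUTTON_TOKEN'
```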
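To verify the endpoint end to end, you can send a test chat completion with curl. This is a sketch; substitute your instance's IP, external port, and token, and note that the bearer-token auth scheme is an assumption about the proxy:

```shell
curl http://INSTANCE_IP:EXTERNAL_PORT/v1/chat/completions \
  -H "Authorization: Bearer YOUR_OPEN_BUTTON_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```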
Qwen3-8B includes a thinking mode by default. The response may contain <think>...</think> reasoning tokens before the final answer. This is expected behavior.
Step 7: Configure OpenClaw
Set the vLLM API key environment variable to the OPEN_BUTTON_TOKEN from Step 6:
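A sketch; substitute the actual token value from Step 6 for the placeholder:

```shell
# Use the OPEN_BUTTON_TOKEN value from Step 6
export VLLM_API_KEY=YOUR_OPEN_BUTTON_TOKEN

# Optionally persist it across shell sessions
echo 'export VLLM_API_KEY=YOUR_OPEN_BUTTON_TOKEN' >> ~/.bashrc
```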
Create or edit ~/.openclaw/openclaw.json:
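A sketch of the provider configuration using the fields described below. The surrounding object shape, the model key name, the env-reference syntax for apiKey, and the maxTokens value are illustrative assumptions; check OpenClaw's configuration docs for the exact schema:

```json
{
  "models": {
    "vllm/qwen3-8b": {
      "baseUrl": "http://INSTANCE_IP:EXTERNAL_PORT/v1",
      "apiKey": "{env:VLLM_API_KEY}",
      "api": "openai-completions",
      "reasoning": false,
      "contextWindow": 32000,
      "maxTokens": 8192
    }
  }
}
```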
Replace INSTANCE_IP:EXTERNAL_PORT with the values from Step 6.
Key configuration fields:
| Field | Purpose |
|---|---|
| baseUrl | Your vLLM API endpoint from Step 6 |
| apiKey | Reads the VLLM_API_KEY environment variable at runtime |
| api | Protocol to use — openai-completions for vLLM’s OpenAI-compatible API |
| reasoning | Set to false to disable structured reasoning (Qwen3’s thinking mode is separate) |
| contextWindow | Must match the --max-model-len value from Step 4 |
| maxTokens | Maximum tokens per response |
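You can then sanity-check that OpenClaw picked up the new entry. The subcommand here is an assumption; consult openclaw --help if it differs in your version:

```shell
# List configured models (subcommand name is an assumption)
openclaw models list
```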
Step 8: Test OpenClaw
Send a message through OpenClaw to the vLLM backend:
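A sketch of a one-shot message; the agent subcommand and --message flag are assumptions about the CLI, while --thinking off is described in the note below:

```shell
openclaw agent --message "Hello! Which model are you?" --thinking off
```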
The --thinking off flag disables Qwen3’s reasoning mode. Without it, responses may include <think>...</think> tokens before the answer.
You now have a private AI assistant powered by your own GPU — no third-party API keys required. From here, you can start an interactive session, connect additional tools, or swap in a different model.
Troubleshooting
Instance stuck in “loading”
If the instance stays in loading for more than 15 minutes, it may have failed silently. Destroy it and try a different offer from Step 3:
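A sketch using the vastai CLI:

```shell
# Tear down the stuck instance, then create a new one from a different offer
vastai destroy instance YOUR_INSTANCE_ID
```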
“Model context window too small” error
OpenClaw requires a minimum context window of 16,000 tokens. If you see this error, check that --max-model-len in the vLLM creation command is set to at least 32000. OpenClaw’s system prompt and tool schemas consume approximately 12,000–13,000 tokens, so the model needs enough remaining context for your messages and responses.
“auto tool choice requires --enable-auto-tool-choice” error
OpenClaw uses tool calling by default. Add --enable-auto-tool-choice --tool-call-parser hermes to the vLLM creation command.
“LLM request timed out” with newer OpenClaw versions
OpenClaw versions after 2026.2.13 have a known bug in the embedded agent’s streaming response path. The vLLM server generates tokens correctly, but OpenClaw’s client never commits the assistant payload, causing a timeout after ~30 seconds. Direct curl requests to the same endpoint work fine.
To fix this, downgrade to the compatible version:
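A sketch, pinning the 2026.2.13 release mentioned above (the npm package name is an assumption):

```shell
npm install -g openclaw@2026.2.13
```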
Context overflow errors
If you see “Context overflow: prompt too large for the model”, the conversation has exceeded the model’s context window. Start a fresh session:
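A sketch; the session-reset flag is an assumption, so check openclaw --help for the actual option in your version:

```shell
# Start a new session with empty context (flag name is an assumption)
openclaw agent --new --message "Fresh start"
```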
Cleanup
When you’re done, destroy the instance to stop billing:
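Using the vastai CLI:

```shell
# Stops billing; the instance and its storage are deleted
vastai destroy instance YOUR_INSTANCE_ID
```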