Deploy Qwen3-8B on a Vast.ai GPU with vLLM and connect OpenClaw to it for private, self-hosted AI conversations with tool use.

Overview

OpenClaw is an open-source AI assistant that runs locally on your machine. It supports multiple model providers through an OpenAI-compatible API, including self-hosted models via vLLM. In this guide, you will:
  1. Launch a vLLM inference server on a Vast.ai GPU serving Qwen3-8B
  2. Install and configure OpenClaw locally to connect to the remote vLLM server
  3. Send messages through OpenClaw and receive responses from Qwen3-8B
This gives you a private AI assistant powered by your own GPU instance — no API keys from third-party providers needed.

Requirements

This guide creates a paid GPU instance that bills by the hour. An RTX 3090 typically costs $0.15–0.20/hr — following this guide end-to-end takes about 10 minutes and costs less than $0.05. Remember to destroy the instance when you’re done — see Cleanup.
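The cost estimate is simple back-of-the-envelope arithmetic, assuming the upper end of the typical hourly rate:

```python
# Worst-case cost for a 10-minute walkthrough at $0.20/hr.
hourly_rate = 0.20      # $/hr, upper end of typical RTX 3090 pricing
minutes_used = 10
cost = hourly_rate * minutes_used / 60
print(f"${cost:.3f}")   # prints $0.033
```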

Step 1: Install the Vast.ai CLI

Bash
pip install --upgrade vastai
vastai set api-key YOUR_API_KEY
Verify the CLI is working:
Bash
vastai show user
You should see your account details and credit balance.

Step 2: Install OpenClaw

Bash
npm install -g openclaw@2026.2.13
Verify the installation:
Bash
openclaw --version
Text
2026.2.13
OpenClaw requires Node.js 22.12.0 or later. If you see a version error, update Node.js or use nvm to install a compatible version.
This guide requires OpenClaw 2026.2.13. Later versions have a known bug where the embedded agent times out when connecting to self-hosted OpenAI-compatible backends like vLLM, even though the server is responding correctly. If you have a newer version installed, downgrade with npm install -g openclaw@2026.2.13.

Step 3: Find a GPU Instance

Search for an RTX 3090 with direct port access:
Bash
vastai search offers \
    "gpu_name = RTX_3090 num_gpus = 1 direct_port_count >= 1 cuda_vers >= 13.0" \
    --order "dph_base" --limit 5
The results show available machines sorted by price. Note the ID in the first column — you will use it in the next step. The RTX 3090 (24GB VRAM) is the minimum GPU for Qwen3-8B. The model requires ~15 GiB of VRAM, leaving ~4.5 GiB for KV cache.
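If you want to pick the cheapest offer programmatically, the CLI's --raw flag emits JSON you can parse. A minimal sketch — the 'id' and 'dph_total' field names are assumptions; inspect the raw output of your CLI version if they differ:

```python
import json

# Given the JSON output of `vastai search offers ... --raw`, return the
# ID of the cheapest offer. Field names ('id', 'dph_total') are assumed.
def cheapest_offer(raw_json: str) -> int:
    offers = json.loads(raw_json)
    best = min(offers, key=lambda o: o["dph_total"])
    return best["id"]

# Example with a stubbed response (in practice, capture the CLI output):
sample = '[{"id": 12345678, "dph_total": 0.18}, {"id": 87654321, "dph_total": 0.15}]'
print(cheapest_offer(sample))  # → 87654321
```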

Step 4: Deploy vLLM with Qwen3-8B

Create an instance using the offer ID from Step 3. The offer ID is in the first column (ID) of the search results.
Bash
vastai create instance YOUR_OFFER_ID \
    --image vastai/vllm:v0.16.0-cuda-12.9 \
    --env '-p 1111:1111 -p 8080:8080 -p 8000:8000 -p 8265:8265 -e OPEN_BUTTON_PORT=1111 -e OPEN_BUTTON_TOKEN=1 -e JUPYTER_DIR=/ -e DATA_DIRECTORY=/workspace/ -e PORTAL_CONFIG="localhost:1111:11111:/:Instance Portal|localhost:8000:18000:/docs:vLLM API|localhost:8265:28265:/:Ray Dashboard|localhost:8080:18080:/:Jupyter|localhost:8080:8080:/terminals/1:Jupyter Terminal" -e VLLM_MODEL=Qwen/Qwen3-8B -e VLLM_ARGS="--max-model-len 32000 --dtype auto --enable-auto-tool-choice --tool-call-parser hermes --host 127.0.0.1 --port 18000" -e AUTO_PARALLEL=true -e RAY_ADDRESS=127.0.0.1 -e RAY_ARGS="--head --port 6379 --dashboard-host 127.0.0.1 --dashboard-port 28265"' \
    --onstart-cmd 'entrypoint.sh' \
    --disk 50
Replace YOUR_OFFER_ID with the ID from Step 3 (e.g., 12345678).
Text
Started. {'success': True, 'new_contract': 98765432, 'instance_api_key': 'a1b2c3...'}
Note the new_contract value — this is your instance ID, which is different from the offer ID. You will use the instance ID in the remaining steps. This command uses Vast’s vLLM image, which includes a reverse proxy that automatically generates an authentication token (OPEN_BUTTON_TOKEN) for your instance. Key environment variables:
Variable | Purpose
--- | ---
VLLM_MODEL | Hugging Face model to serve
VLLM_ARGS | Arguments passed to vllm serve
--max-model-len 32000 | Maximum context length for RTX 3090
--enable-auto-tool-choice | Required for OpenClaw tool calling
--tool-call-parser hermes | Tool call format compatible with Qwen3
OPEN_BUTTON_TOKEN=1 | Tells the image to generate an authentication token
The --max-model-len value of 32000 is tuned for the RTX 3090. The model uses ~15 GiB of VRAM, leaving ~4.5 GiB for KV cache. Using 32768 (Qwen3-8B’s native context) will fail with an out-of-memory error.
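The headroom claim can be sanity-checked with rough KV-cache arithmetic. The layer count, KV-head count, and head dimension below are assumptions drawn from the published Qwen3-8B configuration — verify them against the model card:

```python
# Rough KV-cache budget for Qwen3-8B at bf16.
# Assumed model dims: 36 layers, 8 KV heads, head_dim 128 (check the model card).
layers, kv_heads, head_dim, dtype_bytes = 36, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
cache_gib = 32000 * bytes_per_token / 2**30
print(f"{bytes_per_token / 1024:.0f} KiB/token, {cache_gib:.1f} GiB for 32k tokens")
# prints: 144 KiB/token, 4.4 GiB for 32k tokens
```

A 32,000-token cache just fits in the ~4.5 GiB left after the weights, which is why the native 32,768 context tips the instance into an out-of-memory error.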
If the command returns success: False, the machine may be unavailable. Try a different offer ID from Step 3.

Step 5: Wait for Model Loading

Replace YOUR_INSTANCE_ID with the new_contract value from Step 4 (e.g., 98765432). Wait for the status to show running:
Bash
vastai show instance YOUR_INSTANCE_ID
Once the instance is running, SSH in and watch the vLLM log until the line Application startup complete. appears:
Bash
vastai ssh-url YOUR_INSTANCE_ID
Bash
ssh -p PORT root@HOST 'tail -f /var/log/portal/vllm.log'
Replace PORT and HOST with the values from the ssh-url output (e.g., ssh://root@ssh5.vast.ai:33426 means HOST=ssh5.vast.ai and PORT=33426). vLLM will download the model weights (~16 GB), then initialize the GPU and start the API server. This typically takes 3–8 minutes depending on download speed. Press Ctrl+C to stop watching once you see the startup message.
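Instead of tailing the log, you can also poll until the server answers. A sketch with the retry loop separated from the HTTP probe so it is easy to test; the endpoint URL and token are the values you will obtain in Step 6:

```python
import time
import urllib.request

# Generic poll loop: call `probe` until it returns True or the timeout elapses.
def wait_until(probe, timeout_s: float = 600, interval_s: float = 15) -> bool:
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)

# HTTP probe for the vLLM endpoint (URL and token come from Step 6).
def vllm_up(url: str, token: str) -> bool:
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused and HTTP errors
        return False

# Usage: wait_until(lambda: vllm_up("http://INSTANCE_IP:EXTERNAL_PORT/v1/models", TOKEN))
```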

Step 6: Get Connection Details

Find your instance’s IP address and port:
Bash
vastai show instance YOUR_INSTANCE_ID --raw | python3 -c "
import sys, json
d = json.load(sys.stdin)
ip = d['public_ipaddr']
port = d['ports']['8000/tcp'][0]['HostPort']
print(f'API endpoint: http://{ip}:{port}')
"
Text
API endpoint: http://INSTANCE_IP:EXTERNAL_PORT
Next, retrieve the authentication token. The instance automatically generates an OPEN_BUTTON_TOKEN that protects the API. SSH into the instance to get it:
Bash
vastai ssh-url YOUR_INSTANCE_ID
Text
ssh://root@ssh5.vast.ai:33426
Bash
ssh -p 33426 root@ssh5.vast.ai 'echo $OPEN_BUTTON_TOKEN'
Text
ebc1e4b9922bd49aacfb54bba36259c801f5c4d9edaace7576f9b1ecd067559d
Save this token — you will need it for all API requests and for the OpenClaw configuration. Verify the API is responding:
Bash
curl -s http://INSTANCE_IP:EXTERNAL_PORT/v1/models \
    -H "Authorization: Bearer YOUR_OPEN_BUTTON_TOKEN"
JSON
{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen3-8B",
    "object": "model",
    "owned_by": "vllm",
    "max_model_len": 32000
  }]
}
Test a chat completion:
Bash
curl -s http://INSTANCE_IP:EXTERNAL_PORT/v1/chat/completions \
    -H "Authorization: Bearer YOUR_OPEN_BUTTON_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Who are you? Introduce yourself briefly."}],
        "max_tokens": 256,
        "temperature": 0.6
    }'
You should see Qwen3-8B introduce itself: “I am Qwen, a large language model developed by Alibaba Cloud.”
Qwen3-8B includes a thinking mode by default. The response may contain <think>...</think> reasoning tokens before the final answer. This is expected behavior.
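If you consume the API directly and only want the final answer, you can strip the reasoning block client-side. A minimal sketch, assuming the reasoning arrives as a single <think>...</think> span at the start of the reply:

```python
import re

# Remove a <think>...</think> reasoning block (plus trailing whitespace)
# from a Qwen3 reply, leaving only the final answer.
def strip_thinking(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

reply = "<think>The user wants a short intro.</think>I am Qwen, a large language model developed by Alibaba Cloud."
print(strip_thinking(reply))
# → I am Qwen, a large language model developed by Alibaba Cloud.
```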

Step 7: Configure OpenClaw

Set the vLLM API key environment variable to the OPEN_BUTTON_TOKEN from Step 6:
Bash
export VLLM_API_KEY="YOUR_OPEN_BUTTON_TOKEN"
Create the OpenClaw configuration directory and file:
Bash
mkdir -p ~/.openclaw
Create ~/.openclaw/openclaw.json:
JSON
{
  "models": {
    "providers": {
      "vllm": {
        "baseUrl": "http://INSTANCE_IP:EXTERNAL_PORT/v1",
        "apiKey": "${VLLM_API_KEY}",
        "api": "openai-completions",
        "models": [
          {
            "id": "Qwen/Qwen3-8B",
            "name": "Qwen3 8B on Vast",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 32000,
            "maxTokens": 4096
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "vllm/Qwen/Qwen3-8B" }
    }
  }
}
Replace INSTANCE_IP:EXTERNAL_PORT with the values from Step 6. Key configuration fields:
Field | Purpose
--- | ---
baseUrl | Your vLLM API endpoint from Step 6
apiKey | Reads the VLLM_API_KEY environment variable at runtime
api | Protocol to use — openai-completions for vLLM’s OpenAI-compatible API
reasoning | Set to false to disable structured reasoning (Qwen3’s thinking mode is separate)
contextWindow | Must match the --max-model-len value from Step 4
maxTokens | Maximum tokens per response
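OpenClaw resolves the ${VLLM_API_KEY} placeholder from your environment when it loads the config. A sketch of equivalent expansion logic, useful for checking what value the config will resolve to — the regex substitution here is an illustration, not OpenClaw's actual implementation:

```python
import json
import os
import re

# Expand ${VAR} placeholders in a config string from the environment.
def expand_env(config_text: str) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), config_text)

os.environ["VLLM_API_KEY"] = "ebc1e4b9..."  # example value, not a real token
raw = '{"apiKey": "${VLLM_API_KEY}"}'
print(json.loads(expand_env(raw))["apiKey"])  # → ebc1e4b9...
```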
Verify OpenClaw can see the model:
Bash
openclaw models list
Text
Model                                      Input      Ctx      Local Auth  Tags
vllm/Qwen/Qwen3-8B                         text       31k      no    yes   default

Step 8: Test OpenClaw

Send a message through OpenClaw to the vLLM backend:
Bash
openclaw agent --local --session-id test \
    --message "Who are you? Introduce yourself briefly." \
    --thinking off
Text
I am an AI assistant created by OpenClaw.
The --thinking off flag disables Qwen3’s reasoning mode. Without it, responses may include <think>...</think> tokens before the answer. You now have a private AI assistant powered by your own GPU — no third-party API keys required. From here, you can start an interactive session, connect additional tools, or swap in a different model.

Troubleshooting

Instance stuck in “loading”

If the instance stays in loading for more than 15 minutes, it may have failed silently. Destroy it and try a different offer from Step 3:
Bash
vastai destroy instance YOUR_INSTANCE_ID

“Model context window too small” error

OpenClaw requires a minimum context window of 16,000 tokens. If you see this error, check that --max-model-len in the vLLM creation command is set to at least 32000. OpenClaw’s system prompt and tool schemas consume approximately 12,000–13,000 tokens, so the model needs enough remaining context for your messages and responses.
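The token budget works out as follows (taking the upper end of the overhead range above):

```python
# Context budget: model window minus OpenClaw's fixed prompt overhead.
max_model_len = 32000
overhead = 13000                 # system prompt + tool schemas, upper estimate
remaining = max_model_len - overhead
print(remaining)                 # → 19000 tokens left for messages and responses
```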

“auto tool choice requires --enable-auto-tool-choice” error

OpenClaw uses tool calling by default. Add --enable-auto-tool-choice --tool-call-parser hermes to the vLLM creation command.

“LLM request timed out” with newer OpenClaw versions

OpenClaw versions after 2026.2.13 have a known bug in the embedded agent’s streaming response path. The vLLM server generates tokens correctly, but OpenClaw’s client never commits the assistant payload, causing a timeout after ~30 seconds. Direct curl requests to the same endpoint work fine. To fix this, downgrade to the compatible version:
Bash
npm install -g openclaw@2026.2.13

Context overflow errors

If you see “Context overflow: prompt too large for the model”, the conversation has exceeded the model’s context window. Start a fresh session:
Bash
openclaw agent --local --session-id new-session \
    --message "Your message here" \
    --thinking off

Cleanup

When you’re done, destroy the instance to stop billing:
Bash
vastai destroy instance YOUR_INSTANCE_ID
