
Deploy Qwen3-8B on a Vast.ai GPU with vLLM and connect OpenClaw to it for private, self-hosted AI conversations with tool use.

Overview

OpenClaw is an open-source AI assistant that runs locally on your machine. It supports multiple model providers through an OpenAI-compatible API, including self-hosted models via vLLM. In this guide, you will:
  1. Launch a vLLM inference server on a Vast.ai GPU serving Qwen3-8B
  2. Install and configure OpenClaw locally to connect to the remote vLLM server
  3. Send messages through OpenClaw and receive responses from Qwen3-8B
This gives you a private AI assistant powered by your own GPU instance, with no API keys from third-party providers needed.

Requirements

This guide creates a paid GPU instance that bills by the hour. An RTX 3090 typically costs $0.15-0.20/hr; following this guide end-to-end takes about 10 minutes and costs less than $0.05. Remember to destroy the instance when you’re done; see Cleanup.

Step 1: Install the Vast.ai CLI

Bash
pip install --upgrade vastai
vastai set api-key YOUR_API_KEY
Verify the CLI is working:
Bash
vastai show user
You should see your account details and credit balance.

Step 2: Install OpenClaw

Bash
npm install -g openclaw@2026.2.13
Verify the installation:
Bash
openclaw --version
Text
2026.2.13
OpenClaw requires Node.js 22.12.0 or later. If you see a version error, update Node.js or use nvm to install a compatible version.
This guide requires OpenClaw 2026.2.13. Later versions have a known bug where the embedded agent times out when connecting to self-hosted OpenAI-compatible backends like vLLM, even though the server is responding correctly. If you have a newer version installed, downgrade with npm install -g openclaw@2026.2.13.
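If you hit the Node.js version requirement, a quick check and fix with nvm looks like this (a sketch; assumes nvm is already installed):
Bash
node --version        # must report v22.12.0 or later
nvm install 22.12.0   # install a compatible version if needed
nvm use 22.12.0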

Step 3: Find a GPU Instance

Search for an RTX 3090 with direct port access:
Bash
vastai search offers \
    "gpu_name = RTX_3090 num_gpus = 1 direct_port_count >= 1 cuda_vers >= 13.0" \
    --order "dph_base" --limit 5
The results show available machines sorted by price. Note the ID in the first column; you will use it in the next step. The RTX 3090 (24GB VRAM) is the minimum GPU for Qwen3-8B: the model requires ~15 GiB of VRAM, leaving ~4.5 GiB for KV cache.
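If you want to script this step, you can capture the cheapest offer ID directly (a sketch; assumes the table layout shown above, with headers on the first line and the ID in the first column):
Bash
# Grab the ID of the cheapest matching offer (first data row, first column).
OFFER_ID=$(vastai search offers \
    "gpu_name = RTX_3090 num_gpus = 1 direct_port_count >= 1 cuda_vers >= 13.0" \
    --order "dph_base" --limit 1 | awk 'NR==2 {print $1}')
echo "Offer ID: $OFFER_ID"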

Step 4: Deploy vLLM with Qwen3-8B

Create an instance using the offer ID from Step 3. The offer ID is in the first column (ID) of the search results.
Bash
vastai create instance YOUR_OFFER_ID \
    --image vastai/vllm:v0.16.0-cuda-12.9 \
    --env '-p 1111:1111 -p 8080:8080 -p 8000:8000 -p 8265:8265 -e OPEN_BUTTON_PORT=1111 -e OPEN_BUTTON_TOKEN=1 -e JUPYTER_DIR=/ -e DATA_DIRECTORY=/workspace/ -e PORTAL_CONFIG="localhost:1111:11111:/:Instance Portal|localhost:8000:18000:/docs:vLLM API|localhost:8265:28265:/:Ray Dashboard|localhost:8080:18080:/:Jupyter|localhost:8080:8080:/terminals/1:Jupyter Terminal" -e VLLM_MODEL=Qwen/Qwen3-8B -e VLLM_ARGS="--max-model-len 32000 --dtype auto --enable-auto-tool-choice --tool-call-parser hermes --host 127.0.0.1 --port 18000" -e AUTO_PARALLEL=true -e RAY_ADDRESS=127.0.0.1 -e RAY_ARGS="--head --port 6379 --dashboard-host 127.0.0.1 --dashboard-port 28265"' \
    --onstart-cmd 'entrypoint.sh' \
    --disk 50
Replace YOUR_OFFER_ID with the ID from Step 3 (e.g., 12345678).
Text
Started. {'success': True, 'new_contract': 98765432, 'instance_api_key': 'a1b2c3...'}
Note the new_contract value; this is your instance ID, which is different from the offer ID. You will use the instance ID in the remaining steps. This command uses Vast’s vLLM image, which includes a reverse proxy that automatically generates an authentication token (OPEN_BUTTON_TOKEN) for your instance. Key environment variables:
Variable                     Purpose
VLLM_MODEL                   Hugging Face model to serve
VLLM_ARGS                    Arguments passed to vllm serve
--max-model-len 32000        Maximum context length for RTX 3090
--enable-auto-tool-choice    Required for OpenClaw tool calling
--tool-call-parser hermes    Tool call format compatible with Qwen3
OPEN_BUTTON_TOKEN=1          Tells the image to generate an authentication token
The --max-model-len value of 32000 is tuned for the RTX 3090. The model uses ~15 GiB of VRAM, leaving ~4.5 GiB for KV cache. Using 32768 (Qwen3-8B’s native context) will fail with an out-of-memory error.
If the command returns success: False, the machine may be unavailable. Try a different offer ID from Step 3.
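If you scripted the create command and saved its output, you can extract the instance ID as well (a sketch; assumes the Started. {...} line shown above is stored in a $CREATE_OUTPUT variable you set yourself):
Bash
# Pull the new_contract value (the instance ID) out of the saved create output.
INSTANCE_ID=$(echo "$CREATE_OUTPUT" | grep -o "'new_contract': [0-9]*" | tr -dc '0-9')
echo "Instance ID: $INSTANCE_ID"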

Step 5: Wait for Model Loading

Replace YOUR_INSTANCE_ID with the new_contract value from Step 4 (e.g., 98765432). Wait for the status to show running:
Bash
vastai show instance YOUR_INSTANCE_ID
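If you prefer not to re-run the command by hand, a simple polling loop works (a sketch; assumes the word running appears in the status column of the output):
Bash
# Poll every 15 seconds until the instance reports "running".
until vastai show instance YOUR_INSTANCE_ID | grep -q running; do
    echo "Still waiting for the instance to start..."
    sleep 15
done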
Once the instance is running, SSH in and watch the vLLM log until you see “Application startup complete.”:
Bash
vastai ssh-url YOUR_INSTANCE_ID
Bash
ssh -p PORT root@HOST 'tail -f /var/log/portal/vllm.log'
Replace PORT and HOST with the values from the ssh-url output (e.g., ssh://root@ssh5.vast.ai:33426 means HOST=ssh5.vast.ai and PORT=33426). vLLM will download the model weights (~16 GB), then initialize the GPU and start the API server. This typically takes 3-8 minutes depending on download speed. Press Ctrl+C to stop watching once you see the startup message.
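You can also wait for the startup message non-interactively (a sketch; assumes the log path above and that the file is readable once vLLM begins writing):
Bash
# Block until the vLLM API server reports it is ready.
ssh -p PORT root@HOST \
    'until grep -q "Application startup complete" /var/log/portal/vllm.log 2>/dev/null; do sleep 10; done; echo ready'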

Step 6: Get Connection Details

Find your instance’s IP address and port:
Bash
vastai show instance YOUR_INSTANCE_ID --raw | python3 -c "
import sys, json
d = json.load(sys.stdin)
ip = d['public_ipaddr']
port = d['ports']['8000/tcp'][0]['HostPort']
print(f'API endpoint: http://{ip}:{port}')
"
Text
API endpoint: http://INSTANCE_IP:EXTERNAL_PORT
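To reuse the endpoint in the commands below, you can export it in one step (a sketch; mirrors the --raw fields used above):
Bash
export VLLM_ENDPOINT=$(vastai show instance YOUR_INSTANCE_ID --raw | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f\"http://{d['public_ipaddr']}:{d['ports']['8000/tcp'][0]['HostPort']}\")
")
echo "$VLLM_ENDPOINT"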
Next, retrieve the authentication token. The instance automatically generates an OPEN_BUTTON_TOKEN that protects the API. SSH into the instance to get it:
Bash
vastai ssh-url YOUR_INSTANCE_ID
Text
ssh://root@ssh5.vast.ai:33426
Bash
ssh -p 33426 root@ssh5.vast.ai 'echo $OPEN_BUTTON_TOKEN'
Text
ebc1e4b9922bd49aacfb54bba36259c801f5c4d9edaace7576f9b1ecd067559d
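To capture the token into a shell variable for the commands below (a sketch; substitute the host and port from your own ssh-url output):
Bash
export OPEN_BUTTON_TOKEN=$(ssh -p 33426 root@ssh5.vast.ai 'echo $OPEN_BUTTON_TOKEN')
echo "$OPEN_BUTTON_TOKEN"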
Save this token; you will need it for all API requests and for the OpenClaw configuration. Verify the API is responding:
Bash
curl -s http://INSTANCE_IP:EXTERNAL_PORT/v1/models \
    -H "Authorization: Bearer YOUR_OPEN_BUTTON_TOKEN"
JSON
{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen3-8B",
    "object": "model",
    "owned_by": "vllm",
    "max_model_len": 32000
  }]
}
Test a chat completion:
Bash
curl -s http://INSTANCE_IP:EXTERNAL_PORT/v1/chat/completions \
    -H "Authorization: Bearer YOUR_OPEN_BUTTON_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Who are you? Introduce yourself briefly."}],
        "max_tokens": 256,
        "temperature": 0.6
    }'
You should see Qwen3-8B introduce itself: “I am Qwen, a large language model developed by Alibaba Cloud.”
Qwen3-8B includes a thinking mode by default. The response may contain <think>...</think> reasoning tokens before the final answer. This is expected behavior.
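If you want to suppress the thinking tokens at the API level rather than in the client, newer vLLM builds accept a chat_template_kwargs field that Qwen3’s chat template understands (a sketch; assumes your vLLM version supports this field):
Bash
curl -s http://INSTANCE_IP:EXTERNAL_PORT/v1/chat/completions \
    -H "Authorization: Bearer YOUR_OPEN_BUTTON_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
        "chat_template_kwargs": {"enable_thinking": false}
    }'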

Step 7: Configure OpenClaw

Set the vLLM API key environment variable to the OPEN_BUTTON_TOKEN from Step 6:
Bash
export VLLM_API_KEY="YOUR_OPEN_BUTTON_TOKEN"
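To make the variable survive new terminal sessions, persist it in your shell profile (shown for bash; adjust the file for your shell):
Bash
echo 'export VLLM_API_KEY="YOUR_OPEN_BUTTON_TOKEN"' >> ~/.bashrc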
Create the OpenClaw configuration directory and file:
Bash
mkdir -p ~/.openclaw
Create ~/.openclaw/openclaw.json:
JSON
{
  "models": {
    "providers": {
      "vllm": {
        "baseUrl": "http://INSTANCE_IP:EXTERNAL_PORT/v1",
        "apiKey": "${VLLM_API_KEY}",
        "api": "openai-completions",
        "models": [
          {
            "id": "Qwen/Qwen3-8B",
            "name": "Qwen3 8B on Vast",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 32000,
            "maxTokens": 4096
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "vllm/Qwen/Qwen3-8B" }
    }
  }
}
Replace INSTANCE_IP:EXTERNAL_PORT with the values from Step 6. Key configuration fields:
Field            Purpose
baseUrl          Your vLLM API endpoint from Step 6
apiKey           Reads the VLLM_API_KEY environment variable at runtime
api              Protocol to use; openai-completions for vLLM’s OpenAI-compatible API
reasoning        Set to false to disable structured reasoning (Qwen3’s thinking mode is separate)
contextWindow    Must match the --max-model-len value from Step 4
maxTokens        Maximum tokens per response
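Before launching OpenClaw, it is worth checking that the file parses as valid JSON:
Bash
python3 -m json.tool ~/.openclaw/openclaw.json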
Verify OpenClaw can see the model:
Bash
openclaw models list
Text
Model                                      Input      Ctx      Local Auth  Tags
vllm/Qwen/Qwen3-8B                         text       31k      no    yes   default

Step 8: Test OpenClaw

Send a message through OpenClaw to the vLLM backend:
Bash
openclaw agent --local --session-id test \
    --message "Who are you? Introduce yourself briefly." \
    --thinking off
Text
I am an AI assistant created by OpenClaw.
The --thinking off flag disables Qwen3’s reasoning mode. Without it, responses may include <think>...</think> tokens before the answer. You now have a private AI assistant powered by your own GPU, with no third-party API keys required. From here, you can start an interactive session, connect additional tools, or swap in a different model.
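To continue the same conversation, reuse the session ID (a sketch using only the flags shown above; assumes the session ID persists conversation state between invocations):
Bash
openclaw agent --local --session-id test \
    --message "Summarize what we have discussed so far." \
    --thinking off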

Troubleshooting

Instance stuck in “loading”

If the instance stays in loading for more than 15 minutes, it may have failed silently. Destroy it and try a different offer from Step 3:
Bash
vastai destroy instance YOUR_INSTANCE_ID

“Model context window too small” error

OpenClaw requires a minimum context window of 16,000 tokens. If you see this error, check that --max-model-len in the vLLM creation command is set to at least 32000. OpenClaw’s system prompt and tool schemas consume approximately 12,000-13,000 tokens, so the model needs enough remaining context for your messages and responses.

“auto tool choice requires --enable-auto-tool-choice” error

OpenClaw uses tool calling by default. Add --enable-auto-tool-choice --tool-call-parser hermes to the vLLM creation command.

“LLM request timed out” with newer OpenClaw versions

OpenClaw versions after 2026.2.13 have a known bug in the embedded agent’s streaming response path. The vLLM server generates tokens correctly, but OpenClaw’s client never commits the assistant payload, causing a timeout after ~30 seconds. Direct curl requests to the same endpoint work fine. To fix this, downgrade to the compatible version:
Bash
npm install -g openclaw@2026.2.13

Context overflow errors

If you see “Context overflow: prompt too large for the model”, the conversation has exceeded the model’s context window. Start a fresh session:
Bash
openclaw agent --local --session-id new-session \
    --message "Your message here" \
    --thinking off

Cleanup

When you’re done, destroy the instance to stop billing:
Bash
vastai destroy instance YOUR_INSTANCE_ID
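To confirm nothing is still billing, list your remaining instances; the destroyed instance should no longer appear:
Bash
vastai show instances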

Resources