# vLLM
The vLLM serverless template can be used to run inference on LLMs on Vast GPU instances. This page documents the required environment variables and the endpoints needed to get started. A full {{pyworker}} and client implementation, which implements the endpoints below, can be found here.

## Environment Variables

- `HF_TOKEN` (string): Hugging Face API token with read permissions, used to download gated models. Read more about Hugging Face tokens here. This token should be added to your Vast user account's environment variables; the Getting Started guide shows this in Step 1.
- `MODEL_NAME` (string): Name of the model to be used for inference. Supported Hugging Face models are listed here: https://huggingface.co/docs/text-generation-inference/en/supported_models
- `VLLM_ARGS` (string): vLLM-specific arguments that are already pre-set in the template.

Some models on Hugging Face require you to accept their terms and conditions on your Hugging Face account before use. For such models, this must be done before using them with a Vast template.

## Endpoints

### /v1/completions/

This endpoint generates a text completion that attempts to match any context or pattern provided in a given prompt. Provide a text prompt, and the model returns the predicted continuation. This endpoint is best suited for single-turn tasks, whereas the /v1/chat/completions endpoint is optimized for multi-turn conversational scenarios.

#### Inputs

**auth_data**

- `signature` (string): A cryptographic string that authenticates the `url`, `cost`, and `reqnum` fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
- `cost` (float): The estimated compute resources for the request. The units of this cost are defined by the {{pyworker}}.
- `endpoint` (string): Name of the endpoint.
- `reqnum` (int): The request number corresponding to this worker instance. Workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed due to potential out-of-order requests caused by concurrency or small delays on the proxy server.
- `url` (string): The address of the worker instance to send the request to.

**payload.input**

- `model` (string): The specific identifier of the model to be used for generating the text completion.
- `prompt` (optional, string): The input text that the model will use as a starting point to generate a response. Default is "hello".
- `max_tokens` (optional, int): The maximum number of tokens the model will generate for the response. Default is 256.
- `temperature` (optional, float): A value between 0 and 2 that controls the randomness of the output. Higher values produce more creative and less predictable responses, while lower values make the output more deterministic. Default is 0.7.
- `top_k` (optional, int): An integer that restricts the model to sampling from the k most likely tokens at each step of the generation process. Default is 20.
- `top_p` (optional, float): A float between 0 and 1 that controls nucleus sampling: the model considers only the most probable tokens whose cumulative probability exceeds p. Default is 0.4.
- `stream` (optional, bool): A boolean flag that determines the response format. If true, the server sends back a stream of token-by-token events as they are generated; if false, it sends the full completion in a single response once it has finished. Default is false.

Example request:

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "your-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "input": {
      "prompt": "The capital of the United States is",
      "model": "Qwen/Qwen3-8B",
      "max_tokens": 256,
      "temperature": 0.7,
      "top_k": 20,
      "top_p": 0.4,
      "stream": false
    }
  }
}
```

Depending on the model being used, additional parameters such as `temperature` or `top_p` may be supported. Passing these values in the request forwards them to the model, but they are not required. All parameters can be found in the CompletionConfig class in client.py.
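As a concrete illustration, here is a minimal client sketch that sends the request above to a worker with Python's `requests` library. It assumes you have already obtained `auth_data` from the route endpoint for your serverless endpoint group; all values shown are placeholders, and the full reference client lives in client.py.

```python
import requests

# auth_data as returned by the route endpoint (placeholder values)
auth_data = {
    "signature": "base64-signature-from-route-endpoint",
    "cost": 256,
    "endpoint": "your-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
}

payload = {
    "input": {
        "prompt": "The capital of the United States is",
        "model": "Qwen/Qwen3-8B",
        "max_tokens": 256,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.4,
        "stream": False,
    }
}

# POST the combined body to the worker's /v1/completions/ endpoint
resp = requests.post(
    f"{auth_data['url']}/v1/completions/",
    json={"auth_data": auth_data, "payload": payload},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```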
#### Outputs

- `id` (string): A unique identifier for the completion request.
- `object` (string): The type of object returned. For completions, this is always `text_completion`.
- `created` (int): The Unix timestamp (in seconds) of when the completion was created.
- `model` (string): The name of the model that generated the response.

**choices**

- `index` (int): The index of the choice in the list (e.g., 0 for the first choice).
- `text` (string): The generated text for this completion choice.
- `logprobs` (object): Null unless you requested log probabilities. If requested, it contains the log probabilities of the generated tokens.
- `finish_reason` (string): The reason the model stopped generating text. Common values include `length` (reached max_tokens), `stop` (encountered a stop sequence), or `tool_calls`.
- `stop_reason` (string): Provides a more specific reason for why the model stopped, often related to internal model logic. It can be null if not applicable.
- `prompt_logprobs` (object): Similar to `logprobs`, but for the tokens in the initial prompt. It is null unless specifically requested.

**usage**

- `prompt_tokens` (int): The number of tokens in the input prompt.
- `total_tokens` (int): The total number of tokens used in the request (prompt + completion).
- `completion_tokens` (int): The number of tokens in the generated completion.
- `prompt_tokens_details` (object): Provides a more detailed breakdown of prompt tokens. It is null unless requested.

**kv_transfer_params**

- `kv_transfer_params` (object): An extension field (outside the official OpenAI spec) that carries the metadata needed to reuse or move the model's key/value (KV) cache instead of recomputing it on every call.

Example response:

```json
{
  "id": "cmpl-7bd54bc0b3f4d48abf3fe4fa3c11f8b",
  "object": "text_completion",
  "created": 1754334436,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "text": " Washington, D.C.",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 262,
    "completion_tokens": 256,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
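The `finish_reason` and `usage` fields make it easy to detect truncated completions on the client side. A minimal helper sketch follows; the function name is illustrative and not part of the template, and `result` is the parsed JSON body of a /v1/completions/ response such as `resp.json()` from the example above.

```python
def summarize_completion(result: dict) -> str:
    """Inspect a parsed /v1/completions/ response and flag truncation."""
    choice = result["choices"][0]
    usage = result["usage"]
    if choice["finish_reason"] == "length":
        # The model stopped because it hit max_tokens, so the completion may be
        # cut off; consider retrying with a larger max_tokens value.
        return (
            f"Truncated after {usage['completion_tokens']} completion tokens "
            f"({usage['total_tokens']} total)."
        )
    return f"Finished ({choice['finish_reason']}): {choice['text']!r}"
```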
### /v1/chat/completions/

This endpoint generates a model response for a given conversational history. Unlike the /v1/completions/ endpoint, which is designed to continue a single text prompt, the chat endpoint excels at multi-turn dialogues: by providing a sequence of messages, each with a designated role (system, user, or assistant), you can simulate a conversation, and the model will generate the next appropriate message from the assistant.

Not all LLMs will work with this endpoint. The model must be fine-tuned to understand messages and tools. The default model used in the Vast template will work.

#### Inputs

**auth_data**

Same fields as for /v1/completions/ above: `signature` (string), `cost` (float), `endpoint` (string), `reqnum` (int), and `url` (string).

**payload.input**

- `model` (string): The specific identifier of the model to be used for generating the response.
- `messages` (array): A list of message objects that form the conversation history.
  - `role` (string): The role of the message author; can be system, user, or assistant.
  - `content` (string): The content of the message.
- `max_tokens` (optional, int): The maximum number of tokens the model will generate for the response. Default is 256.
- `temperature` (optional, float): A value between 0 and 2 that controls the randomness of the output. Higher values produce more creative and less predictable responses, while lower values make the output more deterministic. Default is 0.7.
- `top_k` (optional, int): An integer that restricts the model to sampling from the k most likely tokens at each step of the generation process. Default is 20.
- `top_p` (optional, float): A float between 0 and 1 that controls nucleus sampling: the model considers only the most probable tokens whose cumulative probability exceeds p. Default is 0.4.
- `stream` (optional, bool): A boolean flag that determines the response format. If true, the server sends back a stream of token-by-token events as they are generated; if false, it sends the full completion in a single response once it has finished. Default is false.
- `tools` (optional, list[dict[str, any]]): A list of function definitions that the model can call to perform external actions. When a relevant tool is detected in the user's prompt, the model can generate a JSON object with the function name and arguments to call. Your code can then execute this function and return the output to the model to continue the conversation.
- `tool_choice` (optional, string): Controls how the model uses the functions provided in the tools list. It can be set to "none" to prevent the model from using any tools, "auto" to let the model decide when to call a function, or you can force the model to call a specific function by providing an object like `{"type": "function", "function": {"name": "my_function_name"}}`.

The `max_tokens` parameter, rather than the size of `messages`, determines performance. For example, if an instance is benchmarked at 100 tokens per second, a request with max_tokens = 200 will take approximately 2 seconds to complete.

Example request:

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 2096,
    "endpoint": "your-openai-endpoint-name",
    "reqnum": 1234567893,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "input": {
      "model": "Qwen/Qwen3-8B",
      "messages": [
        {
          "role": "user",
          "content": "What's the weather like in LA today?"
        }
      ],
      "max_tokens": 256,
      "temperature": 0.7,
      "top_k": 40,
      "top_p": 0.9,
      "stream": false,
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {
                  "type": "string",
                  "description": "The city and state, e.g. Los Angeles, CA"
                },
                "unit": {
                  "type": "string",
                  "enum": ["celsius", "fahrenheit"],
                  "description": "The unit of temperature"
                }
              },
              "required": ["location"]
            }
          }
        }
      ],
      "tool_choice": {
        "type": "function",
        "function": {
          "name": "get_current_weather"
        }
      }
    }
  }
}
```
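For comparison with the tool-calling request above, a plain multi-turn chat request looks like this in Python. This is a minimal sketch: it assumes `auth_data` has already been obtained from the route endpoint, and all values are placeholders.

```python
import requests

auth_data = {  # returned by the route endpoint (placeholder values)
    "signature": "base64-signature-from-route-endpoint",
    "cost": 2096,
    "endpoint": "your-openai-endpoint-name",
    "reqnum": 1234567893,
    "url": "http://worker-ip-address:port",
}

# A short conversation history: a system instruction plus the latest user message.
payload = {
    "input": {
        "model": "Qwen/Qwen3-8B",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What's the capital of the United States?"},
        ],
        "max_tokens": 256,
        "temperature": 0.7,
        "stream": False,
    }
}

resp = requests.post(
    f"{auth_data['url']}/v1/chat/completions/",
    json={"auth_data": auth_data, "payload": payload},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```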
#### Outputs

- `id` (string): A unique identifier for the completion request.
- `object` (string): The type of object returned. For chat completions, this is always `chat.completion`.
- `created` (int): The Unix timestamp (in seconds) of when the completion was created.
- `model` (string): The name of the model that generated the response.

**choices**

- `index` (int): The index of the choice in the list (e.g., 0 for the first choice).
- `message` (object): A message object generated by the model.
  - `role` (string): The role of the message author; can be system, user, or assistant.
  - `content` (string): The content of the message.
  - `tool_calls` (array): Contains the function call(s) the model wants to execute. The `arguments` field is a JSON string containing the parameters extracted from the user's prompt.
- `finish_reason` (string): The reason the model stopped generating text. Common values include `length` (reached max_tokens), `stop` (encountered a stop sequence), or `tool_calls`.

**usage**

- `prompt_tokens` (int): The number of tokens in the input prompt.
- `total_tokens` (int): The total number of tokens used in the request (prompt + completion).
- `completion_tokens` (int): The number of tokens in the generated completion.

Example response:

```json
{
  "id": "chatcmpl-a1b2c3d4-e5f6-7890-1234-5g6h7j8k9l0m",
  "object": "chat.completion",
  "created": 1754336000,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123xyz",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"Los Angeles, CA\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 85,
    "completion_tokens": 18,
    "total_tokens": 103
  }
}
```
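When `finish_reason` is `tool_calls`, the client is expected to execute the requested function and feed the result back to the model. The sketch below shows that loop under stated assumptions: `get_current_weather` is a hypothetical local function, and the `tool` role with a `tool_call_id` follows the common OpenAI-style convention rather than anything documented by this template, so confirm it against your model's chat template.

```python
import json

# Parsed /v1/chat/completions/ response (matches the example output above)
result = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_abc123xyz",
                        "type": "function",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": '{"location": "Los Angeles, CA"}',
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}

def get_current_weather(location: str, unit: str = "celsius") -> str:
    # Hypothetical local implementation; swap in a real weather lookup.
    return json.dumps({"location": location, "temperature": 22, "unit": unit})

message = result["choices"][0]["message"]
follow_up_messages = []

if result["choices"][0]["finish_reason"] == "tool_calls":
    for call in message["tool_calls"]:
        # `arguments` arrives as a JSON string; decode it before calling the function.
        args = json.loads(call["function"]["arguments"])
        tool_output = get_current_weather(**args)
        # Append the assistant's tool call and its result so the extended
        # conversation can be resent to /v1/chat/completions/ for a
        # natural-language answer.
        follow_up_messages.append(message)
        follow_up_messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": tool_output}
        )

print(follow_up_messages)
```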