# vLLM
The vLLM serverless template can be used to run inference on LLMs on Vast GPU instances. This page documents the required environment variables and the endpoints needed to get started. A full {{pyworker}} and client implementation, which implements the endpoints below, can be found here.

## Environment Variables

- `HF_TOKEN` (string): Hugging Face API token with read permissions, used to download gated models. Read more about Hugging Face tokens here. This token should be added to your Vast user account's environment variables; the Getting Started guide shows this in Step 1.
- `MODEL_NAME` (string): Name of the model to be used for inference. Supported Hugging Face models are listed here: https://huggingface.co/docs/text-generation-inference/en/supported_models
- `VLLM_ARGS` (string): vLLM-specific arguments that are already pre-set in the template.

Some models on Hugging Face require you to accept their terms and conditions on your Hugging Face account before use. For such models, this must be done before using them with a Vast template.

## Endpoints

### /v1/completions/

This endpoint generates a text completion that attempts to match any context or pattern provided in a given prompt. Provide a text prompt, and the model returns the predicted continuation. This endpoint is best suited for single-turn tasks, whereas the /v1/chat/completions endpoint is optimized for multi-turn conversational scenarios.

#### Inputs

**auth_data**

- `signature` (string): A cryptographic string that authenticates the `url`, `cost`, and `reqnum` fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
- `cost` (float): The estimated compute resources for the request. The units of this cost are defined by the {{pyworker}}.
- `endpoint` (string): Name of the endpoint.
- `reqnum` (int): The request number corresponding to this worker instance. Workers expect to receive requests in approximately the same order as these reqnums, but some flexibility is allowed due to potential out-of-order requests caused by concurrency or small delays on the proxy server.
- `url` (string): The address of the worker instance to send the request to.

**payload.input**

- `model` (string): The specific identifier of the model to be used for generating the text completion.
- `prompt` (optional, string): The input text that the model will use as a starting point to generate a response. Default is "hello".
- `max_tokens` (optional, int): The maximum number of tokens the model will generate for the response. Default is 256.
- `temperature` (optional, float): A value between 0 and 2 that controls the randomness of the output. Higher values produce more creative and less predictable responses, while lower values make the output more deterministic. Default is 0.7.
- `top_k` (optional, int): An integer that restricts the model to sampling from the k most likely tokens at each step of the generation process. Default is 20.
- `top_p` (optional, float): A float between 0 and 1 that controls nucleus sampling: the model considers only the most probable tokens whose cumulative probability exceeds p. Default is 0.4.
- `stream` (optional, bool): A boolean flag that determines the response format. If true, the server sends back a stream of token-by-token events as they are generated; if false, it sends the full completion in a single response once it has finished. Default is false.

Example request:

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "your-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "input": {
      "prompt": "The capital of the United States is",
      "model": "Qwen/Qwen3-8B",
      "max_tokens": 256,
      "temperature": 0.7,
      "top_k": 20,
      "top_p": 0.4,
      "stream": false
    }
  }
}
```

Depending on the model being used, additional parameters such as `temperature` or `top_p` may be supported. Passing these values in the request forwards them to the model, but they are not required. All parameters can be found in the CompletionConfig class in client.py.
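As a concrete illustration, here is a minimal client sketch that sends the request above to a worker with Python's `requests` library. It assumes you have already obtained `auth_data` from the route endpoint for your serverless endpoint group; all values shown are placeholders, and the full reference client lives in client.py.

```python
import requests

# auth_data as returned by the route endpoint (placeholder values)
auth_data = {
    "signature": "base64-signature-from-route-endpoint",
    "cost": 256,
    "endpoint": "your-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
}

payload = {
    "input": {
        "prompt": "The capital of the United States is",
        "model": "Qwen/Qwen3-8B",
        "max_tokens": 256,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.4,
        "stream": False,
    }
}

# POST the combined body to the worker's /v1/completions/ endpoint
resp = requests.post(
    f"{auth_data['url']}/v1/completions/",
    json={"auth_data": auth_data, "payload": payload},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```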
#### Outputs

- `id` (string): A unique identifier for the completion request.
- `object` (string): The type of object returned. For completions, this is always `text_completion`.
- `created` (int): The Unix timestamp (in seconds) of when the completion was created.
- `model` (string): The name of the model that generated the response.

**choices**

- `index` (int): The index of the choice in the list (e.g., 0 for the first choice).
- `text` (string): The generated text for this completion choice.
- `logprobs` (object): Null unless you requested log probabilities. If requested, it contains the log probabilities of the generated tokens.
- `finish_reason` (string): The reason the model stopped generating text. Common values include `length` (reached max_tokens), `stop` (encountered a stop sequence), or `tool_calls`.
- `stop_reason` (string): Provides a more specific reason for why the model stopped, often related to internal model logic. It can be null if not applicable.
- `prompt_logprobs` (object): Similar to `logprobs`, but for the tokens in the initial prompt. It is null unless specifically requested.

**usage**

- `prompt_tokens` (int): The number of tokens in the input prompt.
- `total_tokens` (int): The total number of tokens used in the request (prompt + completion).
- `completion_tokens` (int): The number of tokens in the generated completion.
- `prompt_tokens_details` (object): Provides a more detailed breakdown of prompt tokens. It is null unless requested.

**kv_transfer_params**

- `kv_transfer_params` (object): An extension field (outside the official OpenAI spec) that carries the metadata needed to reuse or move the model's key/value (KV) cache instead of recomputing it on every call.

Example response:

```json
{
  "id": "cmpl-7bd54bc0b3f4d48abf3fe4fa3c11f8b",
  "object": "text_completion",
  "created": 1754334436,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "text": " Washington, D.C.",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 262,
    "completion_tokens": 256,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
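The `finish_reason` and `usage` fields make it easy to detect truncated completions on the client side. A minimal helper sketch follows; the function name is illustrative and not part of the template, and `result` is the parsed JSON body of a /v1/completions/ response such as `resp.json()` from the example above.

```python
def summarize_completion(result: dict) -> str:
    """Inspect a parsed /v1/completions/ response and flag truncation."""
    choice = result["choices"][0]
    usage = result["usage"]
    if choice["finish_reason"] == "length":
        # The model stopped because it hit max_tokens, so the completion may be
        # cut off; consider retrying with a larger max_tokens value.
        return (
            f"Truncated after {usage['completion_tokens']} completion tokens "
            f"({usage['total_tokens']} total)."
        )
    return f"Finished ({choice['finish_reason']}): {choice['text']!r}"
```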
### /v1/chat/completions/

This endpoint generates a model response for a given conversational history. Unlike the /v1/completions/ endpoint, which is designed to continue a single text prompt, the chat endpoint excels at multi-turn dialogues: by providing a sequence of messages, each with a designated role (system, user, or assistant), you can simulate a conversation, and the model will generate the next appropriate message from the assistant.

Not all LLMs will work with this endpoint. The model must be fine-tuned to understand messages and tools. The default model used in the Vast template will work.

#### Inputs

**auth_data**

Same fields as for /v1/completions/ above: `signature` (string), `cost` (float), `endpoint` (string), `reqnum` (int), and `url` (string).

**payload.input**

- `model` (string): The specific identifier of the model to be used for generating the response.
- `messages` (array): A list of message objects that form the conversation history.
  - `role` (string): The role of the message author; can be system, user, or assistant.
  - `content` (string): The content of the message.
- `max_tokens` (optional, int): The maximum number of tokens the model will generate for the response. Default is 256.
- `temperature` (optional, float): A value between 0 and 2 that controls the randomness of the output. Higher values produce more creative and less predictable responses, while lower values make the output more deterministic. Default is 0.7.
- `top_k` (optional, int): An integer that restricts the model to sampling from the k most likely tokens at each step of the generation process. Default is 20.
- `top_p` (optional, float): A float between 0 and 1 that controls nucleus sampling: the model considers only the most probable tokens whose cumulative probability exceeds p. Default is 0.4.
- `stream` (optional, bool): A boolean flag that determines the response format. If true, the server sends back a stream of token-by-token events as they are generated; if false, it sends the full completion in a single response once it has finished. Default is false.
- `tools` (optional, list[dict[str, any]]): A list of function definitions that the model can call to perform external actions. When a relevant tool is detected in the user's prompt, the model can generate a JSON object with the function name and arguments to call. Your code can then execute this function and return the output to the model to continue the conversation.
- `tool_choice` (optional, string): Controls how the model uses the functions provided in the tools list. It can be set to "none" to prevent the model from using any tools, "auto" to let the model decide when to call a function, or you can force the model to call a specific function by providing an object like `{"type": "function", "function": {"name": "my_function_name"}}`.

The `max_tokens` parameter, rather than the size of `messages`, determines performance. For example, if an instance is benchmarked at 100 tokens per second, a request with max_tokens = 200 will take approximately 2 seconds to complete.

Example request:

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 2096,
    "endpoint": "your-openai-endpoint-name",
    "reqnum": 1234567893,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "input": {
      "model": "Qwen/Qwen3-8B",
      "messages": [
        {
          "role": "user",
          "content": "What's the weather like in LA today?"
        }
      ],
      "max_tokens": 256,
      "temperature": 0.7,
      "top_k": 40,
      "top_p": 0.9,
      "stream": false,
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {
                  "type": "string",
                  "description": "The city and state, e.g. Los Angeles, CA"
                },
                "unit": {
                  "type": "string",
                  "enum": ["celsius", "fahrenheit"],
                  "description": "The unit of temperature"
                }
              },
              "required": ["location"]
            }
          }
        }
      ],
      "tool_choice": {
        "type": "function",
        "function": {
          "name": "get_current_weather"
        }
      }
    }
  }
}
```
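For comparison with the tool-calling request above, a plain multi-turn chat request looks like this in Python. This is a minimal sketch: it assumes `auth_data` has already been obtained from the route endpoint, and all values are placeholders.

```python
import requests

auth_data = {  # returned by the route endpoint (placeholder values)
    "signature": "base64-signature-from-route-endpoint",
    "cost": 2096,
    "endpoint": "your-openai-endpoint-name",
    "reqnum": 1234567893,
    "url": "http://worker-ip-address:port",
}

# A short conversation history: a system instruction plus the latest user message.
payload = {
    "input": {
        "model": "Qwen/Qwen3-8B",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What's the capital of the United States?"},
        ],
        "max_tokens": 256,
        "temperature": 0.7,
        "stream": False,
    }
}

resp = requests.post(
    f"{auth_data['url']}/v1/chat/completions/",
    json={"auth_data": auth_data, "payload": payload},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```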
#### Outputs

- `id` (string): A unique identifier for the completion request.
- `object` (string): The type of object returned. For chat completions, this is always `chat.completion`.
- `created` (int): The Unix timestamp (in seconds) of when the completion was created.
- `model` (string): The name of the model that generated the response.

**choices**

- `index` (int): The index of the choice in the list (e.g., 0 for the first choice).
- `message` (object): A message object generated by the model.
  - `role` (string): The role of the message author; can be system, user, or assistant.
  - `content` (string): The content of the message.
  - `tool_calls` (array): Contains the function call(s) the model wants to execute. The `arguments` field is a JSON string containing the parameters extracted from the user's prompt.
- `finish_reason` (string): The reason the model stopped generating text. Common values include `length` (reached max_tokens), `stop` (encountered a stop sequence), or `tool_calls`.

**usage**

- `prompt_tokens` (int): The number of tokens in the input prompt.
- `total_tokens` (int): The total number of tokens used in the request (prompt + completion).
- `completion_tokens` (int): The number of tokens in the generated completion.

Example response:

```json
{
  "id": "chatcmpl-a1b2c3d4-e5f6-7890-1234-5g6h7j8k9l0m",
  "object": "chat.completion",
  "created": 1754336000,
  "model": "Qwen/Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123xyz",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"Los Angeles, CA\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 85,
    "completion_tokens": 18,
    "total_tokens": 103
  }
}
```
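When `finish_reason` is `tool_calls`, the client is expected to execute the requested function and feed the result back to the model. The sketch below shows that loop under stated assumptions: `get_current_weather` is a hypothetical local function, and the `tool` role with a `tool_call_id` follows the common OpenAI-style convention rather than anything documented by this template, so confirm it against your model's chat template.

```python
import json

# Parsed /v1/chat/completions/ response (matches the example output above)
result = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_abc123xyz",
                        "type": "function",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": '{"location": "Los Angeles, CA"}',
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}

def get_current_weather(location: str, unit: str = "celsius") -> str:
    # Hypothetical local implementation; swap in a real weather lookup.
    return json.dumps({"location": location, "temperature": 22, "unit": unit})

message = result["choices"][0]["message"]
follow_up_messages = []

if result["choices"][0]["finish_reason"] == "tool_calls":
    for call in message["tool_calls"]:
        # `arguments` arrives as a JSON string; decode it before calling the function.
        args = json.loads(call["function"]["arguments"])
        tool_output = get_current_weather(**args)
        # Append the assistant's tool call and its result so the extended
        # conversation can be resent to /v1/chat/completions/ for a
        # natural-language answer.
        follow_up_messages.append(message)
        follow_up_messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": tool_output}
        )

print(follow_up_messages)
```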