Serverless
Getting Started With Serverless
10 min
Vast.ai provides pre-made serverless templates (vLLM, ComfyUI) for popular use cases that can be used with minimal setup effort. In this guide, we will set up a serverless engine that handles inference requests to a model, namely Qwen3-8B, served with vLLM, using the pre-made Vast.ai vLLM serverless template. This prebuilt template bundles vLLM with scaling logic so you don't have to write custom orchestration code. By the end of this guide, you will be able to host the Qwen3-8B model with dynamic scaling to meet your demand.

This guide assumes knowledge of the Vast CLI; an introduction to it can be found here.

Before we start, there are a few things you will need:

- A Vast.ai account with credits
- A Vast.ai API key
- A Hugging Face account with a read-access API token

Setting Up a vLLM + Qwen3-8B Serverless Engine

Configure User Environment Variables

Navigate to the user account settings page here and expand the "Environment Variables" tab. In the Key field, add "HF_TOKEN", and in the Value field add your Hugging Face read-access token. Click the "+" button to the right of the fields, then click "Save Edits".

Prepare a Template for Our Workers

Templates encapsulate all the information required to run an application on a GPU worker, including machine parameters, Docker image, and environment variables. Navigate to the templates page, select the Serverless filter, and click the edit button on the 'vLLM (Serverless)' template.

To get this template working, we need to customize it with our own environment variables. In the Environment Variables section, paste "Qwen/Qwen3-8B" as the string value for MODEL_NAME. The public pre-configured vLLM template will not work without modification, since it does not have the model name variable set. The template will now work without any further edits, but you can make changes to suit your needs.

Vast recommends keeping your template private to avoid making your HF_TOKEN publicly known: simply click the Private button, then the Save & Use button. You should now see the Vast.ai search page with your template selected. If you intend to use the Vast CLI, click More Options on the template and select 'Copy Template Hash'; we will use this when creating the workergroup from the CLI.

Create Your Endpoint

Next we will create an {{endpoint}} that any user can query for generation. This can be done through the web UI or the Vast CLI. Here, we'll create an endpoint named 'vLLM-Qwen3-8B'.

Navigate to the Serverless page and click Create Endpoint. A screen to create a new endpoint will pop up, with default values already assigned. Our endpoint will work with these default values, but you can change them to suit your needs:

- Endpoint Name: The name of the endpoint.
- Cold Mult: The multiple of the current load that is used to predict the future load. For example, if we currently have 10 users but expect there to be 20 in the near future, we can set Cold Mult = 2. For LLMs, a good default is 2.
- Min Load: The baseline amount of load (tokens/second for LLMs) we want the endpoint to be able to handle. For LLMs, a good default is 100.0.
- Target Util: The percentage of the endpoint's compute resources that we want to be in use at any given time. A lower value allows for more slack, which means the endpoint will be less likely to be overwhelmed if there is a sudden spike in usage. For LLMs, a good default is 0.9.
- Max Workers: The maximum number of workers the endpoint can have at any one time.
- Cold Workers: The minimum number of workers kept "cold" (meaning stopped but fully loaded with your image) when the endpoint has no load. Having cold workers available allows the serverless system to seamlessly spin up more workers as load increases.
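To build intuition for how Cold Mult, Min Load, and Target Util combine, here is a rough sketch of the sizing logic they describe. This is only an illustration based on the parameter descriptions above, not the serverless engine's exact autoscaling formula:

```python
# Rough illustration of how Cold Mult, Min Load, and Target Util interact.
# This mirrors the parameter descriptions above; it is not the engine's
# exact formula.

def estimated_capacity_target(current_load_tps: float,
                              cold_mult: float = 2.0,
                              min_load: float = 100.0,
                              target_util: float = 0.9) -> float:
    """Tokens/second of capacity the endpoint would aim to keep available."""
    predicted_load = max(current_load_tps * cold_mult, min_load)
    # Only target_util of the provisioned capacity should be busy, so the
    # endpoint keeps extra headroom beyond the predicted load.
    return predicted_load / target_util

# Example: the endpoint currently serves ~300 tokens/s and expects 2x growth.
print(estimated_capacity_target(300.0))  # ~666.7 tokens/s of capacity
```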
Click Create, and you will be taken back to the Serverless page. After a few moments, the endpoint will show up with the name 'vLLM-Qwen3-8B'.

If your machine is properly configured for the Vast CLI, you can instead run the following command:

CLI Command
```
vastai create endpoint --endpoint_name "vLLM-Qwen3-8B" --cold_mult 1.0 --min_load 100 --target_util 0.9 --max_workers 20 --cold_workers 5
```

- endpoint_name: The name you use to identify your endpoint.
- cold_mult: The multiple of your current load that is used to predict your future load. For example, if you currently have 10 users but expect there to be 20 in the near future, you can set cold_mult = 2.0. For LLMs, a good default is 2.0.
- min_load: The baseline amount of load (tokens/second for LLMs) you want your endpoint to be able to handle. For LLMs, a good default is 100.0.
- target_util: The percentage of your endpoint's compute resources that you want to be in use at any given time. A lower value allows for more slack, which means your endpoint will be less likely to be overwhelmed if there is a sudden spike in usage. For LLMs, a good default is 0.9.
- max_workers: The maximum number of workers your endpoint can have at any one time.
- cold_workers: The minimum number of workers you want to keep "cold" (stopped and fully loaded) when your endpoint has no load.

A successful creation of the endpoint should return 'success': true in the terminal output.

Create a Workergroup

Now that we have our endpoint, we can create a {{workergroup}} with the template we prepared earlier. From the Serverless page, click '+ Workergroup' under your endpoint. Our custom vLLM (Serverless) template should already be selected; to confirm, click the edit button and check that the MODEL_NAME environment variable is filled in.

For our simple setup, we can enter the following values:

- Cold Multiplier = 3
- Minimum Load = 1
- Target Utilization = 0.9
- Workergroup Name = 'Workergroup'
- Select Endpoint = 'vLLM-Qwen3-8B'

A complete page should look like the following. After entering the values, click Create, and you will be taken back to the Serverless page. After a moment, the workergroup will be created under the 'vLLM-Qwen3-8B' endpoint.

Alternatively, run the following command to create your workergroup:

CLI Command
```
vastai create workergroup --endpoint_name "vLLM-Qwen3-8B" --template_hash "$TEMPLATE_HASH" --test_workers 5
```

- endpoint_name: The name of the endpoint.
- template_hash: The hash code of our custom vLLM (Serverless) template.
- test_workers: The minimum number of workers to create while initializing the workergroup. This allows the workergroup to get performance estimates before serving the endpoint, and also creates workers which are fully loaded and "stopped" (aka "cold").

You will need to replace "$TEMPLATE_HASH" with the template hash you copied from the template page earlier.

Once the workergroup is created, the serverless engine will automatically find offers and create instances. It may take 10-60 seconds to find appropriate GPU workers. To see the instances the system creates, click the 'View Detailed Stats' button on the workergroup. Five workers should start up, showing the 'loading' status. To see the instances the autoscaler creates from the CLI, run the following command:

CLI Command
```
vastai show instances
```
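While the workergroup provisions its first workers, you can poll that same listing from Python rather than re-running the command by hand. This is just a convenience wrapper around the CLI command above; the check for the 'loading' status is a rough heuristic, and watching the console UI works just as well.

```python
# Poll `vastai show instances` while the first workers spin up.
# This simply wraps the CLI command shown above; the substring check on
# "loading" is a rough heuristic, not an official status API.
import subprocess
import time

for _ in range(20):  # poll for up to ~5 minutes
    listing = subprocess.run(
        ["vastai", "show", "instances"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(listing)
    if listing and "loading" not in listing:
        break
    time.sleep(15)
```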
Getting Your First Ready Worker

Now that we have created both the endpoint and the workergroup, all that is left to do is wait for the first "ready" worker. We can see the status of the workers in the Serverless section of the Vast.ai console. The workers will automatically download the Qwen3-8B model defined in the template, but it will take some time to fully initialize. A worker is loaded and benchmarked when its Curr Performance value is non-zero. When a worker has finished benchmarking, its status in the workergroup will become Ready. We are now able to get a successful /route/ call to the workergroup and send it requests!

We have now successfully created a vLLM + Qwen3-8B serverless engine! It is ready to receive user requests and will automatically scale up or down to meet the request demand. In the next section, we will set up a client to test the serverless engine, and learn how to use the core serverless endpoints along the way.

Using the Serverless Engine

To fully understand this section, it is recommended to read the PyWorker overview, which shows how all the pieces related to the serverless engine work together.

The Vast vLLM (Serverless) template we used in the last section already has a client written for it. To use this client, we must run commands in a terminal, since there is no UI available for this section. The client, along with all other files the GPU worker clones during initialization, can be found in the GitHub repo here. For this section, we will need the following from the repo:

- workers/openai/
- lib/

It's recommended to simply clone the entire GitHub repo. Your client.py file should look like this:

```python
import logging
import sys
import json
import subprocess
from urllib.parse import urljoin
from typing import Dict, Any, Optional, Iterator, Union, List

import requests

from utils.endpoint_util import Endpoint
from data_types.client import CompletionConfig, ChatCompletionConfig

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s[%(levelname)-5s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger(__file__)

COMPLETIONS_PROMPT = "The capital of USA is"
CHAT_PROMPT = "Think step by step. Tell me about the Python programming language."
TOOLS_PROMPT = (
    "Can you list the files in the current working directory and tell me what "
    "you see? What do you think this directory might be for?"
)

class APIClient:
    """Lightweight client focused solely on API communication"""

    # Remove the generic worker endpoint since we're now going direct
    DEFAULT_COST = 100
    DEFAULT_TIMEOUT = 4

    def __init__(
        self,
        endpoint_group_name: str,
        api_key: str,
        server_url: str,
        endpoint_api_key: str,
    ):
        self.endpoint_group_name = endpoint_group_name
        self.api_key = api_key
        self.server_url = server_url
        self.endpoint_api_key = endpoint_api_key

    def get_worker_url(self, cost: int = DEFAULT_COST) -> Dict[str, Any]:
        """Get worker URL and auth data from routing service"""
        if not self.endpoint_api_key:
            raise ValueError("No valid endpoint API key available")

        route_payload = {
            "endpoint": self.endpoint_group_name,
            "api_key": self.endpoint_api_key,
            "cost": cost,
        }

        response = requests.post(
            urljoin(self.server_url, "/route/"),
            json=route_payload,
            timeout=self.DEFAULT_TIMEOUT,
        )
        response.raise_for_status()
        return response.json()

    def create_auth_data(self, message: Dict[str, Any]) -> Dict[str, Any]:
        """Create auth data from routing response"""
        return {
            "signature": message["signature"],
            "cost": message["cost"],
            "endpoint": message["endpoint"],
            "reqnum": message["reqnum"],
            "url": message["url"],
        }

    def make_request(
        self,
        payload: Dict[str, Any],
        endpoint: str,
        method: str = "POST",
        stream: bool = False,
    ) -> Union[Dict[str, Any], Iterator[str]]:
        """Make request directly to the specific worker endpoint"""
        # Get worker URL and auth data
        cost = payload.get("max_tokens", self.DEFAULT_COST)
        message = self.get_worker_url(cost=cost)
        worker_url = message["url"]
        auth_data = self.create_auth_data(message)

        req_data = {"payload": {"input": payload}, "auth_data": auth_data}
        url = urljoin(worker_url, endpoint)

        log.debug(f"Making direct request to {url}")
        log.debug(f"Payload: {req_data}")

        # Make the request using the specified method
        if method.upper() == "POST":
            response = requests.post(url, json=req_data, stream=stream)
        elif method.upper() == "GET":
            response = requests.get(url, params=req_data, stream=stream)
        else:
            raise ValueError(f"Unsupported HTTP method: {method}")

        response.raise_for_status()

        if stream:
            return self.handle_streaming_response(response)
        else:
            return response.json()

    def handle_streaming_response(self, response: requests.Response) -> Iterator[str]:
        """Handle streaming response and yield tokens"""
        try:
            for line in response.iter_lines(decode_unicode=True):
                if line:
                    if line.startswith("data: "):
                        data_str = line[6:]
                        if data_str.strip() == "[DONE]":
                            break
                        try:
                            data = json.loads(data_str)
                            yield data  # Yield the full chunk
                        except json.JSONDecodeError:
                            continue
        except Exception as e:
            log.error(f"Error handling streaming response: {e}")
            raise

    def call_completions(
        self, config: CompletionConfig
    ) -> Union[Dict[str, Any], Iterator[str]]:
        payload = config.to_dict()
        return self.make_request(
            payload=payload, endpoint="/v1/completions", stream=config.stream
        )

    def call_chat_completions(
        self, config: ChatCompletionConfig
    ) -> Union[Dict[str, Any], Iterator[str]]:
        payload = config.to_dict()
        return self.make_request(
            payload=payload, endpoint="/v1/chat/completions", stream=config.stream
        )


class ToolManager:
    """Handles tool definitions and execution"""

    @staticmethod
    def list_files() -> str:
        """Execute ls on current directory"""
        try:
            result = subprocess.run(
                ["ls", "-la", "."], capture_output=True, text=True, timeout=10
            )
            if result.returncode == 0:
                return result.stdout
            else:
                return f"Error: {result.stderr}"
        except Exception as e:
            return f"Error running ls: {e}"

    @staticmethod
    def get_ls_tool_definition() -> List[Dict[str, Any]]:
        """Get the ls tool definition"""
        return [
            {
                "type": "function",
                "function": {
                    "name": "list_files",
                    "description": "List files and directories in the cwd",
                    "parameters": {"type": "object", "properties": {}, "required": []},
                },
            }
        ]

    def execute_tool_call(self, tool_call: Dict[str, Any]) -> str:
        """Execute a tool call and return the result"""
        function_name = tool_call["function"]["name"]
        if function_name == "list_files":
            return self.list_files()
        else:
            raise ValueError(f"Unknown tool function: {function_name}")


class APIDemo:
    """Demo and testing functionality for the API client"""

    def __init__(
        self, client: APIClient, model: str, tool_manager: Optional[ToolManager] = None
    ):
        self.client = client
        self.model = model
        self.tool_manager = tool_manager or ToolManager()

    def handle_streaming_response(
        self, response_stream, show_reasoning: bool = True
    ) -> str:
        """
        Handle streaming chat response and display all output
        """
        full_response = ""
        reasoning_content = ""
        reasoning_started = False
        content_started = False

        for chunk in response_stream:
            # Normalize the chunk
            if isinstance(chunk, str):
                chunk = chunk.strip()
                if chunk.startswith("data: "):
                    chunk = chunk[6:].strip()
                if chunk in ["[DONE]", ""]:
                    continue
                try:
                    parsed_chunk = json.loads(chunk)
                except json.JSONDecodeError:
                    continue
            elif isinstance(chunk, dict):
                parsed_chunk = chunk
            else:
                continue

            # Parse delta from the chunk
            choices = parsed_chunk.get("choices", [])
            if not choices:
                continue
            delta = choices[0].get("delta", {})
            reasoning_token = delta.get("reasoning_content", "")
            content_token = delta.get("content", "")

            # Print reasoning token if applicable
            if show_reasoning and reasoning_token:
                if not reasoning_started:
                    print("\n🧠 Reasoning: ", end="", flush=True)
                    reasoning_started = True
                print(f"\033[90m{reasoning_token}\033[0m", end="", flush=True)
                reasoning_content += reasoning_token

            # Print content token
            if content_token:
                if not content_started:
                    if show_reasoning and reasoning_started:
                        print(f"\n💬 Response: ", end="", flush=True)
                    else:
                        print("Assistant: ", end="", flush=True)
                    content_started = True
                print(content_token, end="", flush=True)
                full_response += content_token

        print()  # Ensure newline after response

        if show_reasoning:
            if reasoning_started or content_started:
                print("\nStreaming completed.")
                if reasoning_started:
                    print(f"Reasoning tokens: {len(reasoning_content.split())}")
                if content_started:
                    print(f"Response tokens: {len(full_response.split())}")

        return full_response

    def test_tool_support(self) -> bool:
        """Test if the endpoint supports function calling"""
        log.debug("Testing endpoint tool calling support...")

        # Try a simple request with minimal tools to test support
        messages = [{"role": "user", "content": "Hello"}]
        minimal_tool = [
            {
                "type": "function",
                "function": {"name": "test_function", "description": "Test function"},
            }
        ]

        config = ChatCompletionConfig(
            model=self.model,
            messages=messages,
            max_tokens=10,
            tools=minimal_tool,
            tool_choice="none",  # Don't actually call the tool
        )

        try:
            response = self.client.call_chat_completions(config)
            return True
        except Exception as e:
            log.error(f"Error: endpoint does not support tool calling: {e}")
            return False

    def demo_completions(self) -> None:
        """Demo: test basic completions endpoint"""
        print("=" * 60)
        print("Completions Demo")
        print("=" * 60)

        config = CompletionConfig(
            model=self.model, prompt=COMPLETIONS_PROMPT, stream=False
        )

        log.info(
            f"Testing completions with model '{self.model}' and prompt '{config.prompt}'"
        )
        response = self.client.call_completions(config)

        if isinstance(response, dict):
            print("\nResponse:")
            print(json.dumps(response, indent=2))
        else:
            log.error("Unexpected response format")

    def demo_chat(self, use_streaming: bool = True) -> None:
        """
        Demo: test chat completions endpoint with optional streaming
        """
        print("=" * 60)
        print(
            f"Chat Completions Demo {'(streaming)' if use_streaming else '(non-streaming)'}"
        )
        print("=" * 60)

        config = ChatCompletionConfig(
            model=self.model,
            messages=[{"role": "user", "content": CHAT_PROMPT}],
            stream=use_streaming,
        )

        log.info(f"Testing chat completions with model '{self.model}'...")
        response = self.client.call_chat_completions(config)

        if use_streaming:
            try:
                self.handle_streaming_response(response, show_reasoning=True)
            except Exception as e:
                log.error(f"\nError during streaming: {e}")
                import traceback

                traceback.print_exc()
                return
        else:
            if isinstance(response, dict):
                choice = response.get("choices", [{}])[0]
                message = choice.get("message", {})
                content = message.get("content", "")
                reasoning = message.get("reasoning_content", "") or message.get(
                    "reasoning", ""
                )

                if reasoning:
                    print(f"\n🧠 Reasoning: \033[90m{reasoning}\033[0m")
                print(f"\n💬 Assistant: {content}")
                print(f"\nFull response:")
                print(json.dumps(response, indent=2))
            else:
                log.error("Unexpected response format")

    def demo_ls_tool(self) -> None:
        """Demo: ask LLM to list files in the current directory and describe what it sees"""
        print("=" * 60)
        print("Tool Use Demo: List Directory Contents")
        print("=" * 60)

        # Test if tools are supported first
        if not self.test_tool_support():
            return

        # Request with tool available
        messages = [{"role": "user", "content": TOOLS_PROMPT}]

        config = ChatCompletionConfig(
            model=self.model,
            messages=messages,
            tools=self.tool_manager.get_ls_tool_definition(),
            tool_choice="auto",
        )

        log.info(f"Making initial request with tool using model '{self.model}'...")
        response = self.client.call_chat_completions(config)

        if not isinstance(response, dict):
            raise ValueError("Expected dict response for tool use")

        choice = response.get("choices", [{}])[0]
        message = choice.get("message", {})

        print(f"Assistant response: {message.get('content', 'No content')}")

        # Check for tool calls
        tool_calls = message.get("tool_calls")
        if not tool_calls:
            raise ValueError(
                "No tool calls made: model may not support function calling"
            )

        print(f"Tool calls detected: {len(tool_calls)}")

        # Execute the tool call
        for tool_call in tool_calls:
            function_name = tool_call["function"]["name"]
            print(f"Executing tool: {function_name}")

            tool_result = self.tool_manager.execute_tool_call(tool_call)
            print(f"Tool result:\n{tool_result}")

            # Add tool result and continue conversation
            messages.append(message)  # Add assistant's message with tool call
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": tool_result,
                }
            )

        # Get final response
        final_config = ChatCompletionConfig(
            model=self.model,
            messages=messages,
            tools=self.tool_manager.get_ls_tool_definition(),
        )

        print("Getting final response...")
        final_response = self.client.call_chat_completions(final_config)

        if isinstance(final_response, dict):
            final_choice = final_response.get("choices", [{}])[0]
            final_message = final_choice.get("message", {})
            final_content = final_message.get("content", "")

            print("\n" + "=" * 60)
            print("Final LLM Analysis:")
            print("=" * 60)
            print(final_content)
            print("=" * 60)

    def interactive_chat(self) -> None:
        """Interactive chat session with streaming"""
        print("=" * 60)
        print("Interactive Streaming Chat")
        print("=" * 60)
        print(f"Using model: {self.model}")
        print("Type 'quit' to exit, 'clear' to clear history")
        print()

        messages = []

        while True:
            try:
                user_input = input("You: ").strip()

                if user_input.lower() == "quit":
                    print("👋 Goodbye!")
                    break
                elif user_input.lower() == "clear":
                    messages = []
                    print("Chat history cleared")
                    continue
                elif not user_input:
                    continue

                messages.append({"role": "user", "content": user_input})

                config = ChatCompletionConfig(
                    model=self.model, messages=messages, stream=True, temperature=0.7
                )

                print("Assistant: ", end="", flush=True)
                response = self.client.call_chat_completions(config)

                assistant_content = self.handle_streaming_response(
                    response, show_reasoning=True
                )

                # Add assistant response to conversation history
                messages.append({"role": "assistant", "content": assistant_content})

            except KeyboardInterrupt:
                print("\n👋 Chat interrupted. Goodbye!")
                break
            except Exception as e:
                log.error(f"\nError: {e}")
                continue


def main():
    """Main function with CLI switches for different tests"""
    from lib.test_utils import test_args

    # Add mandatory model argument
    test_args.add_argument(
        "--model", required=True, help="Model to use for requests (required)"
    )

    # Add test mode arguments
    test_args.add_argument(
        "--completion", action="store_true", help="Test completions endpoint"
    )
    test_args.add_argument(
        "--chat",
        action="store_true",
        help="Test chat completions endpoint (non-streaming)",
    )
    test_args.add_argument(
        "--chat-stream",
        action="store_true",
        help="Test chat completions endpoint with streaming",
    )
    test_args.add_argument(
        "--tools",
        action="store_true",
        help="Test function calling with ls tool (non-streaming)",
    )
    test_args.add_argument(
        "--interactive",
        action="store_true",
        help="Start interactive streaming chat session",
    )

    args = test_args.parse_args()

    # Check that only one test mode is selected
    test_modes = [
        args.completion,
        args.chat,
        args.chat_stream,
        args.tools,
        args.interactive,
    ]
    selected_count = sum(test_modes)

    if selected_count == 0:
        print("Please specify exactly one test mode:")
        print("  --completion     Test completions endpoint")
        print("  --chat           Test chat completions endpoint (non-streaming)")
        print("  --chat-stream    Test chat completions endpoint with streaming")
        print("  --tools          Test function calling with ls tool (non-streaming)")
        print("  --interactive    Start interactive streaming chat session")
        print(
            f"\nExample: python {sys.argv[0]} --model Qwen/Qwen3-8B --chat-stream -k YOUR_KEY -e YOUR_ENDPOINT"
        )
        sys.exit(1)
    elif selected_count > 1:
        print("Please specify exactly one test mode")
        sys.exit(1)

    try:
        endpoint_api_key = Endpoint.get_endpoint_api_key(
            endpoint_name=args.endpoint_group_name,
            account_api_key=args.api_key,
            instance=args.instance,
        )
        if not endpoint_api_key:
            log.error(
                f"Could not retrieve API key for endpoint '{args.endpoint_group_name}'. Exiting."
            )
            sys.exit(1)

        # Create the core API client
        client = APIClient(
            endpoint_group_name=args.endpoint_group_name,
            api_key=args.api_key,
            server_url=args.server_url,
            endpoint_api_key=endpoint_api_key,
        )

        # Create tool manager and demo (passing the model parameter)
        tool_manager = ToolManager()
        demo = APIDemo(client, args.model, tool_manager)

        print(f"Using model: {args.model}")
        print("=" * 60)

        # Run the selected test
        if args.completion:
            demo.demo_completions()
        elif args.chat:
            demo.demo_chat(use_streaming=False)
        elif args.chat_stream:
            demo.demo_chat(use_streaming=True)
        elif args.tools:
            demo.demo_ls_tool()
        elif args.interactive:
            demo.interactive_chat()

    except Exception as e:
        log.error(f"Error during test: {e}", exc_info=True)
        sys.exit(1)


if __name__ == "__main__":
    main()
```
As the user, we want all of the files under 'User' to be in our file system. The GPU workers that the system initializes will have the files and entities under 'GPU Worker'.

Files and entities for the User and GPU Worker

API Keys

Before we get started, it is important to know that upon creation of your first serverless endpoint, you will obtain a special API key specifically for serverless. This key is unique to you, and you will use it for all of your calls to the serverless engine (any request to https://run.vast.ai). This key is different from your standard user key and only works with your serverless endpoints.

When following along with the rest of this getting-started guide, keep in mind that when you see references to $YOUR_USER_API_KEY, you will want to replace that with your personal user API key, and when you see $YOUR_SERVERLESS_API_KEY, you will want to use your unique serverless API key.

Where to find your personal user API key

You can find your personal user API key in your account at https://cloud.vast.ai/manage-keys/. Your API keys may appear truncated; just click the copy button to the left to copy your full user API key.

Where to find your serverless API key

You can use the Vast CLI to find your serverless-specific API key (this assumes you have already created your first serverless endpoint).

Install the TLS Certificate [Optional]

All of Vast.ai's pre-made serverless templates use SSL by default. If you want to disable it, you can add -e USE_SSL=false to the Docker options in your copy of the template; the serverless engine will automatically adjust the instance URL to enable or disable SSL as needed.

1. Download Vast.ai's certificate from here.
2. In the Python environment where you're running the client script, execute the following command:

```
python3 -m certifi
```

3. The command in step 2 will print the path to a file where certificates are stored. Append Vast.ai's certificate to that file using the following command:

```
cat jvastai_root.cer >> path/to/cert/store
```

You may need to run the above command with sudo if you are not running Python in a virtual environment.

This process only adds Vast.ai's TLS certificate as a trusted certificate for Python clients. For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.

Running client.py

In client.py, we first send a POST request to the /route/ endpoint. This asks the serverless engine for a ready worker, with a payload that looks like:

```
route_payload = {
    "endpoint": "endpoint_name",
    "api_key": "your_serverless_api_key",
    "cost": cost,
}
```

The engine replies with a valid worker address. client.py then calls the /v1/completions endpoint on that worker, with the authentication data returned by the serverless engine and the user's model input text as the payload:

```
{
    "auth_data": {
        "cost": 256.0,
        "endpoint": "endpoint_name",
        "reqnum": req_num,
        "signature": "signature",
        "url": "worker_address"
    },
    "payload": {
        "input": {
            "model": "Qwen/Qwen3-8B",
            "prompt": "The capital of USA is",
            "temperature": 0.7,
            "max_tokens": 256,
            "top_k": 20,
            "top_p": 0.4,
            "stream": false
        }
    }
}
```

The worker hosting the Qwen3-8B model will return the model results to the client, which prints them for the user.
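If you would like to see this flow without the PyWorker client, the same two calls can be made directly with the requests library. The following is a minimal sketch based on the payloads shown above; the endpoint name, API key placeholder, and generation parameters are values you should substitute with your own, and if SSL is enabled, make sure Vast.ai's certificate is installed as described earlier.

```python
# Minimal sketch of the two-step serverless flow using requests directly.
# Substitute your own endpoint name and serverless API key; the generation
# parameters mirror the example payload above.
import requests

SERVER_URL = "https://run.vast.ai"
ENDPOINT_NAME = "vLLM-Qwen3-8B"
SERVERLESS_API_KEY = "YOUR_SERVERLESS_API_KEY"

# Step 1: ask the serverless engine to route us to a ready worker.
route = requests.post(
    f"{SERVER_URL}/route/",
    json={"endpoint": ENDPOINT_NAME, "api_key": SERVERLESS_API_KEY, "cost": 256},
    timeout=10,
)
route.raise_for_status()
msg = route.json()

# Step 2: call the worker's /v1/completions endpoint with the returned auth data.
auth_data = {k: msg[k] for k in ("signature", "cost", "endpoint", "reqnum", "url")}
payload = {
    "model": "Qwen/Qwen3-8B",
    "prompt": "The capital of USA is",
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": False,
}
resp = requests.post(
    msg["url"].rstrip("/") + "/v1/completions",
    json={"auth_data": auth_data, "payload": {"input": payload}},
)
resp.raise_for_status()
print(resp.json())
```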
To quickly run a basic test of the serverless engine with vLLM, navigate to the pyworker directory in your terminal and run:

CLI Command
```
pip install -r requirements.txt && \
python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B" --completion
```

Make sure to set the API key variable in your environment, or replace it by pasting in your actual key. You only need to install requirements.txt on your first run. This should result in your "ready" worker with the Qwen3-8B model printing a completion demo to your terminal window.

If we enter the same command without --completion, you will see all of the test modes the client supports. Depending on your model's capabilities, some modes may or may not return a good response; because we are testing with Qwen3-8B, all of these test modes should provide a response.

CLI Command
```
python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B"

Please specify exactly one test mode:
  --completion     Test completions endpoint
  --chat           Test chat completions endpoint (non-streaming)
  --chat-stream    Test chat completions endpoint with streaming
  --tools          Test function calling with ls tool (non-streaming)
  --interactive    Start interactive streaming chat session
```

Monitor Your Groups

There are several endpoints we can use to monitor the status of the serverless engine.

To fetch all endpoint logs, run the following curl command:

```
curl https://run.vast.ai/get_endpoint_logs/ \
  -X POST \
  -d '{"endpoint": "vLLM-Qwen3-8B", "api_key": "$YOUR_SERVERLESS_API_KEY"}' \
  -H 'Content-Type: application/json'
```

Similarly, to fetch all workergroup logs, execute:

```
curl https://run.vast.ai/get_workergroup_logs/ \
  -X POST \
  -d '{"id": WORKERGROUP_ID, "api_key": "$YOUR_SERVERLESS_API_KEY"}' \
  -H 'Content-Type: application/json'
```

All endpoints and workergroups continuously track their performance over time, which is sent to the serverless engine as metrics. To see workergroup metrics, run the following:

```
curl -X POST "https://console.vast.ai/api/v0/serverless/metrics/" \
  -H "Content-Type: application/json" \
  -d '{
        "start_date": 1749672382.157,
        "end_date": 1749680792.188,
        "step": 500,
        "type": "autogroup",
        "metrics": [
          "capacity",
          "curload",
          "nworkers",
          "nrdy_workers_",
          "reliable",
          "reqrate",
          "totreqs",
          "perf",
          "nrdy_soon_workers_",
          "model_disk_usage",
          "reqs_working"
        ],
        "resource_id": '"${WORKERGROUP_ID}"'
      }'
```

Load Testing

In the GitHub repo that we cloned earlier, there is a load-testing script at workers/openai/test_load.py. The -n flag indicates the total number of requests to send to the serverless engine, and the -rps flag indicates the rate (requests per second). The script prints statistics such as:

- the total number of requests currently being generated
- the number of successful generations
- the number of errors
- the total number of workers used during the test

To run this script, make sure the Python packages from requirements.txt are installed, and execute the following command:

```
python3 -m workers.openai.test_load -n 100 -rps 1 -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B"
```

This is everything you need to start, test, and monitor your vLLM + Qwen3-8B serverless engine! There are other Vast pre-made templates, like the ComfyUI image generation template, that can be set up in a similar fashion.