# Getting Started With Serverless
For users not familiar with Vast.ai's serverless engine, we recommend starting with the serverless architecture documentation. It will be helpful in understanding how the system operates, processes requests, and manages resources.

## Overview & Prerequisites

Vast.ai provides pre-made serverless templates (vLLM, ComfyUI) for popular use cases that can be used with minimal setup effort. In this guide, we will set up a serverless engine to handle inference requests to a model served with vLLM, namely Qwen3-8B, using the pre-made Vast.ai vLLM serverless template. This prebuilt template bundles vLLM with scaling logic, so you don't have to write custom orchestration code. By the end of this guide, you will be able to host the Qwen3-8B model with dynamic scaling to meet your demand.

This guide assumes knowledge of the Vast CLI; an introduction to it can be found here.

Before we start, there are a few things you will need:

- A Vast.ai account with credits
- A Vast.ai API key
- A Hugging Face account with a read-access API token

## Setting Up a vLLM + Qwen3-8B Serverless Engine

### Configure User Environment Variables

Navigate to the user account settings page here and expand the "Environment Variables" tab. In the key field, add "HF_TOKEN", and in the value field add the Hugging Face read-access token. Click the "+" button to the right of the fields, then click "Save Edits".

### Step 1: Prepare a Template for Our Workers

Templates encapsulate all the information required to run an application on a GPU worker, including machine parameters, Docker image, and environment variables.

Navigate to the templates page, select the Serverless filter, and click the edit button on the 'vLLM + Qwen/Qwen3-8B (Serverless)' template. In the Environment Variables section, "Qwen/Qwen3-8B" is the default value for MODEL_NAME, but it can be changed to any vLLM-compatible model on Hugging Face. Set this template to private and click "Save & Use". The template will now work without any further edits, but it can be customized to suit specific needs. Vast recommends keeping the template private to avoid making any private information publicly known.

We should now see the Vast.ai search page with the template selected. For those intending to use the Vast CLI, click "More Options" on the template and select 'Copy Template Hash'; we will use this in step 3.

### Step 2: Create the Endpoint

Next we will create an endpoint that any user can query for generation. This can be done through the web UI or the Vast CLI. Here, we'll create an endpoint named 'vllm-qwen3-8b'.

Navigate to the Serverless page and click "Create Endpoint". A screen to create a new endpoint will pop up, with default values already assigned. Our endpoint will work with these default values, but you can change them to suit your needs (see the sketch after this list for how the scaling parameters interact):

- **Endpoint Name**: The name of the endpoint.
- **Cold Mult**: The multiple of the current load that is used to predict the future load. For example, if we currently have 10 users but expect there to be 20 in the near future, we can set Cold Mult = 2. For LLMs, a good default is 2.
- **Min Load**: The baseline amount of load (tokens/second for LLMs) we want the endpoint to be able to handle. For LLMs, a good default is 100.0.
- **Target Util**: The fraction of the endpoint's compute resources that we want to be in use at any given time. A lower value allows for more slack, which means the endpoint will be less likely to be overwhelmed if there is a sudden spike in usage. For LLMs, a good default is 0.9.
- **Max Workers**: The maximum number of workers the endpoint can have at any one time.
- **Cold Workers**: The minimum number of workers kept "cold" (stopped but fully loaded with the image) when the endpoint has no load. Having cold workers available allows the serverless system to seamlessly spin up more workers when load increases.
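To build intuition for how these scaling parameters interact, here is a minimal illustrative sketch in Python. The formula is an assumption for illustration only; the actual serverless engine uses its own internal scaling logic, which is described in the serverless architecture documentation.

```python
# Illustrative only: a simplified view of how Cold Mult, Min Load, and
# Target Util could combine into a capacity target. The real serverless
# engine's scaling logic is more sophisticated than this.

def estimated_capacity_target(current_load: float,
                              cold_mult: float = 2.0,
                              min_load: float = 100.0,
                              target_util: float = 0.9) -> float:
    """Rough estimate of the total throughput (tokens/sec) to provision."""
    predicted_load = max(current_load * cold_mult, min_load)  # anticipate near-future demand
    return predicted_load / target_util                       # keep (1 - target_util) as slack

# Example: current load of 300 tokens/sec across all users.
print(estimated_capacity_target(300))  # ~666.7 tokens/sec of provisioned capacity
```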
"cold" (meaning stopped but fully loaded with the image) when the endpoint has no load having cold workers available allows the serverless system to seamlessly spin up more workers as when load increases click create, where you will be taken back to the serverless page after a few moments, the endpoint will show up with the name 'vllm qwen3 8b ' if your machine is properly configured for the vast cli, you can run the following command cli command vastai create endpoint endpoint name "vllm qwen3 8b" cold mult 1 0 min load 100 target util 0 9 max workers 20 cold workers 5 endpoint name the name you use to identify your endpoint cold mult the multiple of your current load that is used to predict your future load for example if you currently have 10 users, but expect there to be 20 in the near future, you can set cold mult = 2 0 for llms, a good default is 2 0 min load this is the baseline amount of load (tokens / second for llms) you want your endpoint to be able to handle for llms, a good default is 100 0 target util the percentage of your endpoint compute resources that you want to be in use at any given time a lower value allows for more slack, which means your endpoint will be less likely to be overwhelmed if there is a sudden spike in usage for llms, a good default is 0 9 max workers the maximum number of workers your endpoint can have at any one time cold workers the minimum number of workers you want to keep "cold" (meaning stopped and fully loaded) when your endpoint has no load a successful creation of the endpoint should return a 'success' true as the output in the terminal create a workergroup now that we have our endpoint, we can create a {{workergroup}} with the template we prepared in step 1 from the serverless page, click '+ workergroup' under the endpoint our custom vllm (serverless) template should already be selected to confirm, click the edit button and check that the model name environment variable is filled in for our simple setup, we can enter the following values cold multiplier = 3 minimum load = 1 target utilization = 0 9 workergroup name = 'workergroup' select endpoint = 'vllm qwen3 8b ' a complete page should look like the following after entering the values, click create, where you will be taken back to the serverless page after a moment, the workergroup will be created under the 'vllm qwen3 8b ' endpoint run the following command to create your workergroup cli command vastai create workergroup endpoint name "vllm deepseek" template hash "$template hash" test workers 5 endpoint name the name of the endpoint template hash the hash code of our custom vllm (serverless) template test workers the minimum number of workers to create while initializing the workergroup this allows the workergroup to get performance estimates before serving the endpoint, and also creates workers which are fully loaded and "stopped" (aka "cold") you will need to replace "$template hash" with the template hash copied from step 1 once the workergroup is created, the serverless engine will automatically find offers and create instances this may take 10 60 seconds to find appropritate gpu workers to see the instances the system creates, click the 'view detailed stats' button on the workergroup five workers should startup, showing the 'loading' status to see the instances the autoscaler creates, run the following command cli command vastai show instances getting the first ready worker now that we have created both the endpoint and the workergroup, all that is left to do is await for the first 
"ready" worker we can see the status of the workers in the serverless section of the vast ai console the workers will automatically download the qwen3 8b model defined in the template, but it will take time to fully initialize the worker is loaded and benchmarked when the curr performance value is non zero when a worker has finished benchmarking, the worker's status in the workergroup will become ready we are now able to get a successful /route/ call to the workergroup and send it requests! we have now successfully created a vllm + qwen3 8b serverless engine! it is ready to receive user requests and will automatically scale up or down to meet the request demand in this next section, we will setup a client to test the serverless engine, and learn how to use the core serverless endpoints along the way using the serverless engine to fully understand this section, it is recommended to read the pyworker overview the overview shows how all the pieces related to the serverless engine work together the vast vllm (serverless) template we used in the last section already has a client (client py) written for it to use this client, we must run commands in a terminal, since there is no ui available for this section the client, along with all other files the gpu worker is cloning during initialization, can be found in the vast ai github repo for this section, simply clone the entire repo using git clone https //github com/vast ai/pyworker git as the user, we want all the files under 'user' to be in our file system the gpu workers that the system initializes will have the files and entities under 'gpu worker' files and entities for the user and gpu worker api keys upon creation of a serverless endpoint group, the group will obtain a special api key specifically for serverless this key is unique to an account, and will be used for all calls to the serverless engine this key is different from a standard vast ai api key and only works with serverless endpoint groups where to find a serverless api key use the vast cli to find a serverless api key cli command vastai show endpoints the show endpoints command will return a json blob like this { "api key" "952laufhuefiu2he72yhewikhf28732873827uifdhfiuh2ifh72hs80a8s728c699s9", "cold mult" 2 0, "cold workers" 3, "created at" 1755115734 0841732, "endpoint name" "vllm qwen3 8b", "endpoint state" "active", "id" 1234, "max workers" 5, "min load" 10 0, "target util" 0 9, "user id" 123456 } install the tls certificate \[optional] all of vast ai's pre made serverless templates use ssl by default if you want to disable it, you can add e use ssl=false to the docker options in your copy of the template the serverless engine will automatically adjust the instance url to enable or disable ssl as needed download vast ai's certificate from here in the python environment where you're running the client script, execute the following command python3 m certifi the command in step 2 will print the path to a file where certificates are stored append vast ai's certificate to that file using the following command cat jvastai root cer >> path/to/cert/store you may need to run the above command with sudo if you are not running python in a virtual environment this process only adds vast ai's tls certificate as a trusted certificate for python clients for non python clients, you'll need to add the certificate to the trusted certificates for that specific client if you encounter any issues, feel free to contact us on support chat for assistance running client py in client py, we are first 
To quickly run a basic test of the serverless engine with vLLM, navigate to the pyworker directory and run:

```bash
pip install -r requirements.txt && \
python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vllm-qwen3-8b" --model "Qwen/Qwen3-8B" --completion
```

client.py is configured to work with a Vast.ai API key, not a serverless API key. Make sure to set the API key variable in your environment, or replace it by pasting in your actual key. You only need to install the requirements.txt file on the first run.

This should result in a "ready" worker with the Qwen3-8B model printing a completion demo to your terminal window.

If we enter the same command without --completion, you will see all of the test modes vLLM has. Because we are testing with Qwen3-8B, all test modes will provide a response (not all LLMs are equipped to use tools).

```bash
python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vllm-qwen3-8b" --model "Qwen/Qwen3-8B"

Please specify exactly one test mode:
  --completion     Test completions endpoint
  --chat           Test chat completions endpoint (non-streaming)
  --chat-stream    Test chat completions endpoint with streaming
  --tools          Test function calling with ls tool (non-streaming)
  --interactive    Start interactive streaming chat session
```

### Monitoring Groups

There are several endpoints we can use to monitor the status of the serverless engine.

To fetch all endpoint logs, run the following curl command:

```bash
curl https://run.vast.ai/get_endpoint_logs/ \
  -X POST \
  -d '{"endpoint": "vllm-qwen3-8b", "api_key": "$YOUR_SERVERLESS_API_KEY"}' \
  -H 'Content-Type: application/json'
```

Similarly, to fetch all Workergroup logs, execute:

```bash
curl https://run.vast.ai/get_workergroup_logs/ \
  -X POST \
  -d '{"id": WORKERGROUP_ID, "api_key": "$YOUR_SERVERLESS_API_KEY"}' \
  -H 'Content-Type: application/json'
```

All endpoints and Workergroups continuously track their performance over time, which is sent to the serverless engine as metrics. To see Workergroup metrics, run the following:

```bash
curl -X POST "https://console.vast.ai/api/v0/serverless/metrics/" \
  -H "Content-Type: application/json" \
  -d '{
    "start_date": 1749672382.157,
    "end_date": 1749680792.188,
    "step": 500,
    "type": "autogroup",
    "metrics": ["capacity", "curload", "nworkers", "nrdy_workers_", "reliable", "reqrate", "totreqs", "perf", "nrdy_soon_workers_", "model_disk_usage", "reqs_working"],
    "resource_id": '"${WORKERGROUP_ID}"'
  }'
```

These metrics are displayed in a Workergroup's UI page.
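If you prefer to poll these monitoring endpoints from code, the snippet below mirrors the endpoint-logs curl above with Python's requests; the endpoint name and serverless API key are placeholders to substitute with your own values.

```python
import requests

# Mirrors the get_endpoint_logs curl command above; substitute your own
# endpoint name and serverless API key.
resp = requests.post(
    "https://run.vast.ai/get_endpoint_logs/",
    json={"endpoint": "vllm-qwen3-8b", "api_key": "YOUR_SERVERLESS_API_KEY"},
)
resp.raise_for_status()
print(resp.text)  # print whatever log output the engine returns
```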
### Load Testing

In the GitHub repo that we cloned earlier, there is a load-testing script at workers/openai/test_load.py. The -n flag indicates the total number of requests to send to the serverless engine, and the -rps flag indicates the rate (requests/second). The script will print out statistics that show metrics like:

- Total requests currently being generated
- Number of successful generations
- Number of errors
- Total number of workers used during the test

To run this script, make sure the Python packages from requirements.txt are installed, and execute the following command:

```bash
python3 -m workers.openai.test_load -n 100 -rps 1 -k "$YOUR_USER_API_KEY" -e "vllm-qwen3-8b" --model "Qwen/Qwen3-8B"
```

This is everything you need to start, test, and monitor a vLLM + Qwen3-8B serverless engine! There are other Vast pre-made templates, like the ComfyUI image generation template, that can be set up in a similar fashion.