Autoscaler
Getting Started
Some popular templates on Vast, such as Text Generation Inference (TGI) and ComfyUI, can be run in API mode to act as inference servers for the backend of an application. Vast's autoscaling service automates instance management, performance tracking, and error handling. The autoscaler also provides authentication services to ensure that requests coming to your Vast instances are only coming from approved clients.

Note: This guide assumes knowledge of the Vast CLI; an introduction to it can be found here. We also highly recommend reading about the autoscaler architecture here before you start.

For this example, we will set up an endpoint group that uses TGI and a Llama 3 model to serve inference requests.

1) Create your Endpoint Group

To use the API endpoint, you need to create an "endpoint group" (aka endptgroup) that manages your endpoint in response to incoming load. You can do this through the GUI, or you can use the CLI. Here, we'll create an endpoint group named "TGI-Llama3":

```
vastai create endpoint --endpoint_name "TGI-Llama3" --cold_mult 1.0 --min_load 100 --target_util 0.9 --max_workers 20 --cold_workers 5
```

- "min_load": The baseline amount of load (tokens/second for LLMs) you want your autoscaling group to be able to handle. For LLMs, a good default is 100.0; for text2image, a good default is 200.0.
- "target_util": The fraction of your autogroup's compute resources that you want to be in use at any given time. A lower value allows for more slack, which means your instance group will be less likely to be overwhelmed if there is a sudden spike in usage. For LLMs, a good default is 0.9; for text2image, a good default is 0.4. ComfyUI, the backend used for text2image, does not support parallel requests, so requests are queued and handled one at a time. This means your instances can quickly build a long queue and get overwhelmed; if you want to ensure your users never experience long request times, you should leave a lot of slack.
- "cold_mult": The multiple of your current load that is used to predict your future load. For example, if you currently have 10 users but expect there to be 20 in the near future, you can set cold_mult = 2.0. This should be set to 2.0 to begin with for both LLMs and text2image.
- "max_workers": The maximum number of workers your endpoint group can have.
- "cold_workers": The minimum number of workers you want to keep "cold" (meaning stopped and fully loaded) when your group has no load. Note that this is only taken into account if you already have workers which are fully loaded but are no longer needed. A good way to ensure that you have enough fully loaded workers is to set the "test_workers" parameter of the autogroup correctly (see step 3).
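To build intuition for how these parameters interact, here is a rough, illustrative sketch of a capacity target derived from them. This is not the autoscaler's actual algorithm (which also accounts for measured per-instance performance, cold workers, and so on); it is only meant to show the roles of min_load, cold_mult, and target_util described above.

```
# Illustrative only: a rough approximation of how the endpoint group parameters
# relate to the throughput capacity the group should keep available.
def rough_capacity_target(current_load: float,
                          min_load: float = 100.0,
                          cold_mult: float = 1.0,
                          target_util: float = 0.9) -> float:
    # cold_mult scales current load into a forecast of near-future load,
    # and min_load sets a floor on that forecast.
    predicted_load = max(min_load, current_load * cold_mult)
    # A lower target_util leaves more slack capacity above the predicted load.
    return predicted_load / target_util

# Example: 500 tokens/sec of current traffic with the values used above
print(rough_capacity_target(500.0))  # ~555 tokens/sec of capacity to provision
```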
2) Prepare the Template

Templates encapsulate all the information required to run an application with the autoscaler, including machine parameters, docker image, and environment variables. For some of our popular templates, we have created autoscaler-compatible versions that allow you to serve specific models in API mode on hardware that is best suited to the specific model. The templates we offer which are pre-configured to work with the autoscaler can be found on our autoscaler templates page: https://docs.vast.ai/serverless/templates-reference. You can create an autogroup using one of those templates by specifying the template hash on autogroup creation.

Note: The public pre-configured templates should not be used as-is because they do not have the HF_TOKEN variable set. You must create a private copy of those templates with HF_TOKEN set to your Hugging Face API token and use the template hash of those private templates instead. The Hugging Face API token is needed to download gated models.

Go to the autoscaler TGI template (https://cloud.vast.ai/?template_id=dcd0920ffd9d026b7bb2d42f0d7479ba), create a new private copy of it, set HF_TOKEN to your Hugging Face API token, and set MODEL_ID to "meta-llama/Meta-Llama-3-8B-Instruct". Llama 3 is a gated model, so be sure to visit the Hugging Face model page for meta-llama/Meta-Llama-3-8B-Instruct while logged into your Hugging Face account and accept the terms and conditions of the model. Your Hugging Face API token should be a "read"-type token; a fine-grained token works as well, as long as it has the "Read access to contents of all public gated repos you can access" permission.

3) Create an Autoscaling Group

Endpoint groups consist of one or more "autoscaling groups" (aka autogroups). Autogroups describe the machine configurations and parameters that will serve the requests, and they can be fully defined by a template. Use the template hash of the template created in the previous step, and the endpoint name from step 1:

```
vastai create autogroup --endpoint_name "TGI-Llama3" --template_hash "$TEMPLATE_HASH" --test_workers 5
```

- "test_workers": The minimum number of workers to create while initializing the autogroup. This allows the autogroup to get performance estimates from machines running your configuration before deploying them to serve your endpoint. It also lets you create workers that are fully loaded and "stopped" (aka "cold") so that they can be started quickly when you introduce load to your endpoint.

Note that if you don't explicitly create an endpoint group before creating your autogroup, an endpoint group with the given name will be created in your account; you will just need to make sure that its parameters (cold_mult, min_load, target_util) are set correctly.

Once you have an autogroup to define your machine configuration and an endpoint group to define a managed endpoint for your API, Vast's autoscaling server will go to work, automatically finding offers and creating instances from them for your API endpoint. The instances the autoscaler creates will be accessible from your account and will have a tag corresponding to the name of your endpoint group.

4) Send a Request to your Endpoint Group

It might take a few minutes for your first instances to be created and for the model to be downloaded onto them. For instances with low bandwidth, it can take up to 15 minutes to download a large model such as Flux. Once an instance has fully loaded the model, you can call the /route/ endpoint to obtain the address of your API endpoint on one of your worker servers. If no workers are ready, the route endpoint will indicate the number of loading workers in the "status" field of the returned JSON. You can see what metrics are being sent to the autoscaler in the instance logs; your instance is loaded and benchmarked when the cur_perf value is non-zero.
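While you wait for your first workers to come up, you can poll /route/ directly and inspect the returned JSON. Below is a minimal sketch that uses only the request fields shown in the client code later in this guide and the "status" and "url" fields described above; other response fields may vary.

```
import requests

# Poll the autoscaler's /route/ endpoint to see whether a worker is ready.
# If a worker is ready, "url" points at its address; otherwise "status"
# reports how many workers are still loading.
resp = requests.post(
    "https://run.vast.ai/route/",
    json={
        "api_key": "YOUR_VAST_API_KEY",  # your Vast API key
        "endpoint": "TGI-Llama3",        # endpoint group name from step 1
        "cost": 256,                     # estimated number of tokens for the request
    },
)
data = resp.json()
if data.get("url"):
    print("worker ready at:", data["url"])
else:
    print("no worker ready yet:", data.get("status"))
```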
Install the TLS Certificate

All of Vast.ai's autoscaler templates use SSL by default. If you want to disable it, you can add -e USE_SSL=false to the docker options in your copy of the template; the autoscaler will automatically adjust the instance URL to enable or disable SSL as needed.

1. Download Vast.ai's certificate from here: https://console.vast.ai/static/jvastai_root.cer
2. In the Python environment where you're running the client script, execute the following command:

```
python3 -m certifi
```

3. The command in step 2 will print the path to the file where certificates are stored. Append Vast.ai's certificate to that file using the following command:

```
cat jvastai_root.cer >> path/to/cert/store
```

You may need to run the above command with sudo if you are not running Python in a virtual environment.

Note: This process only adds Vast.ai's TLS certificate as a trusted certificate for Python clients. If you need to add the certificate system-wide on Windows or macOS, follow the steps outlined here: https://docs.vast.ai/instances/templates. For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.

Client Code

Here is an example of calling the https://run.vast.ai/route/ endpoint and then forwarding a model request to the returned worker address. You can find TGI's endpoints and payload format here: https://docs.vast.ai/serverless/templates-reference

```
import requests
from typing import Any, Dict
from urllib.parse import urljoin


def get_auth_data(api_key: str, endpoint_name: str, cost: int) -> Dict[str, Any]:
    """
    Auth data sent back is in this format:
    {
        signature: str
        cost: str
        endpoint: str
        reqnum: int
        url: str
    }
    `url` is the IP address and port of the instance. The rest of the data is
    used for authentication.
    """
    response = requests.post(
        "https://run.vast.ai/route/",
        json={
            "api_key": api_key,
            "endpoint": endpoint_name,
            "cost": cost,
        },
    )
    return response.json()


def get_endpoint_group_response(
    api_key: str, endpoint_name: str, cost: int, inputs: str, parameters: Dict[str, Any]
):
    auth_data = get_auth_data(api_key, endpoint_name, cost)
    # The payload format should follow the format referenced in the "Templates Reference"
    # page in the autoscaler docs. In this example, the payload is formatted for TGI.
    payload = {"inputs": inputs, "parameters": parameters}
    # This is the format of requests for all implementations of PyWorker. auth_data is
    # always the same data returned by the autoscaler's `/route/` endpoint.
    pyworker_payload = {"auth_data": auth_data, "payload": payload}
    # Use the returned URL + your expected endpoint.
    # For TGI, `/generate` is the PyWorker endpoint for generating an LLM response.
    url = urljoin(auth_data["url"], "/generate")
    response = requests.post(url, json=pyworker_payload)
    return response.text


# Example usage

# This should be your Vast API key
api_key = "YOUR_VAST_API_KEY"
# Endpoint name from step 1
endpoint_name = "TGI-Llama3"
# Cost is the estimated number of tokens for the request.
# For TGI, a good default is max_new_tokens.
# For ComfyUI, the calculation is more complex, but a good default is 200.
cost = 256
# You will also need to provide a payload object with your endpoint's expected query parameters.
# In this example, we are using an expected payload for our TGI example in our docs.
inputs = "What is the best movie of all time?"
parameters = {"max_new_tokens": cost}

reply_from_endpoint = get_endpoint_group_response(
    api_key=api_key,
    endpoint_name=endpoint_name,
    cost=cost,
    inputs=inputs,
    parameters=parameters,
)
print("request sent to endpoint:", inputs)
print("response from endpoint:", reply_from_endpoint)
```
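If you would rather not modify certifi's certificate store, requests can also be pointed at a CA bundle explicitly via its verify argument. The following is only a sketch of an alternative to the certificate-install steps above, assuming jvastai_root.cer was downloaded next to your script; it is not part of the official client.

```
import requests

# Alternative to appending the certificate to certifi's store: use the downloaded
# Vast.ai root certificate as the CA bundle for every request made via this session.
session = requests.Session()
session.verify = "jvastai_root.cer"  # path to the certificate downloaded above

# The session is then used exactly like `requests` in the client code, e.g.:
# response = session.post(url, json=pyworker_payload)
```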
For a full working client for all backends, see PyWorker: https://github.com/vast-ai/pyworker. The client script can be found in workers/$BACKEND/client.py; you can use these scripts to test your endpoint groups. Since we have created a TGI template, we'll use the TGI client to test our endpoint. Install the requirements with pip install -r requirements.txt, and run the client:

```
python3 -m workers.tgi.client -k "$API_KEY" -e "TGI-Llama3"
```

You should get two responses printed out: the first is a synchronous, full response, and the second is streaming, where the model response is printed one token at a time.

5) Monitor your Groups

There is an endpoint on the autoscaler server that gives you access to the logs for your endpoint group and autogroups, which is described here: https://docs.vast.ai/serverless/logs. There is also an endpoint that allows you to see metrics for your groups, which is described here: https://docs.vast.ai/serverless/stats.

6) Load Testing

There is a script for each backend to load test your instances. The -n flag indicates the total number of requests to send, and the -rps flag indicates the rate (requests/second). The script will print out statistics on how many requests are being handled per minute. Install the required Python packages as in step 4, then run the following command:

```
python3 -m workers.tgi.test_load -n 1000 -rps 1 -k "$API_KEY" -e "TGI-Llama3"
```
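If you want a quick ad-hoc load test without the provided script, the sketch below reuses the get_endpoint_group_response helper and the api_key, endpoint_name, cost, inputs, and parameters variables from the client code above. The pacing and reporting here are purely illustrative and are not how workers/tgi/test_load is implemented.

```
import time
from concurrent.futures import ThreadPoolExecutor

def simple_load_test(n: int = 20, rps: float = 1.0) -> None:
    # Send `n` requests at roughly `rps` requests per second using the
    # client helper defined earlier in this guide.
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = []
        for _ in range(n):
            futures.append(
                pool.submit(
                    get_endpoint_group_response,
                    api_key=api_key,
                    endpoint_name=endpoint_name,
                    cost=cost,
                    inputs=inputs,
                    parameters=parameters,
                )
            )
            time.sleep(1.0 / rps)  # crude pacing toward the target request rate
        responses = [f.result() for f in futures]
    print(f"completed {len(responses)} requests")

simple_load_test(n=20, rps=1.0)
```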