# Autoscaler
## Getting Started
Vast.ai provides pre-made autoscaler templates for popular use cases that can be used with minimal setup effort. In this guide, we will set up an autoscaler to serve inference requests to a text generation model, namely Llama 3, using the pre-made Vast.ai TGI autoscaler template. This prebuilt template bundles TGI with autoscaling logic, so you don't have to write custom orchestration code. By the end of this guide, you will be able to host the Llama 3 model with dynamic scaling to meet your demand.

This guide assumes knowledge of the Vast CLI; an introduction to it can be found here.

Before we start, there are a few things you will need:

- A Vast.ai account with credits
- A Vast.ai API key
- A Hugging Face account with a read-access API token
- Access to the meta-llama/Meta-Llama-3-8B model, granted by accepting its terms and conditions on your Hugging Face account

## Setting Up a Text Generation Inference Autoscaler

### 1. Prepare a Template for Your Workers

Templates encapsulate all the information required to run an application with the autoscaler, including machine parameters, Docker image, and environment variables.

Navigate to the autoscaler templates page (https://cloud.vast.ai/templates/), select the Autoscaler filter, and click the edit button on the 'TGI (Autoscaler)' template.

To get this template working, we need to customize it with our own environment variables. In the Environment Variables section, paste your Hugging Face read-access API token as a string value for `HF_TOKEN`, and paste `meta-llama/Meta-Llama-3-8B-Instruct` as a string value for `MODEL_ID`.

*[Image: HF_TOKEN and MODEL_ID values added]*

The public pre-configured TGI template will not work without modification, since it does not have an `HF_TOKEN` or `MODEL_ID` variable set. With these two values added, the template will work without any further edits, but you can make changes to suit your needs.

Vast recommends keeping your template private to avoid making your `HF_TOKEN` publicly known: simply click the Private button, then the Save & Use button. You should now see the Vast.ai search page with your template selected. If you intend to use the Vast CLI, click More Options on the template and select 'Copy template hash'; we will use this in step 3.

*[Image: TGI (Autoscaler) private template selected]*

### 2. Create Your Endpoint

Next, we will create an {{endpoint}} that any user can query for Llama 3 text generation. This can be done through the web UI or the Vast CLI. Here, we'll create an endpoint named 'TGI-Llama3'.

**Web UI:** Navigate to the Autoscaler page and click Create Endpoint. A screen to create a new endpoint will pop up, with default values already assigned. Our endpoint will work with these default values, but you can change them to suit your needs.

*[Image: Endpoint creation screen with default values]*

- **Endpoint Name**: the name of the endpoint.
- **Cold Mult**: the multiple of the current load that is used to predict the future load. For example, if we currently have 10 users but expect there to be 20 in the near future, we can set `cold_mult = 2.0`. For LLMs, a good default is 2.0.
- **Min Load**: the baseline amount of load (tokens/second for LLMs) we want the endpoint to be able to handle. For LLMs, a good default is 100.0.
- **Target Util**: the fraction of the endpoint's compute resources that we want to be in use at any given time. A lower value allows for more slack, which means the endpoint will be less likely to be overwhelmed if there is a sudden spike in usage. For LLMs, a good default is 0.9.
- **Max Workers**: the maximum number of workers the endpoint can have at any one time.
- **Cold Workers**: the minimum number of workers kept "cold" (stopped and fully loaded) when the endpoint has no load.
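These knobs interact rather than acting independently. The short Python sketch below is purely illustrative and is not the autoscaler's actual scaling algorithm; it only shows how a hypothetical current load, combined with the defaults above, translates into a rough capacity target.

```python
# Illustrative only -- not the autoscaler's real algorithm. It shows how the
# parameters above relate: cold_mult inflates the measured load into a
# forecast, min_load puts a floor under it, and target_util leaves headroom.
cold_mult = 2.0       # forecast = current load * cold_mult
min_load = 100.0      # tokens/second the endpoint should always handle
target_util = 0.9     # fraction of provisioned capacity we want busy

current_load = 150.0  # hypothetical measured load, in tokens/second
forecast = max(current_load * cold_mult, min_load)
capacity_target = forecast / target_util  # provision extra slack
print(f"aim for roughly {capacity_target:.0f} tokens/second of worker throughput")
# -> aim for roughly 333 tokens/second of worker throughput
```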
Click Create, and you will be taken back to the Autoscaler page. After a few moments, the endpoint will show up with the name 'TGI-Llama3'.

**CLI:** If your machine is properly configured for the Vast CLI, you can run the following command:

```bash
vastai create endpoint --endpoint_name "TGI-Llama3" --cold_mult 1.0 --min_load 100 --target_util 0.9 --max_workers 20 --cold_workers 5
```

- **endpoint_name**: the name you use to identify your endpoint.
- **cold_mult**: the multiple of your current load that is used to predict your future load. For example, if you currently have 10 users but expect there to be 20 in the near future, you can set `cold_mult = 2.0`. For LLMs, a good default is 2.0.
- **min_load**: the baseline amount of load (tokens/second for LLMs) you want your endpoint to be able to handle. For LLMs, a good default is 100.0.
- **target_util**: the fraction of your endpoint's compute resources that you want to be in use at any given time. A lower value allows for more slack, which means your endpoint will be less likely to be overwhelmed if there is a sudden spike in usage. For LLMs, a good default is 0.9.
- **max_workers**: the maximum number of workers your endpoint can have at any one time.
- **cold_workers**: the minimum number of workers you want to keep "cold" (stopped and fully loaded) when your endpoint has no load.

A successful creation of the endpoint should return `'success': true` as the output in the terminal.

### 3. Create a Worker Group

Now that we have our endpoint, we can create a {{worker group}} with the template we prepared in step 1.

**Web UI:** From the Autoscaler page, click '+ Worker Group' under your endpoint. Our custom TGI (Autoscaler) template should already be selected; to confirm, click the edit button and check that the `HF_TOKEN` and `MODEL_ID` environment variables are filled in. For our simple setup, we can enter the following values:

- Cold Multiplier = 3
- Minimum Load = 1
- Target Utilization = 0.9
- Worker Group Name = 'worker group'
- Select Endpoint = 'TGI-Llama3'

A completed page should look like the following. After entering the values, click Create, and you will be taken back to the Autoscaler page. After a moment, the worker group will be created under the 'TGI-Llama3' endpoint.

**CLI:** Run the following command to create your worker group:

```bash
vastai create autogroup --endpoint_name "TGI-Llama3" --template_hash "$TEMPLATE_HASH" --test_workers 5
```

- **endpoint_name**: the name of the endpoint.
- **template_hash**: the hash code of our custom TGI autoscaler template.
- **test_workers**: the minimum number of workers to create while initializing the worker group. This allows the worker group to get performance estimates before serving the endpoint, and also creates workers that are fully loaded and "stopped" (aka "cold").

You will need to replace `"$TEMPLATE_HASH"` with the template hash copied in step 1, or set `TEMPLATE_HASH` as a variable in your local environment.

Once the worker group is created, Vast's autoscaling server will automatically find offers and create instances. It may take 10-60 seconds to find appropriate GPU workers. To see the instances the autoscaler creates, click the 'View Detailed Stats' button on the worker group. Five workers should start up, showing the 'loading' status.

*[Image: Five workers in the 'loading' state]*

To see the instances the autoscaler creates from the CLI, run the following command:

```bash
vastai show instances
```

The workers will automatically download the Llama 3 model, but it will take time to fully initialize. Your instance is loaded and benchmarked when its Current Performance value is non-zero.
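If you prefer to wait on readiness from a script rather than refreshing the UI, a rough sketch is shown below. It assumes your version of the Vast CLI supports a `--raw` flag for JSON output and that each instance record carries an `actual_status` field; verify both against `vastai show instances --help` and your own output before relying on it.

```python
# Rough readiness loop -- assumptions: `vastai show instances --raw` emits JSON
# and each record has an "actual_status" field (e.g. "loading"/"running").
# Double-check these against your own CLI version and output.
import json
import subprocess
import time

while True:
    raw = subprocess.run(
        ["vastai", "show", "instances", "--raw"],
        capture_output=True, text=True, check=True,
    ).stdout
    instances = json.loads(raw)
    states = [inst.get("actual_status", "unknown") for inst in instances]
    print("worker states:", states)
    if instances and all(state == "running" for state in states):
        break
    time.sleep(30)  # the model download can take a while
```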
We have now successfully created a Text Generation Inference autoscaler! It is ready to receive user requests and will automatically scale up or down to meet demand. In the next section, we will set up a client to test the autoscaler, and learn how to use the core autoscaler endpoints along the way.

## Using the Autoscaler

To fully understand this section, it is recommended to read the PyWorker overview, which shows how all the pieces of the autoscaler system work together.

The Vast TGI autoscaler template we used in the last section already has a client written for it. To use this client, we must run commands in a terminal, since there is no UI for this section. The client, along with all the other files the GPU worker clones during initialization, can be found in the GitHub repo here. For this section, we will need the following from the repo:

- `workers/tgi/client.py`
- the entire `lib/` directory from the PyWorker repo

It's recommended to simply clone the entire GitHub repo. We will not be referencing the ComfyUI or hello_world worker in this guide.

Your `client.py` file should look like this:

```python
import sys
import json
from urllib.parse import urljoin

import requests


def call_generate(endpoint_group_name: str, api_key: str, server_url: str) -> None:
    worker_endpoint = "/generate"
    cost = 100
    route_payload = {
        "endpoint": endpoint_group_name,
        "api_key": api_key,
        "cost": cost,
    }
    # Ask the autoscaler for a ready worker
    response = requests.post(
        urljoin(server_url, "/route/"),
        json=route_payload,
        timeout=4,
    )
    message = response.json()
    url = message["url"]
    auth_data = dict(
        signature=message["signature"],
        cost=message["cost"],
        endpoint=message["endpoint"],
        reqnum=message["reqnum"],
        url=message["url"],
    )
    payload = dict(inputs="tell me about cats", parameters=dict(max_new_tokens=500))
    req_data = dict(payload=payload, auth_data=auth_data)
    url = urljoin(url, worker_endpoint)
    print(f"url: {url}")
    # Call the worker's /generate endpoint with the auth data and model input
    response = requests.post(
        url,
        json=req_data,
    )
    res = response.json()
    print(res)


def call_generate_stream(endpoint_group_name: str, api_key: str, server_url: str):
    worker_endpoint = "/generate_stream"
    cost = 100
    route_payload = {
        "endpoint": endpoint_group_name,
        "api_key": api_key,
        "cost": cost,
    }
    response = requests.post(
        urljoin(server_url, "/route/"),
        json=route_payload,
        timeout=4,
    )
    message = response.json()
    url = message["url"]
    print(f"url: {url}")
    auth_data = dict(
        signature=message["signature"],
        cost=message["cost"],
        endpoint=message["endpoint"],
        reqnum=message["reqnum"],
        url=message["url"],
    )
    payload = dict(inputs="tell me about dogs", parameters=dict(max_new_tokens=500))
    req_data = dict(payload=payload, auth_data=auth_data)
    url = urljoin(url, worker_endpoint)
    # Stream the response and print it token by token
    response = requests.post(url, json=req_data, stream=True)
    for line in response.iter_lines():
        payload = line.decode().lstrip("data:").rstrip()
        if payload:
            data = json.loads(payload)
            print(data["token"]["text"], end="")
            sys.stdout.flush()
    print()


if __name__ == "__main__":
    from lib.test_utils import test_args

    args = test_args.parse_args()
    call_generate(
        api_key=args.api_key,
        endpoint_group_name=args.endpoint_group_name,
        server_url=args.server_url,
    )
    call_generate_stream(
        api_key=args.api_key,
        endpoint_group_name=args.endpoint_group_name,
        server_url=args.server_url,
    )
```

As the user, we want all the files under 'User' to be in our file system. The GPU workers that the autoscaler initializes will have the files and entities under 'GPU Worker'.

*[Image: Files and entities for the user and the GPU worker]*
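If you want to call these functions from your own code rather than through `lib.test_utils`, a minimal sketch is shown below. The endpoint name and API key are placeholders, and the `server_url` value is an assumption based on the https://run.vast.ai routes used for monitoring later in this guide; swap in whatever your setup actually uses. Run it from the root of the cloned PyWorker repo so the `workers` package resolves.

```python
# Minimal usage sketch (assumptions: placeholder API key, endpoint name
# "TGI-Llama3", and https://run.vast.ai as the autoscaler server, matching the
# monitoring routes used later in this guide).
from workers.tgi.client import call_generate, call_generate_stream

call_generate(
    endpoint_group_name="TGI-Llama3",
    api_key="YOUR_VAST_API_KEY",        # placeholder
    server_url="https://run.vast.ai",   # assumed autoscaler server URL
)
call_generate_stream(
    endpoint_group_name="TGI-Llama3",
    api_key="YOUR_VAST_API_KEY",
    server_url="https://run.vast.ai",
)
```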
### Install the TLS Certificate [Optional]

All of Vast.ai's pre-made autoscaler templates use SSL by default. If you want to disable it, you can add `-e USE_SSL=false` to the Docker options in your copy of the template; the autoscaler will automatically adjust the instance URL to enable or disable SSL as needed.

1. Download Vast.ai's certificate from here.
2. In the Python environment where you're running the client script, execute the following command:

```bash
python3 -m certifi
```

3. The command in step 2 will print the path to the file where certificates are stored. Append Vast.ai's certificate to that file:

```bash
cat jvastai_root.cer >> PATH/TO/CERT/STORE
```

You may need to run the above command with sudo if you are not running Python in a virtual environment.

Note: This process only adds Vast.ai's TLS certificate as a trusted certificate for Python clients. To add the certificate system-wide on Windows or macOS, follow the steps outlined here. For non-Python clients, you'll need to add the certificate to that client's trusted certificates. If you encounter any issues, feel free to contact us on support chat for assistance.

### Running client.py

In `client.py`, we first call the `/route/` endpoint. This sends a request to the autoscaler asking for a ready worker, with a payload that looks like:

```python
route_payload = {
    "endpoint": endpoint_name,
    "api_key": api_key,
    "cost": cost,
}
```

The autoscaler replies with a valid worker address. `client.py` then calls the `/generate` endpoint on that worker, sending the authentication data returned by the autoscaler along with the user's model input text as the payload:

```
{
    "auth_data": {
        "cost": 256.0,
        "endpoint": "{{ENDPOINT_NAME}}",
        "reqnum": {{REQ_NUM}},
        "signature": "{{SIGNATURE}}",
        "url": "{{WORKER_ADDRESS}}"
    },
    "payload": {
        "inputs": "What is the best movie of all time?",
        "parameters": {
            "max_new_tokens": 256
        }
    }
}
```

The worker hosting the Llama 3 model returns the model results to the client, which prints them for the user. `client.py` also calls the `/generate_stream` endpoint in a similar workflow, which returns the model's output as a streaming response.

To test the autoscaler, navigate to the PyWorker directory in your terminal and run:

```bash
pip install -r requirements.txt && \
python3 -m workers.tgi.client -k "$API_KEY" -e "TGI-Llama3"
```

Make sure the `API_KEY` variable is set in your environment, or replace it by pasting in your actual key. You only need to install the requirements.txt file on your first run.

Two responses should be printed out: the first is a synchronous full response, and the second is a streaming response, printed one token at a time.

### Monitor Your Groups

There are several endpoints we can use to monitor the status of the autoscaler.

To fetch all endpoint logs, run the following curl command (note that the shell will not expand `$API_KEY` inside single quotes, so either paste your actual key into the JSON body or switch to double quotes):

```bash
curl https://run.vast.ai/get_endpoint_logs/ \
  -X POST \
  -d '{"endpoint": "TGI-Llama3", "api_key": "$API_KEY"}' \
  -H 'Content-Type: application/json'
```

Similarly, to fetch all worker group logs, execute:

```bash
curl https://run.vast.ai/get_autogroup_logs/ \
  -X POST \
  -d '{"endpoint": "TGI-Llama3", "api_key": "$API_KEY"}' \
  -H 'Content-Type: application/json'
```

All endpoints and worker groups continuously track their performance over time, which is sent to the autoscaler as metrics. To see endpoint metrics, run the following:

```bash
curl https://run.vast.ai/get_endpoint_stats/ \
  -X POST \
  -d '{"endpoint": "TGI-Llama3", "api_key": "$API_KEY"}' \
  -H 'Content-Type: application/json'
```

Similarly, to see worker group metrics, execute:

```bash
curl https://run.vast.ai/get_autogroup_stats/ \
  -X POST \
  -d '{"endpoint": "TGI-Llama3", "api_key": "$API_KEY"}' \
  -H 'Content-Type: application/json'
```
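If you would rather collect these metrics from a script than from curl, the short sketch below mirrors the endpoint-stats call above. The response schema is not documented in this guide, so it simply pretty-prints whatever JSON comes back; it assumes the `API_KEY` environment variable holds your Vast.ai API key.

```python
# Mirrors the get_endpoint_stats curl command above. The response schema is not
# documented in this guide, so we just pretty-print the raw JSON.
import json
import os

import requests

resp = requests.post(
    "https://run.vast.ai/get_endpoint_stats/",
    json={"endpoint": "TGI-Llama3", "api_key": os.environ["API_KEY"]},
    timeout=10,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```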
### Load Testing

In the GitHub repo that we cloned earlier, there is a load-testing script at `workers/tgi/test_load.py`. The `-n` flag indicates the total number of requests to send to the autoscaler, and the `-rps` flag indicates the rate (requests per second). The script prints statistics that include:

- total requests currently being generated
- number of successful generations
- number of errors
- total number of workers used during the test

To run this script, make sure the Python packages from requirements.txt are installed, and execute the following command:

```bash
python3 -m workers.tgi.test_load -n 1000 -rps 1 -k "$API_KEY" -e "TGI-Llama3"
```

This is everything you need to start, test, and monitor your TGI autoscaler! There are other Vast pre-made templates, like the one for the ComfyUI image generation model, that can also be set up in a similar fashion.