Autoscaler

Architecture

The Vast.ai Autoscaler manages groups of instances to efficiently serve applications, automatically scaling up or down based on load metrics defined by the Vast PyWorker. It streamlines instance management, performance measurement, and error handling.

Endpoint Groups and Autoscaling Groups

The Autoscaler is configured at two levels:

- Endpoint groups: a collection of autoscaling groups that collectively handle the same API requests.
- Autoscaling groups: the configuration of the machines running the code that serves the endpoint.

Multiple autogroups can exist within an endpoint group, with the Autoscaler monitoring performance and optimizing instance management as needed. Two-level autoscaling provides several benefits; here are some examples to illustrate:

Comparing performance metrics across hardware: Suppose you want to run the same template on different hardware to compare performance metrics. You can create several autoscaling groups, each configured to run on specific hardware. After leaving this setup running for a period of time, you can review the metrics and select the hardware best suited to your users' needs.

Smooth rollout of a new model: If you're using TGI to handle LLM inference with Llama 2 and want to transition to Llama 3, you can do so gradually. For a smooth rollout where only 10% of user requests are handled by Llama 3, create a new autoscaling group under the existing endpoint group, let it run for a while, review the metrics, and then fully switch to Llama 3 when ready.

Handling diverse workloads with multiple models: You can create an endpoint group to manage LLM inference using TGI (https://github.com/huggingface/text-generation-inference). Within this group, you can set up multiple autogroups, each using a different LLM to serve requests. This approach is useful when a few resource-intensive models handle most requests while smaller, more cost-effective models absorb overflow during workload spikes.

Note that multiple autogroups within a single endpoint group are not always necessary; for most users, a single autogroup within an endpoint group is the optimal setup. You can create autogroups using our Autoscaler-compatible templates (https://docs.vast.ai/serverless/templates-reference), which are customized versions of popular templates on Vast.

System Architecture

The system architecture for an application using Vast.ai autoscaling includes the following components:

- Autoscaler (Vast.ai)
- Load balancer (Vast.ai)
- GPU worker code (customize it using our PyWorker framework, https://github.com/vast-ai/pyworker/tree/main, and its examples)
- Application website (your responsibility)

[Autoscaler diagram]

Example Workflow for a Consumer LLM App

1. A customer initiates a request through your website.
2. Your website calls https://run.vast.ai/route/ with your endpoint name, API key, and any optional parameters (e.g., cost).
3. The /route/ endpoint (https://docs.vast.ai/serverless/route) returns a suitable worker address.
4. Your website calls the GPU worker's specific API endpoint (https://docs.vast.ai/serverless/backends), such as {worker address}/generate, passing the info returned by /route/ along with the request parameters (e.g., prompt).
5. Your website returns the results to the client's browser or handles them as needed.
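To make steps 2 through 5 concrete, here is a minimal Python sketch of the website/backend side of this flow. It is an illustration rather than an official client: the exact request and response field names used with /route/ (endpoint, api_key, cost, url, and the payload to echo back to the worker) are assumptions here, so check the /route/ documentation linked above for the authoritative schema.

```python
import requests

ROUTE_URL = "https://run.vast.ai/route/"


def generate(prompt: str, endpoint_name: str, api_key: str) -> dict:
    # Step 2: ask the Vast.ai load balancer for a suitable GPU worker.
    # The JSON field names below are illustrative; see the /route/ docs for the real schema.
    route_resp = requests.post(
        ROUTE_URL,
        json={
            "endpoint": endpoint_name,  # your endpoint group
            "api_key": api_key,         # your Vast.ai API key
            "cost": 256,                # optional parameter (see the /route/ docs)
        },
        timeout=10,
    )
    route_resp.raise_for_status()
    route_info = route_resp.json()  # worker address plus signed data to forward

    # Step 4: call the worker's own API endpoint (e.g. /generate), forwarding the
    # info returned by /route/ along with the request parameters.
    worker_url = route_info["url"].rstrip("/") + "/generate"  # "url" field name assumed
    worker_resp = requests.post(
        worker_url,
        json={**route_info, "prompt": prompt},
        timeout=60,
    )
    worker_resp.raise_for_status()
    return worker_resp.json()
```

The key design point is that the prompt and other user data travel directly from your server to the GPU worker; /route/ only sees routing metadata.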
Two-Step Routing Process

This two-step routing process is used for simplicity and flexibility. It ensures that you don't need to route all user data through a central server provided by Vast.ai, as our load balancer doesn't require those details to route your request. The /route/ endpoint signs its messages, and the corresponding public key is available at https://run.vast.ai/pubkey/, allowing the GPU worker to validate requests and prevent unauthorized usage (a minimal verification sketch is shown at the end of this section). In the future, we may add an optional proxy service to reduce this to a single step.

For a detailed walkthrough of LLM inference using Hugging Face TGI as the worker backend, refer to this guide: https://docs.vast.ai/serverless/getting-started
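The PyWorker framework performs this validation for you, but to make the signed-routing idea concrete, here is a minimal sketch of how a worker could check a signed routing payload. It assumes an RSA public key served in PEM format from https://run.vast.ai/pubkey/ and a base64-encoded SHA-256 / PKCS#1 v1.5 signature over the payload bytes; the actual key format, signature scheme, and field names are defined by the PyWorker code, so treat this only as an illustration of the technique.

```python
import base64

import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

# Fetch the published public key once at worker startup.
# (PEM format and an RSA key are assumptions for this sketch.)
PUBKEY_PEM = requests.get("https://run.vast.ai/pubkey/", timeout=10).content
PUBLIC_KEY = serialization.load_pem_public_key(PUBKEY_PEM)


def is_authorized(payload: bytes, signature_b64: str) -> bool:
    """Return True if `payload` carries a valid signature from the /route/ endpoint."""
    try:
        PUBLIC_KEY.verify(
            base64.b64decode(signature_b64),  # signature encoding is an assumption
            payload,
            padding.PKCS1v15(),               # scheme assumed; see PyWorker for the real one
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```

A worker using this check would reject any request whose routing payload was not signed by Vast.ai, which is what prevents unauthorized use of your GPU instances.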