Autoscaler Architecture
The Vast.ai Autoscaler manages groups of GPU instances to efficiently serve applications, automatically scaling up or down based on load metrics reported by the Vast PyWorker. It streamlines instance management, performance measurement, and error handling.

Endpoints and Worker Groups

The Autoscaler is configured at two levels:

- Endpoints: The highest-level clustering of instances for the Autoscaler, consisting of a named endpoint string, a collection of worker groups, and hyperparameters.
- Worker groups: A lower-level organization that lives within an endpoint. A worker group consists of a template (with extra filters for search), a set of GPU instances (workers) created from that template, and hyperparameters. Multiple worker groups can exist within an endpoint.

Two-level autoscaling provides several benefits:

- Comparing performance metrics across hardware: Suppose you want to run the same template on different hardware to compare performance metrics. You can create several worker groups, each configured to run on specific hardware. By leaving this setup running for a period of time, you can review the metrics and select the hardware best suited to your users' needs.
- Smooth rollout of a new model: If you're using TGI to handle LLM inference with Llama 3 and want to transition to Llama 4, you can do so gradually. For a smooth rollout where only 10% of user requests are handled by Llama 4, create a new worker group under the existing endpoint, let it run for a while, review the metrics, and then fully switch to Llama 4 when ready.
- Handling diverse workloads with multiple models: You can create an endpoint to manage LLM inference using TGI and, within it, set up multiple worker groups, each using a different LLM to serve requests. This approach is useful when you need a few resource-intensive models to handle most requests, while smaller, more cost-effective models manage overflow during workload spikes.

Note that multiple worker groups within a single endpoint are not always necessary; for most users, a single worker group within an endpoint is the optimal setup.

You can create worker groups using our autoscaler-compatible templates (https://docs.vast.ai/serverless/templates-reference), which are customized versions of popular templates on Vast.

System Architecture

The system architecture for an application using Vast.ai autoscaling includes the following components:

- Autoscaler
- GPU instances
- User (client application)

Example Workflow

1. A client initiates a request to the Autoscaler by invoking the https://run.vast.ai/route/ endpoint.
2. The Autoscaler returns a suitable worker address: a GPU instance that is currently "ready".
3. The client calls the GPU instance's specific API endpoint, passing the authentication info returned by /route/ along with payload parameters.
4. The PyWorker on the GPU instance receives the payload and forwards it to the ML model.
5. After model inference, the PyWorker receives the results.
6. The PyWorker sends the model results back to the client.
7. Independently and concurrently, each PyWorker in the endpoint sends its operational metrics to the Autoscaler, which the Autoscaler uses to make scaling decisions.

Two-Step Routing Process

This two-step routing process is used for security and flexibility. Because the client sends payloads directly to the GPU instances, your payload information is never stored on Vast servers.
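To make the two steps concrete, here is a minimal client sketch in Python. The request and response schemas are assumptions for illustration: the field names ("url", "auth_data"), the "/generate" route, and the exact /route/ parameters should be taken from the PyWorker reference for your template, not from this sketch.

```python
import requests

ROUTE_URL = "https://run.vast.ai/route/"

def infer(endpoint_name: str, api_key: str, payload: dict) -> dict:
    # Step 1: ask the autoscaler for a ready worker. The JSON body shown
    # here is illustrative; consult the serverless docs for the real schema.
    route = requests.post(
        ROUTE_URL,
        json={"endpoint": endpoint_name, "api_key": api_key},
        timeout=10,
    )
    route.raise_for_status()
    info = route.json()  # assumed to contain the worker URL and signed auth fields

    # Step 2: call the GPU instance directly, forwarding the auth data from
    # /route/ along with the model payload. "/generate" is a placeholder for
    # whatever route your PyWorker exposes.
    worker = requests.post(
        f"{info['url']}/generate",
        json={"auth_data": info, "payload": payload},
        timeout=120,
    )
    worker.raise_for_status()
    return worker.json()
```

Note that the payload itself travels only between the client and the GPU instance; the Autoscaler sees just the routing request.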
The /route/ endpoint signs its responses with a private key, and the corresponding public key is published at https://run.vast.ai/pubkey/, allowing the GPU worker to validate requests and prevent unauthorized usage.
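A worker can use that key to reject any request that was not issued by /route/. The sketch below assumes the key is served as PEM and that signatures are RSA with PKCS#1 v1.5 padding over SHA-256; the actual scheme is defined by the PyWorker implementation, so treat this only as an outline.

```python
import requests
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.serialization import load_pem_public_key

def fetch_autoscaler_key():
    # Fetch the autoscaler's public key once at worker startup.
    pem = requests.get("https://run.vast.ai/pubkey/", timeout=10).content
    return load_pem_public_key(pem)

def is_authentic(public_key, message: bytes, signature: bytes) -> bool:
    # Returns True only if `signature` was produced by the autoscaler's
    # private key. The padding and hash choices here are assumptions.
    try:
        public_key.verify(signature, message, padding.PKCS1v15(), hashes.SHA256())
        return True
    except Exception:
        return False
```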