Serverless Architecture
The Vast.ai Serverless solution manages groups of GPU instances to efficiently serve applications, automatically scaling up or down based on load metrics defined by the Vast PyWorker. It streamlines instance management, performance measurement, and error handling.

Endpoints and Worker Groups

The serverless system is configured at two levels:

- Endpoints: the highest-level grouping of instances in the serverless system. An endpoint consists of a named endpoint string, a collection of worker groups, and hyperparameters.
- Worker Groups: a lower-level organization that lives within an endpoint. A worker group consists of a template (with extra filters for search), a set of GPU instances (workers) created from that template, and hyperparameters. Multiple worker groups can exist within a single endpoint.

Two-level scaling provides several benefits:

- Comparing performance metrics across hardware. Suppose you want to run the same template on different hardware to compare performance metrics. You can create several worker groups, each configured to run on specific hardware. After leaving this setup running for a period of time, you can review the metrics and select the hardware best suited to your users' needs.
- Smooth rollout of a new model. If you're using TGI to handle LLM inference with Llama 3 and want to transition to Llama 4, you can do so gradually. For a smooth rollout in which only 10% of user requests are handled by Llama 4, create a new worker group under the existing endpoint, let it run for a while, review the metrics, and then switch fully to Llama 4 when ready.
- Handling diverse workloads with multiple models. You can create an endpoint to manage LLM inference using TGI and, within it, set up multiple worker groups, each using a different LLM to serve requests. This approach is useful when a few resource-intensive models handle most requests while smaller, more cost-effective models absorb overflow during workload spikes.

Note that multiple worker groups within a single endpoint are not always necessary; for most users, a single worker group within an endpoint is the optimal setup. You can create worker groups using our serverless-compatible templates, which are customized versions of popular Vast templates designed to run on the serverless system.

System Architecture

The system architecture for an application using Vast.ai Serverless includes the following components:

- The serverless system
- GPU instances
- The user (client application)

Example Workflow

1. A client initiates a request to the serverless system by invoking the https://run.vast.ai/route/ endpoint.
2. The serverless system returns the address of a suitable worker. In the example above, this would be IP address 2, since that GPU instance is "ready".
3. The client calls the GPU instance's specific API endpoint, passing the authentication info returned by /route/ along with the payload parameters.
4. The PyWorker on the GPU instance receives the payload and forwards it to the ML model.
5. After model inference, the PyWorker receives the results.
6. The PyWorker sends the model results back to the client.
7. Independently and concurrently, each PyWorker in the endpoint sends its operational metrics to the serverless system, which the system uses to make scaling decisions.
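To make the workflow concrete, here is a minimal client-side sketch in Python. The request and response field names ("endpoint", "api_key", "url", "auth_data", "payload") and the worker route "/generate" are illustrative assumptions, not the authoritative API schema; consult the serverless reference and your template's PyWorker for the exact contract.

```python
import requests

ROUTE_URL = "https://run.vast.ai/route/"   # serverless routing endpoint (from the docs above)
ENDPOINT_NAME = "my-endpoint"              # hypothetical endpoint name
API_KEY = "YOUR_VAST_API_KEY"              # hypothetical credential

# Step 1: ask the serverless system to route us to a ready worker.
# The JSON field names here ("endpoint", "api_key") are assumptions.
route_resp = requests.post(
    ROUTE_URL,
    json={"endpoint": ENDPOINT_NAME, "api_key": API_KEY},
    timeout=30,
)
route_resp.raise_for_status()
auth_data = route_resp.json()

# Step 2: call the chosen GPU worker directly, forwarding the signed
# auth info from /route/ along with the model payload. The "url" field
# and the "/generate" route are assumptions for illustration.
worker_resp = requests.post(
    f'{auth_data["url"]}/generate',
    json={"auth_data": auth_data, "payload": {"prompt": "Hello, world!"}},
    timeout=120,
)
worker_resp.raise_for_status()
print(worker_resp.json())
```

Note that only step 1 touches Vast's servers; the model payload in step 2 goes straight to the GPU instance, which is the basis of the routing design described next.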
Two-Step Routing Process

This two-step routing process provides both security and flexibility. Because the client sends payloads directly to the GPU instances, your payload information is never stored on Vast servers. The /route/ endpoint signs its messages, and the corresponding public key is published at https://run.vast.ai/pubkey/, allowing the GPU worker to validate requests and prevent unauthorized usage.
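For completeness, below is a sketch of how a worker might validate a signed /route/ message using the key published at https://run.vast.ai/pubkey/. The key format (PEM), signature scheme (RSA with PKCS#1 v1.5 and SHA-256), and base64 encoding of the signature are assumptions for illustration rather than the documented protocol; the serverless-compatible PyWorker templates are designed to handle this validation for you.

```python
import base64

import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.serialization import load_pem_public_key
from cryptography.hazmat.primitives.asymmetric import padding

# Fetch the routing service's public key (PEM encoding is assumed here).
pubkey_pem = requests.get("https://run.vast.ai/pubkey/", timeout=30).text
public_key = load_pem_public_key(pubkey_pem.encode())

def is_authorized(message: bytes, signature_b64: str) -> bool:
    """Return True if `message` carries a valid signature from /route/.

    The scheme (RSA + PKCS#1 v1.5 + SHA-256) and base64-encoded
    signature are assumptions for this sketch.
    """
    try:
        public_key.verify(
            base64.b64decode(signature_b64),
            message,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```

Because only the routing service holds the private key, a worker that rejects anything failing this check will only serve requests that were actually issued through /route/.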