Endpoints and Worker Groups
The Serverless system is configured at two levels:
- Endpoints: The highest-level grouping of instances in the Serverless system, consisting of a named endpoint string, a collection of Worker Groups, and hyperparameters.
- Worker Groups: A lower-level organization that lives within an Endpoint. A Worker Group consists of a template (with extra filters for search), a set of GPU instances (workers) created from that template, and hyperparameters. Multiple Worker Groups can exist within an Endpoint.
This structure supports several common patterns:
- Comparing Performance Metrics Across Hardware: Suppose you want to run the same templates on different hardware to compare performance metrics. You can create several Worker Groups, each configured to run on specific hardware. By leaving this setup running for a period of time, you can review the metrics and select the most suitable hardware for your users' needs.
- Smooth Rollout of a New Model: If you're using TGI to handle LLM inference with LLama3 and want to transition to LLama4, you can do so gradually. For a smooth rollout where only 10% of user requests are handled by LLama4, create a new Worker Group under the existing Endpoint, let it run for a while, review the metrics, and then fully switch to LLama4 when ready (a configuration sketch follows this list).
- Handling Diverse Workloads with Multiple Models: You can create an Endpoint to manage LLM inference using TGI, then set up multiple Worker Groups within it, each serving requests with a different LLM. This approach is useful when a few resource-intensive models should handle most requests, while smaller, more cost-effective models absorb overflow during workload spikes.
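To make the Endpoint/Worker Group hierarchy concrete, here is a minimal sketch of the rollout scenario above as a Python structure. All field names (endpoint_name, traffic_share, and so on) are illustrative assumptions, not the actual Vast.ai API schema.

```python
# Conceptual sketch only: the field names below are illustrative, not the
# actual Vast.ai API schema. It models the Endpoint -> Worker Group
# hierarchy for the gradual LLama3 -> LLama4 rollout described above.
endpoint = {
    "endpoint_name": "llm-inference",    # named endpoint string
    "hyperparameters": {                 # endpoint-level tuning knobs (illustrative)
        "min_workers": 1,
        "max_workers": 10,
    },
    "worker_groups": [
        {
            "name": "tgi-llama3-stable",
            "template": "tgi-llama3",    # template the GPU instances are created from
            "search_filters": {"gpu_name": "RTX_4090"},  # extra filters for search
            "traffic_share": 0.90,       # hypothetical knob: 90% of requests stay on LLama3
        },
        {
            "name": "tgi-llama4-canary",
            "template": "tgi-llama4",
            "search_filters": {"gpu_name": "RTX_4090"},
            "traffic_share": 0.10,       # hypothetical knob: 10% canary traffic for LLama4
        },
    ],
}
```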
System Architecture
The system architecture for an application using Vast.ai Serverless includes the following components:
- Serverless System
- GPU Instances
- User (Client Application)

[Figure: Serverless Architecture]
Example Workflow
- A client initiates a request to the Serverless system by invoking the https://run.vast.ai/route/ endpoint.
- The Serverless system returns a suitable worker address. In the example above, this would be ip_address_2, since that GPU instance is 'Ready'.
- The client calls the GPU instance's specific API endpoint, passing the authentication info returned by /route/ along with the payload parameters.
- The PyWorker on the GPU instance receives the payload and forwards it to the ML model. After model inference, the PyWorker receives the results.
- The PyWorker sends the model results back to the client.
- Independently and concurrently, each PyWorker in the Endpoint sends its operational metrics to the Serverless system, which uses them to make scaling decisions.
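A minimal client-side sketch of this flow follows. The request and response field names ("endpoint", "api_key", "cost", "url", "auth_data") and the worker's /generate path are assumptions for illustration, not the documented schema.

```python
# Minimal sketch of the two-step client flow, using the requests library.
import requests

API_KEY = "YOUR_VAST_API_KEY"

# Step 1: ask the Serverless system for a ready worker address.
route = requests.post(
    "https://run.vast.ai/route/",
    json={"endpoint": "llm-inference", "api_key": API_KEY, "cost": 256},
).json()

# Step 2: call the GPU instance directly, forwarding the signed auth info
# from /route/ along with the model payload. The payload itself never
# passes through Vast servers.
result = requests.post(
    f"{route['url']}/generate",
    json={
        "auth_data": route["auth_data"],   # signed routing info from /route/
        "payload": {"prompt": "Hello!", "max_new_tokens": 128},
    },
).json()
print(result)
```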
Two-Step Routing Process
This two-step routing process provides both security and flexibility. Because the client sends payloads directly to the GPU instances, your payload data is never stored on Vast servers. The /route/ endpoint signs its messages; the corresponding public key, available at https://run.vast.ai/pubkey/, allows the GPU worker to validate requests and prevent unauthorized usage.
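On the worker side, validation might look like the following sketch. The signature scheme shown (RSA with PKCS#1 v1.5 and SHA-256 over the raw message bytes) is an assumption for illustration; the PyWorker source defines the exact message format.

```python
# Sketch of validating a signed /route/ message against the public key
# published at https://run.vast.ai/pubkey/. The signature scheme here is
# an assumed example, not the confirmed Vast.ai format.
import base64
import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

# Fetch and load the Serverless system's public key once at startup.
pubkey_pem = requests.get("https://run.vast.ai/pubkey/").text
public_key = serialization.load_pem_public_key(pubkey_pem.encode())

def is_authorized(message: bytes, signature_b64: str) -> bool:
    """Return True if `message` was signed by the Serverless system."""
    try:
        public_key.verify(
            base64.b64decode(signature_b64),
            message,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```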