# Inside an Autoscaler GPU
All GPU instances on the Vast Autoscaler contain three parts:

- The core ML model.
- The model server: code that handles requests and runs inference on the ML model.
- The PyWorker: server code that wraps the ML model and formats incoming HTTP requests into a format compatible with the model server.

*(Backend diagram)*

The term "backend" refers to the machine learning model itself plus the supplementary code used to make its inference work. On the Vast Autoscaler, the only way to access the ML model is through the PyWorker that wraps it. This allows the PyWorker to report accurate metrics to the Autoscaler so the system can size the number of GPU instances appropriately.

## Backend Configuration

Once a user has connected to a GPU instance on Vast, the backend starts its own launch script. The launch script will:

- Set up a log file.
- Start a webserver to communicate with the ML model and PyWorker.
- Set environment variables.
- Launch the PyWorker and create a directory for it.
- Monitor the webserver and PyWorker processes.

After launch, the PyWorker acts as an inference API server façade: it receives HTTP requests, parses them, and turns them into internal calls. The "model server" shown in the diagram above represents the inference runtime: it loads the model, exposes an interface, performs the model's forward pass, and returns the resulting tensors to the PyWorker.

## Adding Endpoints

To add an endpoint to an existing backend, follow the instructions in the [PyWorker extension guide](https://docs.vast.ai/serverless/extension-guide). This guide can also be used to write new backends.

## Authentication

The authentication information returned by [https://run.vast.ai/route/](https://docs.vast.ai/serverless/route) must be included in the request JSON sent to the PyWorker, but it is filtered out before the request is forwarded to the model server. For example, a PyWorker expects to receive auth data in the request JSON:

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256,
      "temperature": 0.7,
      "top_p": 0.9,
      "do_sample": true
    }
  }
}
```

Once authenticated, the PyWorker will forward the following to the model server:

```json
{
  "inputs": "What is the answer to the universe?",
  "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": true
  }
}
```

When Vast's autoscaling server returns a server address from the `/route/` endpoint, it provides a unique signature with your request. The authentication server verifies this signature to ensure that only authorized clients can send requests to your server.

## More Information

For more detailed information and advanced configuration, visit the [Vast PyWorker repository](https://github.com/vast-ai/pyworker/). Vast also has pre-made backends in our supported templates, which can be found in the Autoscaler section here.
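The split between the `auth_data` envelope and the `payload` described above can be sketched in a few lines of Python. This is a minimal illustration of the filtering step, not the actual PyWorker implementation; the function and constant names here are hypothetical, and real PyWorkers also verify the signature cryptographically before forwarding anything.

```python
# Sketch: separate the auth_data envelope from the payload before
# forwarding to the model server, as an Autoscaler PyWorker does.
# Field names match the request JSON shown above; split_request and
# AUTH_FIELDS are illustrative names, not part of the PyWorker API.

AUTH_FIELDS = {"signature", "cost", "endpoint", "reqnum", "url"}

def split_request(body: dict) -> dict:
    """Check the auth envelope is present, then return only the payload."""
    auth = body.get("auth_data")
    if not isinstance(auth, dict) or not AUTH_FIELDS <= auth.keys():
        raise ValueError("missing or incomplete auth_data")
    # A real PyWorker would verify auth["signature"] against the
    # autoscaling server's public key here; this sketch only checks
    # that all expected fields are present.
    return body["payload"]

request = {
    "auth_data": {
        "signature": "base64-signature-from-route-endpoint",
        "cost": 256,
        "endpoint": "Your-endpoint-name",
        "reqnum": 1234567890,
        "url": "http://worker-ip-address:port",
    },
    "payload": {
        "inputs": "What is the answer to the universe?",
        "parameters": {"max_new_tokens": 256, "temperature": 0.7},
    },
}

forwarded = split_request(request)
# The model server receives only the payload; auth_data never reaches it.
print(forwarded["inputs"])
```

Keeping authentication entirely inside the PyWorker is what lets the model server stay a plain inference runtime with no knowledge of the Autoscaler's signing scheme.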