# Inside an Autoscaler GPU
All GPU instances on the Vast Autoscaler contain three parts:

- The core ML model.
- The model server: code that handles requests and runs inference on the ML model.
- The PyWorker: server code that wraps the ML model and formats incoming HTTP requests into a format compatible with the model server.

*(Backend diagram)*

The term "backend" refers to the machine learning model itself plus the supplementary code used to make its inference work. On the Vast Autoscaler, the only way to access the ML model is through the PyWorker that wraps it. This allows the PyWorker to report accurate metrics to the Autoscaler so the system can size the number of GPU instances appropriately.

## Backend Configuration

Once a user has connected to a GPU instance on Vast, the backend starts its own launch script. The launch script will:

- Set up a log file.
- Start a webserver to communicate with the ML model and PyWorker.
- Set environment variables.
- Launch the PyWorker and create a directory for it.
- Monitor the webserver and PyWorker processes.

After launch, the PyWorker acts as an inference API server façade: it receives HTTP requests, parses them, and turns them into internal calls. The "model server" shown in the diagram above represents the inference runtime: it loads the model, exposes an interface, performs the model's forward pass, and returns the resulting tensors to the PyWorker.

## Adding Endpoints

To add an endpoint to an existing backend, follow the instructions in the [PyWorker extension guide](https://docs.vast.ai/serverless/extension-guide). This guide can also be used to write new backends.

## Authentication

The authentication information returned by [https://run.vast.ai/route/](https://docs.vast.ai/serverless/route) must be included in the request JSON sent to the PyWorker, but it is filtered out before the request is forwarded to the model server. For example, a PyWorker expects to receive auth data in the request JSON:

```json
{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-endpoint-name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port"
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256,
      "temperature": 0.7,
      "top_p": 0.9,
      "do_sample": true
    }
  }
}
```

Once authenticated, the PyWorker will forward the following to the model server:

```json
{
  "inputs": "What is the answer to the universe?",
  "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": true
  }
}
```

When Vast's autoscaling server returns a server address from the `/route/` endpoint, it provides a unique signature with your request. The authentication server verifies this signature to ensure that only authorized clients can send requests to your server.

## More Information

For more detailed information and advanced configuration, visit the [Vast PyWorker repository](https://github.com/vast-ai/pyworker/). Vast also has pre-made backends in our supported templates, which can be found in the Autoscaler section here.
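The split between the `auth_data` envelope and the `payload` described above can be sketched in a few lines of Python. This is a minimal illustration of the filtering step, not the actual PyWorker implementation; the function and constant names here are hypothetical, and real PyWorkers also verify the signature cryptographically before forwarding anything.

```python
# Sketch: separate the auth_data envelope from the payload before
# forwarding to the model server, as an Autoscaler PyWorker does.
# Field names match the request JSON shown above; split_request and
# AUTH_FIELDS are illustrative names, not part of the PyWorker API.

AUTH_FIELDS = {"signature", "cost", "endpoint", "reqnum", "url"}

def split_request(body: dict) -> dict:
    """Check the auth envelope is present, then return only the payload."""
    auth = body.get("auth_data")
    if not isinstance(auth, dict) or not AUTH_FIELDS <= auth.keys():
        raise ValueError("missing or incomplete auth_data")
    # A real PyWorker would verify auth["signature"] against the
    # autoscaling server's public key here; this sketch only checks
    # that all expected fields are present.
    return body["payload"]

request = {
    "auth_data": {
        "signature": "base64-signature-from-route-endpoint",
        "cost": 256,
        "endpoint": "Your-endpoint-name",
        "reqnum": 1234567890,
        "url": "http://worker-ip-address:port",
    },
    "payload": {
        "inputs": "What is the answer to the universe?",
        "parameters": {"max_new_tokens": 256, "temperature": 0.7},
    },
}

forwarded = split_request(request)
# The model server receives only the payload; auth_data never reaches it.
print(forwarded["inputs"])
```

Keeping authentication entirely inside the PyWorker is what lets the model server stay a plain inference runtime with no knowledge of the Autoscaler's signing scheme.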