The PyWorker - Vast.ai Documentation: Affordable GPU Cloud Marketplace

The Vast PyWorker is a Python web server designed to run alongside a machine learning model instance, providing serverless compatibility. It serves as the primary entry point for API requests, forwarding them to the model’s API hosted on the same instance. Additionally, it monitors performance metrics and estimates current workload, reporting these metrics to the serverless system.

All of Vast’s serverless templates use the Vast PyWorker. If you are using a recommended serverless template from Vast, the PyWorker is already integrated with the template and starts up automatically when a Workergroup is created.

In the diagram’s example, a user’s client wants to run inference on a machine learning model. With Vast’s Serverless setup, the request flows as follows:

The client sends a /route/ POST request to the serverless engine, asking the system for a GPU instance to send the inference request to.
The serverless system selects a ready and available worker instance from the user’s endpoint and replies with a JSON object containing the URL of the selected instance.
The client then constructs a new POST request with its payload, authentication data, and the URL of the worker instance, and sends it to the worker.
The PyWorker running on that specific instance validates the request and extracts the payload. It then sends the payload to the model inference server, which runs on the same instance as the PyWorker.
The model generates its output and returns the result to the PyWorker.
The PyWorker formats the model’s response as needed and sends the response back to the client.
Independently and concurrently, the PyWorker periodically sends its operational metrics to the serverless system, which uses them to make scaling decisions.
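
The client-side part of this flow (building the request that goes to the worker once /route/ has replied) can be sketched roughly as follows. This is a minimal illustration, not Vast's client code: the function name `build_worker_request` is hypothetical, the /generate path is taken from the TGI example on this page, and the sketch assumes that the /route/ reply is passed through unchanged as the request's `auth_data` and carries the worker address in a `url` field. No network calls are made.

```python
import json
from typing import Any


def build_worker_request(
    route_reply: dict[str, Any],
    payload: dict[str, Any],
    endpoint_path: str = "/generate",
) -> tuple[str, bytes]:
    """Return (target_url, json_body) for the POST that goes to the worker."""
    # Assumption: the /route/ reply exposes the selected worker under "url".
    worker_url = route_reply["url"]
    body = {
        # Assumption: the whole /route/ reply is forwarded as auth_data.
        "auth_data": route_reply,
        "payload": payload,
    }
    target_url = worker_url.rstrip("/") + endpoint_path
    return target_url, json.dumps(body).encode("utf-8")


# Placeholder /route/ reply; every value here is illustrative only.
route_reply = {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-TGI-Endpoint-Name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
    "request_idx": 10203040,
}
payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 16}}

target_url, body = build_worker_request(route_reply, payload)
print(target_url)
```

The returned URL and body could then be sent with any HTTP client. Keeping the request construction separate from the network call makes the step easy to test in isolation.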

The Vast PyWorker repository provides examples that show how to create a custom PyWorker for your own template and integrate it with Vast’s Serverless system. Even with a custom PyWorker, the PyWorker code runs on your Vast instance, and we automate its installation and activation during instance creation. The graphic below shows how the files and entities for the Serverless system are organized.

Integration with Model Instance

The Vast PyWorker wraps the backend code of the model instance you are running. When one of the PyWorker’s API endpoints is invoked, the PyWorker calls the corresponding backend function. For example, if you are running a text generation inference (TGI) server, your PyWorker might receive the following JSON body on its /generate endpoint:

JSON

{
  "auth_data": {
    "signature": "a_base64_encoded_signature_string_from_route_endpoint",
    "cost": 256,
    "endpoint": "Your-TGI-Endpoint-Name",
    "reqnum": 1234567890,
    "url": "http://worker-ip-address:port",
    "request_idx": 10203040
  },
  "payload": {
    "inputs": "What is the answer to the universe?",
    "parameters": {
      "max_new_tokens": 256,
      "temperature": 0.7,
      "top_p": 0.9,
      "do_sample": true
    }
  }
}

When it receives this request, your PyWorker will internally send the following to the TGI model server:

JSON

{
  "inputs": "What is the answer to the universe?",
  "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": true
  }
}

Your PyWorker would similarly receive the output from the TGI server and forward a formatted version to the client.
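
The unwrap-and-forward behaviour described above can be illustrated with a small sketch. It is not the actual PyWorker implementation: `handle_generate`, `fake_tgi_backend`, and the `{"response": ...}` wrapper are hypothetical names chosen for this example, the backend is an injected callable standing in for the HTTP call to the model server, and signature verification of `auth_data` is deliberately left out.

```python
from typing import Any, Callable

Backend = Callable[[dict[str, Any]], dict[str, Any]]


def handle_generate(request_body: dict[str, Any], call_backend: Backend) -> dict[str, Any]:
    """Check the request shape, forward only the payload, and wrap the model output."""
    # Reject requests that lack either top-level section.
    for key in ("auth_data", "payload"):
        if key not in request_body:
            raise ValueError(f"missing required field: {key}")

    # Only the payload is passed on; auth_data never reaches the model server.
    model_output = call_backend(request_body["payload"])

    # Hypothetical response format for the client.
    return {"response": model_output}


def fake_tgi_backend(payload: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for the model server; echoes the prompt instead of running a model."""
    return {"generated_text": "echo: " + payload["inputs"]}


request_body = {
    "auth_data": {"signature": "placeholder", "url": "http://worker-ip-address:port"},
    "payload": {"inputs": "What is the answer to the universe?", "parameters": {"max_new_tokens": 256}},
}
result = handle_generate(request_body, fake_tgi_backend)
print(result["response"]["generated_text"])
```

Injecting the backend as a callable keeps the validation and formatting logic testable without a running model server. In a real worker the callable would issue the request to the model API hosted on the same instance.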

