When the serverless system recruits a GPU instance for a {{Worker_Group}}, the PyWorker on that instance begins by running a performance test to measure the GPU's maximum throughput.

LLMs

For LLMs, this test measures the maximum tokens per second that can be generated across concurrent batches.
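The sketch below illustrates the idea of measuring aggregate tokens per second across concurrent requests. It is not the PyWorker's actual benchmark code; it assumes a hypothetical OpenAI-compatible completion endpoint at `http://localhost:8000/v1/completions` and placeholder values for the model name, concurrency, and token count. The real implementation lives in backend.py.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical local inference server
MODEL = "my-model"        # placeholder model name
CONCURRENCY = 8           # number of simultaneous requests in the batch
MAX_TOKENS = 256          # tokens requested per completion

def one_request() -> int:
    """Send a single completion request and return the number of generated tokens."""
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": "Benchmark prompt.", "max_tokens": MAX_TOKENS},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

def measure_throughput() -> float:
    """Run CONCURRENCY requests in parallel and return aggregate tokens per second."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        token_counts = list(pool.map(lambda _: one_request(), range(CONCURRENCY)))
    elapsed = time.time() - start
    return sum(token_counts) / elapsed

if __name__ == "__main__":
    print(f"Measured throughput: {measure_throughput():.1f} tokens/s")
```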

Image Generation

For image generation, the model produces pixels rather than tokens, so pixel output must be converted to a token-equivalent measure. The test counts the number of 512x512 pixel grids needed to cover the image resolution and treats each grid as equivalent to 175 tokens, then adds a constant overhead of 85 tokens. This value is adjusted for the request time based on the number of diffusion steps performed, and finally normalized so that a system running Flux on a 4090 GPU achieves a standardized performance rating of 200 tokens per second.
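
The following sketch shows the grid-counting arithmetic described above. The 512x512 grid size, 175 tokens per grid, and 85-token overhead come from this section; the `BASELINE_STEPS` constant and the linear step scaling are assumptions for illustration, and the final normalization against Flux on a 4090 is handled separately in backend.py.

```python
import math

GRID_SIZE = 512          # pixels per side of one grid
TOKENS_PER_GRID = 175    # token equivalent of one 512x512 grid
OVERHEAD_TOKENS = 85     # constant per-request overhead
BASELINE_STEPS = 28      # assumed reference step count; the real value lives in backend.py

def image_token_equivalent(width: int, height: int, steps: int) -> float:
    """Approximate the token-equivalent workload of one image request."""
    # Number of 512x512 grids needed to cover the output resolution.
    grids = math.ceil(width / GRID_SIZE) * math.ceil(height / GRID_SIZE)
    tokens = grids * TOKENS_PER_GRID + OVERHEAD_TOKENS
    # Scale by diffusion steps so longer requests count as more work (assumed linear).
    return tokens * (steps / BASELINE_STEPS)

# Example: a 1024x1024 image at 28 steps -> 4 grids * 175 + 85 = 785 token-equivalents.
print(image_token_equivalent(1024, 1024, 28))
```
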
These performance tests may take several minutes to complete, depending on the machine's specifications, and progress can be monitored through the instance logs. Once the test completes, the results are saved; if the instance is rebooted, the saved results are loaded and the test does not run again. For the full implementation, see backend.py in the lib/ folder of the Vast PyWorker repository.
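
A minimal sketch of this save-and-reload behavior is shown below. The file path and function names are hypothetical; the actual persistence logic is in backend.py.

```python
import json
import os

RESULTS_PATH = "/tmp/pyworker_benchmark.json"  # hypothetical location; the real path is defined in backend.py

def load_or_run_benchmark(run_benchmark) -> dict:
    """Return cached benchmark results if present, otherwise run the test and save its output."""
    if os.path.exists(RESULTS_PATH):
        with open(RESULTS_PATH) as f:
            return json.load(f)        # reboot case: reuse the saved measurement
    results = run_benchmark()          # first boot: run the (potentially slow) test
    with open(RESULTS_PATH, "w") as f:
        json.dump(results, f)
    return results
```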