When the serverless system recruits a GPU instance for a {{Worker_Group}}, the PyWorker on that instance begins by running a performance test to measure the GPU's maximum throughput.

LLMs

For LLMs, this test measures the maximum tokens per second that can be generated across concurrent batches.
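The sketch below illustrates the idea of measuring aggregate tokens per second across concurrent requests. It is not the PyWorker's actual benchmark code; it assumes a hypothetical OpenAI-compatible completion endpoint at `http://localhost:8000/v1/completions` and placeholder values for the model name, concurrency, and token count. The real implementation lives in backend.py.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical local inference server
MODEL = "my-model"        # placeholder model name
CONCURRENCY = 8           # number of simultaneous requests in the batch
MAX_TOKENS = 256          # tokens requested per completion

def one_request() -> int:
    """Send a single completion request and return the number of generated tokens."""
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": "Benchmark prompt.", "max_tokens": MAX_TOKENS},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

def measure_throughput() -> float:
    """Run CONCURRENCY requests in parallel and return aggregate tokens per second."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        token_counts = list(pool.map(lambda _: one_request(), range(CONCURRENCY)))
    elapsed = time.time() - start
    return sum(token_counts) / elapsed

if __name__ == "__main__":
    print(f"Measured throughput: {measure_throughput():.1f} tokens/s")
```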

Image Generation

For image generation, the model produces pixels rather than tokens, so pixel output must be converted to a token-equivalent measure. The test counts the number of 512x512 pixel grids needed to cover the image resolution and treats each grid as equivalent to 175 tokens, then adds a constant overhead of 85 tokens. This value is adjusted for the request time based on the number of diffusion steps performed, and finally normalized so that a system running Flux on a 4090 GPU achieves a standardized performance rating of 200 tokens per second.
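
The following sketch shows the grid-counting arithmetic described above. The 512x512 grid size, 175 tokens per grid, and 85-token overhead come from this section; the `BASELINE_STEPS` constant and the linear step scaling are assumptions for illustration, and the final normalization against Flux on a 4090 is handled separately in backend.py.

```python
import math

GRID_SIZE = 512          # pixels per side of one grid
TOKENS_PER_GRID = 175    # token equivalent of one 512x512 grid
OVERHEAD_TOKENS = 85     # constant per-request overhead
BASELINE_STEPS = 28      # assumed reference step count; the real value lives in backend.py

def image_token_equivalent(width: int, height: int, steps: int) -> float:
    """Approximate the token-equivalent workload of one image request."""
    # Number of 512x512 grids needed to cover the output resolution.
    grids = math.ceil(width / GRID_SIZE) * math.ceil(height / GRID_SIZE)
    tokens = grids * TOKENS_PER_GRID + OVERHEAD_TOKENS
    # Scale by diffusion steps so longer requests count as more work (assumed linear).
    return tokens * (steps / BASELINE_STEPS)

# Example: a 1024x1024 image at 28 steps -> 4 grids * 175 + 85 = 785 token-equivalents.
print(image_token_equivalent(1024, 1024, 28))
```
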
These performance tests may take several minutes to complete, depending on the machine's specifications, and progress can be monitored through the instance logs. Once the test completes, the results are saved; if the instance is rebooted, the saved results are loaded and the test does not run again. For the full implementation, see backend.py in the lib/ folder of the Vast PyWorker repository.
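
A minimal sketch of this save-and-reload behavior is shown below. The file path and function names are hypothetical; the actual persistence logic is in backend.py.

```python
import json
import os

RESULTS_PATH = "/tmp/pyworker_benchmark.json"  # hypothetical location; the real path is defined in backend.py

def load_or_run_benchmark(run_benchmark) -> dict:
    """Return cached benchmark results if present, otherwise run the test and save its output."""
    if os.path.exists(RESULTS_PATH):
        with open(RESULTS_PATH) as f:
            return json.load(f)        # reboot case: reuse the saved measurement
    results = run_benchmark()          # first boot: run the (potentially slow) test
    with open(RESULTS_PATH, "w") as f:
        json.dump(results, f)
    return results
```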