LLMs
For LLMs, this test measures the maximum tokens per second that can be generated across concurrent batches.
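As a rough illustration of the idea (not the actual PyWorker test code), the sketch below fires a batch of generation requests concurrently and reports the aggregate tokens generated per second of wall-clock time; `generate_fn` is a hypothetical placeholder for whatever client call drives the model and returns a token count.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_llm_throughput(generate_fn, prompts, concurrency=8):
    """Illustrative only: run `generate_fn(prompt)` for each prompt across
    `concurrency` worker threads and return aggregate tokens per second.
    `generate_fn` is assumed to return the number of tokens it produced."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate_fn, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed  # tokens/s across the whole batch
```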
Image Generation
For image generation, the model produces pixels, which do not directly translate to tokens. To convert pixel output into a token equivalent, the test counts the number of 512x512 pixel grids required to cover the image resolution, with each grid counted as 175 tokens, and adds a constant overhead of 85 tokens. The result is then adjusted for the request time based on the number of diffusion steps performed, and normalized so that a system running Flux on a 4090 GPU achieves a standardized performance rating of 200 tokens per second.

These performance tests may take several minutes to complete, depending on the machine's specifications. Progress can be monitored through the instance logs. Once the test is completed, the results are saved. If the instance is rebooted, the saved results will be loaded, and the test will not run again. For more details on the full implementation, visit the Vast PyWorker repository and reference backend.py in the lib/ folder of the PyWorker.
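For a rough sense of how the pixel-to-token conversion described above works, here is a minimal sketch. The function names, the baseline step count, and the way the step adjustment and normalization are applied are assumptions for illustration, not the backend.py implementation.

```python
import math

GRID_SIZE = 512        # pixels per side of one counting grid
TOKENS_PER_GRID = 175  # token equivalent of each 512x512 grid
OVERHEAD_TOKENS = 85   # constant overhead added per image

def image_token_equivalent(width: int, height: int) -> int:
    """Token equivalent of one generated image: the number of 512x512
    grids needed to cover the resolution, at 175 tokens per grid, plus
    a constant 85-token overhead."""
    grids = math.ceil(width / GRID_SIZE) * math.ceil(height / GRID_SIZE)
    return grids * TOKENS_PER_GRID + OVERHEAD_TOKENS

def image_tokens_per_second(width: int, height: int, steps: int,
                            request_time_s: float,
                            baseline_steps: int = 28) -> float:
    """Hypothetical throughput figure: scale the token value by the ratio
    of diffusion steps to an assumed baseline (the docs only say the value
    is adjusted for request time based on the step count), then divide by
    the measured request time. A further normalization factor, chosen so
    that Flux on a 4090 scores 200 tokens/s, would be applied on top."""
    tokens = image_token_equivalent(width, height) * (steps / baseline_steps)
    return tokens / request_time_s

# Example: a 1024x1024 image covers 4 grids -> 4 * 175 + 85 = 785 tokens
print(image_token_equivalent(1024, 1024))  # 785
```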