Serverless Parameters
The Vast.ai serverless system has parameters that allow control over its scaling behavior.

{{endpoint}} parameters

cold_mult

A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. For example, if your current target capacity is 100 tokens/sec and cold_mult is 2.0, the system will plan to have capacity for 200 tokens/sec in longer-term scenarios. This helps ensure your endpoint has sufficient "cold" (stopped but ready) workers available to handle future load spikes without delay. A higher value means more aggressive capacity planning and better preparedness for sudden traffic increases, while a lower value reduces the cost of maintaining stopped instances. If not specified during endpoint creation, the default value is 2.5.

cold_workers

The minimum number of workers that must be kept in a "ready quick" state before the serverless engine is allowed to destroy any workers. A worker is considered "ready quick" if it is either:

- actively serving (status = "idle" with the model loaded), or
- stopped but ready (status = "stopped" with the model loaded).

Cold workers are not shut down; they are stopped but have the model fully loaded. This means they can start serving requests very quickly (within seconds) without having to re-download the model or benchmark GPU performance. If not specified during endpoint creation, the default value is 5.

max_workers

A hard upper limit on the total number of worker instances (ready, stopped, loading, etc.) that your endpoint can have at any given time. If not specified during endpoint creation, the default value is 20.

min_load

A minimum baseline load (measured in tokens/second for LLMs) that the serverless system will assume your endpoint needs to handle, regardless of actual measured traffic. This acts as a floor for load predictions across all time horizons (1 second to 24+ hours), ensuring your endpoint maintains minimum capacity even during periods of zero or very low traffic. For example, if min_load is set to 100 tokens/second but your endpoint currently has zero traffic, the serverless system will still plan capacity as if you need to handle at least 100 tokens/second. This prevents the endpoint from scaling down to zero capacity and ensures you are always ready for incoming requests. If not specified during endpoint creation, the default value is 10.

target_util

The target utilization ratio, which determines how much spare capacity (headroom) the serverless system maintains. For example, if your predicted load is 900 tokens/second and target_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100 tokens/second (about 11%) as a buffer for traffic spikes. A lower target_util means more headroom:

- target_util = 0.9 → 11.1% spare capacity relative to load
- target_util = 0.8 → 25% spare capacity relative to load
- target_util = 0.5 → 100% spare capacity relative to load
- target_util = 0.4 → 150% spare capacity relative to load

If not specified during endpoint creation, the default value is 0.9.
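Taken together, these endpoint parameters describe a simple capacity-planning rule: predicted load is floored at min_load, divided by target_util to add headroom, and multiplied by cold_mult for longer-term planning, with the resulting worker count capped at max_workers. The Python sketch below is only a rough illustration of that arithmetic, not the engine's actual implementation; the function name and the per_worker_throughput figure are hypothetical and not part of the Vast.ai API.

from math import ceil

def plan_capacity(predicted_load, min_load=10.0, target_util=0.9,
                  cold_mult=2.5, max_workers=20,
                  per_worker_throughput=100.0):
    """Illustrative capacity plan following the parameter descriptions above."""
    # min_load acts as a floor on the predicted load
    load = max(predicted_load, min_load)
    # target_util adds headroom: capacity = load / target_util
    near_term_capacity = load / target_util
    # cold_mult scales the target for longer-term (1+ hour) planning
    long_term_capacity = near_term_capacity * cold_mult
    # translate capacity into a worker count, capped by max_workers
    # (per_worker_throughput is a made-up figure; the real engine measures it)
    workers = min(ceil(long_term_capacity / per_worker_throughput), max_workers)
    return near_term_capacity, long_term_capacity, workers

# Example from the text: 900 tokens/s predicted load with the defaults
# -> near-term capacity ≈ 1000 tokens/s, longer-term ≈ 2500 tokens/s, 20 workers (capped)
print(plan_capacity(900.0))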
{{workergroup}} parameters

The following parameters can be set specifically for a workergroup, overriding the endpoint parameters of the same name. The endpoint values continue to apply to the other workergroups in that endpoint unless they are also explicitly set:

- min_load
- target_util
- cold_mult

The parameters below are specific to workergroups only, not endpoints.

gpu_ram

The amount of GPU memory (VRAM), in gigabytes, that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs. If not specified during workergroup creation, the default value is 24.

launch_args

A command-line-style string containing additional parameters for instance creation, parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what is specified in templates. There is no default value for launch_args.

search_params

A query string, list, or dictionary that specifies the hardware and performance criteria for filtering GPU offers in the Vast.ai marketplace. It uses a simple query syntax to define requirements for the machines your workergroup will consider when creating workers. Example: {"verified": {"eq": true}, "rentable": {"eq": true}, "rented": {"eq": false}}. There is no default value for search_params. To see all available search filters, see the CLI docs.

template_hash

A unique hexadecimal identifier that references a pre-configured template containing all the configuration needed to create instances. Templates are comprehensive specifications that include the Docker image, environment variables, onstart scripts, resource requirements, and other deployment settings. There is no default value for template_hash.

template_id

A numeric (integer) identifier that uniquely references a template in the Vast.ai database. This is an alternative way to reference the same template that template_hash points to, using the template's database primary key instead of its hash string. There is no default value for template_id.

test_workers

The number of different physical machines that a {{workergroup}} should test during its initial "exploration" phase to gather performance data before transitioning to normal demand-based scaling. The workergroup remains in "exploring" mode until it has successfully tested at least floor(test_workers / 2) machines. If not specified during workergroup creation, the default value is 3.
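As a concrete illustration of how these workergroup parameters fit together, the sketch below assembles them into a plain Python dictionary and computes the exploration threshold mentioned under test_workers. The dictionary layout and values are hypothetical, not a documented API payload, and the template hash is a placeholder.

import json
import math

# Illustrative workergroup configuration using the parameters described above.
workergroup_config = {
    # endpoint-level parameters overridden for this workergroup
    "min_load": 50,          # tokens/second floor for this group's load predictions
    "target_util": 0.8,      # 25% headroom relative to predicted load
    "cold_mult": 2.0,        # longer-term capacity planned at 2x the near-term target
    # workergroup-only parameters
    "gpu_ram": 48,           # model needs 48 GB of VRAM
    "template_hash": "<your-template-hash>",  # placeholder, not a real hash
    "test_workers": 5,       # machines to benchmark during the exploration phase
    "search_params": {       # marketplace filters, as in the example above
        "verified": {"eq": True},
        "rentable": {"eq": True},
        "rented": {"eq": False},
    },
}

# The workergroup stays in "exploring" mode until it has successfully tested
# at least floor(test_workers / 2) machines.
exploration_threshold = math.floor(workergroup_config["test_workers"] / 2)
print(json.dumps(workergroup_config, indent=2))
print("machines to test before leaving exploration:", exploration_threshold)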