Endpoint Parameters
Cold Multiplier
A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. For example, if your current target capacity is 100 tokens/sec and cold_mult is 2.0, the engine will plan to have capacity for 200 tokens/sec on that longer horizon. This helps ensure your endpoint has sufficient “cold” (stopped but ready) workers available to handle future load spikes without delay. A higher value means more aggressive capacity planning and better preparedness for sudden traffic increases, while a lower value reduces the cost of maintaining stopped instances. If not specified during endpoint creation, the default value is 3.
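For illustration, a minimal Python sketch of the arithmetic (the function and variable names are hypothetical, not part of the Vast API):

```python
def planned_cold_capacity(target_capacity: float, cold_mult: float) -> float:
    """Capacity (tokens/sec) the engine plans for on horizons of 1+ hours."""
    return target_capacity * cold_mult

# 100 tokens/sec of current target capacity with cold_mult = 2.0
# -> plan cold (stopped but ready) workers for 200 tokens/sec.
print(planned_cold_capacity(100.0, 2.0))  # 200.0
```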
Minimum Workers
The minimum number of workers that must be kept in the endpoint at all times. If not specified during endpoint creation, the default value is 5.
Max Workers
A hard upper limit on the total number of workers that the endpoint can have at any given time. If not specified during endpoint creation, the default value is 16.
Minimum Load
A minimum baseline load (measured in perf units/second) that the serverless engine will plan capacity for, regardless of actual measured traffic. This acts as a “floor” for load predictions across all time horizons (1 second to 24+ hours). For example, if your Minimum Load is set to 100 tokens/second but your endpoint currently has zero traffic, the serverless engine will still plan capacity as if you need to handle at least 100 tokens/second. This prevents the endpoint from scaling down to zero capacity and ensures you’re always ready for incoming requests. If not specified during endpoint creation, the default value is 1.
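In effect, the predicted load is clamped from below; a minimal sketch with hypothetical names:

```python
def effective_load(predicted_load: float, min_load: float = 1.0) -> float:
    """Load (perf units/sec) the engine plans for: min_load acts as a floor
    across all horizons, so capacity never drops to zero."""
    return max(predicted_load, min_load)

print(effective_load(0.0, min_load=100.0))    # 100.0 -> never scales to zero
print(effective_load(450.0, min_load=100.0))  # 450.0 -> real traffic dominates
```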
Minimum Cold Load
The minimum baseline load (measured in perf units/second) that the serverless engine will maintain with loaded workers. While Minimum Load ensures a capacity of “Ready” workers, Minimum Cold Load requires a capacity of workers that have fully loaded the model. Workers that count toward this minimum (see the sketch after this list) are:
- Actively serving requests (status = “Ready”)
- Stopped but ready to serve (status = “Inactive” with model loaded)
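As a rough illustration of the counting rule (the worker records and status strings below are illustrative, not the actual serverless data model):

```python
# Hypothetical worker records: (status, model_loaded, capacity in perf units/sec)
workers = [
    ("Ready", True, 60.0),     # actively serving -> counts
    ("Inactive", True, 50.0),  # stopped, model loaded -> counts
    ("Loading", False, 40.0),  # model not yet loaded -> does not count
]

min_cold_load = 100.0
cold_capacity = sum(
    cap for status, loaded, cap in workers
    if loaded and status in ("Ready", "Inactive")
)
print(cold_capacity >= min_cold_load)  # True: 110.0 >= 100.0
```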
Target Utilization
The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load is 900 tokens/second and target_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100 tokens/second (11.1% of load) as a buffer for traffic spikes. A lower target_util means more headroom (see the sketch after this list):
- target_util = 0.9 → 11.1% spare capacity relative to load
- target_util = 0.8 → 25% spare capacity relative to load
- target_util = 0.5 → 100% spare capacity relative to load
- target_util = 0.4 → 150% spare capacity relative to load
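A minimal sketch of this headroom arithmetic (hypothetical function names, not the engine’s actual code):

```python
def planned_capacity(predicted_load: float, target_util: float) -> float:
    """Capacity the engine provisions so that load / capacity == target_util."""
    return predicted_load / target_util

def spare_capacity_pct(target_util: float) -> float:
    """Headroom relative to load: (1 / target_util - 1) * 100."""
    return (1.0 / target_util - 1.0) * 100.0

print(planned_capacity(900.0, 0.9))  # 1000.0 tokens/sec
for u in (0.9, 0.8, 0.5, 0.4):
    print(u, round(spare_capacity_pct(u), 1))  # 11.1, 25.0, 100.0, 150.0
```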
Workergroup Parameters
The parameters below apply only to Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set.
gpu_ram
The amount of GPU memory (VRAM), in gigabytes, that your model or workload requires to run. If not specified during workergroup creation, the default value is 24.
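As a rough, hedged way to size this value (the 2 bytes/parameter figure assumes fp16/bf16 weights, and the 20% overhead factor covering activations and cache is an assumption, not a Vast-provided formula):

```python
def estimate_gpu_ram_gb(params_billions: float, bytes_per_param: float = 2.0,
                        overhead: float = 1.2) -> float:
    """Rough VRAM estimate: model weights in GB times an overhead factor."""
    return params_billions * bytes_per_param * overhead

# A 7B-parameter model in fp16: ~16.8 GB, so the default gpu_ram=24 leaves margin.
print(round(estimate_gpu_ram_gb(7), 1))  # 16.8
```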
launch_args
A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates. There is no default value for launch_args.
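For a sense of the format, a short sketch (the --disk and --env flags mirror vastai instance-creation options; treat the exact set of supported flags as an assumption and check the CLI reference):

```python
import shlex

# Hypothetical launch_args string in command-line form.
launch_args = "--disk 64 --env '-p 8000:8000 -e MODEL_NAME=llama'"

# The engine parses the string command-line style, roughly like:
print(shlex.split(launch_args))
# ['--disk', '64', '--env', '-p 8000:8000 -e MODEL_NAME=llama']
```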
search_params
A query string, list, or dictionary that specifies the hardware and performance criteria for filtering GPU offers in the vast.ai marketplace. It uses a simple query syntax to define requirements for the machines that your Workergroup will consider when searching for workers to create. Example:
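A minimal Python sketch (field names such as gpu_ram, num_gpus, and reliability follow the vast.ai offer-search syntax; the specific thresholds, and the dictionary form shown, are illustrative assumptions):

```python
# Query-string form: space-separated comparisons over offer fields.
search_params = (
    "gpu_ram>=24 num_gpus=1 "
    "reliability>=0.99 inet_down>=200 "
    "verified=True rentable=True"
)

# Equivalent dictionary form (an assumption about the accepted shape).
search_params_dict = {
    "gpu_ram": {"gte": 24},
    "num_gpus": {"eq": 1},
    "verified": {"eq": True},
}
```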
template_hash
A unique hexadecimal identifier that references a pre-configured template containing all the configuration needed to create instances. Templates are comprehensive specifications that include the Docker image, environment variables, onstart scripts, resource requirements, and other deployment settings. There is no default value for template_hash.
template_id
A numeric (integer) identifier that uniquely references a template in the Vast.ai database. This is an alternative way to reference the same template that template_hash points to, using the template’s database primary key instead of its hash string. There is no default value for template_id.
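For illustration, a workergroup configuration could reference its template either way (the payload shape and field values below are hypothetical; only the field names come from the parameters described above):

```python
# Reference the template by hash...
workergroup_by_hash = {
    "endpoint_name": "my-endpoint",
    "template_hash": "8f7e6d5c4b3a2910",  # placeholder hex identifier
    "gpu_ram": 24,
}

# ...or by its numeric database id; both point at the same template.
workergroup_by_id = {
    "endpoint_name": "my-endpoint",
    "template_id": 12345,  # placeholder primary key
    "gpu_ram": 24,
}
```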