The Vast.ai Serverless engine exposes parameters that control its scaling behavior.

Endpoint Parameters

Cold Multiplier

A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. For example, if your current target capacity is 100 tokens/sec and cold_mult is 2.0, the engine will plan to have capacity for 200 tokens/sec for longer-term scenarios. This helps ensure your endpoint has sufficient “cold” (stopped but ready) workers available to handle future load spikes without delay. A higher value means more aggressive capacity planning and better preparedness for sudden traffic increases, while a lower value reduces costs from maintaining stopped instances. If not specified during endpoint creation, the default value is 3.
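
A minimal sketch of that planning arithmetic, with hypothetical names (this is not the engine's internal code):
Python
def planned_cold_capacity(current_target: float, cold_mult: float = 3.0) -> float:
    """Capacity (perf units/sec) to plan for on the 1+ hour horizon."""
    return current_target * cold_mult

# The example from the text: 100 tokens/sec with cold_mult=2.0 -> plan for 200 tokens/sec.
print(planned_cold_capacity(100.0, 2.0))  # 200.0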

Minimum Workers

The minimum number of workers that must be kept in the endpoint at all times. If not specified during endpoint creation, the default value is 5.

Max Workers

A hard upper limit on the total number of workers that the endpoint can have at any given time. If not specified during endpoint creation, the default value is 16.
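
Taken together, Minimum Workers and Max Workers bound the worker count the engine may run; a minimal sketch of that clamping, with hypothetical names:
Python
def clamp_workers(planned: int, min_workers: int = 5, max_workers: int = 16) -> int:
    """Keep the planned worker count within the endpoint's hard limits."""
    return max(min_workers, min(planned, max_workers))

print(clamp_workers(2))   # 5  (floor applied)
print(clamp_workers(40))  # 16 (ceiling applied)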

Minimum Load

A minimum baseline load (measured in perf units / second) that the serverless engine will always plan capacity for, regardless of actual measured traffic. This acts as a “floor” for load predictions across all time horizons (1 second to 24+ hours). For example, if your Minimum Load is set to 100 tokens/second but your endpoint currently has zero traffic, the serverless engine will still plan capacity as if you need to handle at least 100 tokens/second. This prevents the endpoint from scaling down to zero capacity and ensures you’re always ready for incoming requests. If not specified during endpoint creation, the default value is 1.
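
A small sketch of the floor behavior described above (illustrative names, not the engine's internal code):
Python
def effective_load(measured_load: float, min_load: float = 1.0) -> float:
    """Load (perf units/sec) the engine plans for; never below min_load."""
    return max(measured_load, min_load)

# Zero measured traffic with Minimum Load = 100 -> plan for 100 tokens/sec.
print(effective_load(0.0, 100.0))  # 100.0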

Minimum Cold Load

The minimum baseline load (measured in perf units / second) that the serverless engine will maintain with loaded workers. While Minimum Load ensures capacity from “Ready” workers, Minimum Cold Load ensures capacity from workers that have fully loaded the model. Workers that count toward this minimum are:
  • Actively serving requests (status = “Ready”)
  • Stopped but ready to serve (status = “Inactive” with model loaded)
These workers can start serving requests within seconds because they don’t need to download the model or benchmark GPU performance. This parameter is particularly useful for maintaining low-latency response times during traffic spikes or after periods of low activity. If not specified during endpoint creation, the default value is 0.
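
As an illustrative sketch, assuming worker status and model state were exposed as fields like the hypothetical ones below, counting capacity toward this minimum could look like:
Python
from dataclasses import dataclass

@dataclass
class Worker:
    status: str        # e.g. "Ready" or "Inactive" (hypothetical field)
    model_loaded: bool
    perf: float        # measured capacity in perf units/sec

def counts_toward_cold_load(w: Worker) -> bool:
    # Ready workers, or stopped workers that already hold the model.
    return w.status == "Ready" or (w.status == "Inactive" and w.model_loaded)

workers = [
    Worker("Ready", True, 50.0),
    Worker("Inactive", True, 50.0),
    Worker("Loading", False, 50.0),  # still downloading -- does not count
]
cold_capacity = sum(w.perf for w in workers if counts_toward_cold_load(w))
print(cold_capacity)  # 100.0, compared against the Minimum Cold Load floor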

Target Utilization

The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load is 900 tokens/second and target_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100 tokens/second (11%) as buffer for traffic spikes. A lower target_util means more headroom:
  • target_util = 0.9 → 11.1% spare capacity relative to load
  • target_util = 0.8 → 25% spare capacity relative to load
  • target_util = 0.5 → 100% spare capacity relative to load
  • target_util = 0.4 → 150% spare capacity relative to load
If not specified during endpoint creation, the default value is 0.9.
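
The headroom arithmetic above can be reproduced directly; a short illustrative sketch:
Python
def planned_capacity(predicted_load: float, target_util: float = 0.9) -> float:
    """Provision capacity so that predicted_load / capacity == target_util."""
    return predicted_load / target_util

def spare_fraction(target_util: float) -> float:
    """Headroom relative to load: 1 / target_util - 1."""
    return 1.0 / target_util - 1.0

print(planned_capacity(900.0, 0.9))  # 1000.0 tokens/sec, i.e. 100 tokens/sec of buffer
for u in (0.9, 0.8, 0.5, 0.4):
    print(f"target_util={u}: {spare_fraction(u):.1%} spare capacity")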

Workergroup Parameters

The parameters below apply only to Workergroups, not to Endpoints. Pre-configured serverless templates from Vast come with these values already set.

gpu_ram

The amount of GPU memory (VRAM), in gigabytes, that your model or workload requires to run. If not specified during workergroup creation, the default value is 24.

launch_args

A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates. There is no default value for launch_args.
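
Because launch_args is a command-line style string, it can be pictured as being tokenized like shell arguments before being applied. A sketch using Python's shlex; the flags shown are hypothetical placeholders, not documented options:
Python
import shlex

# Hypothetical launch_args string; the flags are placeholders,
# not documented instance-creation options.
launch_args = "--disk 64 --env '-e MODEL_NAME=my-model'"

tokens = shlex.split(launch_args)
print(tokens)  # ['--disk', '64', '--env', '-e MODEL_NAME=my-model']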

search_params

A query string, list, or dictionary that specifies the hardware and performance criteria for filtering GPU offers in the vast.ai marketplace. It uses a simple query syntax to define requirements for the machines that your Workergroup will consider when searching for workers to create. Example:
Python
{"verified": {"eq": true}, "rentable": {"eq": true}, "rented": {"eq": false}}
There is no default value for search_params. To see all available search filters, refer to the CLI documentation.
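
Since search_params also accepts a plain query string, the dictionary above can be rendered in that form. The sketch below assumes a simple field=value syntax for equality filters:
Python
# Dictionary form, matching the example above.
search_params = {"verified": {"eq": True}, "rentable": {"eq": True}, "rented": {"eq": False}}

# An assumed query-string rendering of the same constraints.
query = " ".join(f"{field}={str(spec['eq']).lower()}" for field, spec in search_params.items())
print(query)  # verified=true rentable=true rented=false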

template_hash

A unique hexadecimal identifier that references a pre-configured template containing all the configuration needed to create instances. Templates are comprehensive specifications that include the Docker image, environment variables, onstart scripts, resource requirements, and other deployment settings. There is no default value for template_hash.

template_id

A numeric (integer) identifier that uniquely references a template in the Vast.ai database. This is an alternative way to reference the same template that template_hash points to, but using the template’s database primary key instead of its hash string. There is no default value for template_id.