# Managing Scale

> Learn how to configure your Serverless endpoint for different load scenarios

## Initial Rollout

The serverless engine learns the cost-vs-performance profile of each GPU class in your `search_params` from real workers running real traffic (see [Choosing GPUs](/guides/serverless/choosing-gpus)). The speed at which it "settles" into the most cost-effective mix depends on how quickly workers are recruited and released, so it helps to **apply a test load during the first day of operation** to give the engine enough signal to converge.

Best practice is to scale to **twice the number of workers you expect to need, then back down, and to repeat this cycle three separate times.**
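As a rough planning aid, the ramp pattern can be written out as a list of target worker counts. This is only a sketch: the function name, the `expected_workers` input, and the choice to come back down to the expected count (rather than to some other baseline) are assumptions made for illustration, not part of the Vast API.

```python
def rollout_schedule(expected_workers: int, low_workers: int, cycles: int = 3) -> list:
    """Return alternating peak/low worker targets for a warm-up test load.

    Hypothetical helper: the peak is twice the expected worker count, and
    `low_workers` is whatever level you choose to fall back to between peaks.
    """
    if expected_workers < 1:
        raise ValueError("expected_workers must be at least 1")
    if low_workers < 0:
        raise ValueError("low_workers must not be negative")
    peak = 2 * expected_workers
    schedule = []
    for _ in range(cycles):
        # One cycle: ramp up to the peak, then back down.
        schedule.extend([peak, low_workers])
    return schedule


# Example: 4 expected workers, falling back to 4 between peaks.
print(rollout_schedule(expected_workers=4, low_workers=4))
```

With these inputs the schedule is `[8, 4, 8, 4, 8, 4]`; how long to hold each step depends on your traffic and is left to you.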

### Simulating Load

For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository:

[https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm\_load\_example.py](https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm_load_example.py)
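If you only need the general shape of a load generator before reading the linked example, the following sketch shows concurrency-limited request traffic using the standard library alone. It does not use the Vast SDK: `send_request` is a hypothetical stand-in that just sleeps, and you would replace it with a real call to your endpoint.

```python
import asyncio
import time


async def send_request(i: int) -> float:
    """Hypothetical stand-in for one endpoint call; returns elapsed seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # simulated processing time, not a real request
    return time.perf_counter() - start


async def run_load(total_requests: int, concurrency: int) -> list:
    """Issue `total_requests` calls with at most `concurrency` in flight."""
    limiter = asyncio.Semaphore(concurrency)

    async def one(i: int) -> float:
        async with limiter:
            return await send_request(i)

    return await asyncio.gather(*(one(i) for i in range(total_requests)))


if __name__ == "__main__":
    latencies = asyncio.run(run_load(total_requests=20, concurrency=5))
    print(f"{len(latencies)} requests, max latency {max(latencies):.3f}s")
```

Raising `concurrency` over time is one simple way to push the endpoint toward a higher worker count during a test.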

## Managing for Bursty Load

* **Adjust** `min_workers`: This changes the number of managed inactive workers. Raising it increases the capacity available to absorb high peaks
* **Check** `max_workers`: Ensure this parameter is set high enough for the serverless engine to create the necessary number of workers
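One way to sanity-check `max_workers` is a back-of-the-envelope estimate of how many workers a peak would need. The sketch below assumes you have measured, in the same units, a peak load and the load a single worker can sustain; both inputs and the helper names are hypothetical and are not values the serverless engine exposes under these names.

```python
import math


def workers_needed(peak_load: float, load_per_worker: float) -> int:
    """Estimate the workers required to serve `peak_load` (same units as per-worker load)."""
    if load_per_worker <= 0:
        raise ValueError("load_per_worker must be positive")
    return math.ceil(peak_load / load_per_worker)


def max_workers_is_sufficient(peak_load: float, load_per_worker: float, max_workers: int) -> bool:
    """True if the configured max_workers covers the estimated peak requirement."""
    return max_workers >= workers_needed(peak_load, load_per_worker)


# Example: a peak of 1000 units with workers that each sustain 150 units.
print(workers_needed(1000, 150))
print(max_workers_is_sufficient(1000, 150, max_workers=5))
```

In this example the estimate is 7 workers, so a `max_workers` of 5 would cap the endpoint below its peak requirement.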

## Managing for Low Demand or Idle Periods

* **Adjust** `min_load`: Reducing `min_load` reduces the minimum number of active workers. Set it to `1` to keep a single active worker (the minimum), or to `0` to put all workers into inactive states.
* **Adjust** `min_workers`: This will change the number of managed inactive workers

## Scaling to Zero

To allow your endpoint to fully scale to zero during idle periods, configure `inactivity_timeout` alongside your other scaling parameters. The `inactivity_timeout` value (in seconds) determines how long the endpoint must be idle before scaling down is permitted.

* To scale to **zero active workers** (while keeping cold workers available): set `min_load = 0` and configure a positive `inactivity_timeout`. Workers in the `cold_workers` pool will remain available for fast reactivation.
* To scale to **zero total workers**: set `min_load = 0`, `cold_workers = 0`, and configure a positive `inactivity_timeout`. This minimizes cost during extended idle periods but incurs cold-start latency when traffic resumes.
* To **prevent** scaling to zero regardless of other settings: set `inactivity_timeout` to a negative value (e.g., `-1`).

A value of `0` for `inactivity_timeout` disables inactivity-based gating entirely; the endpoint then relies solely on normal autoscaling decisions.
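The combinations above can be summarized as a small lookup. This is an illustrative restatement of the rules on this page, not the engine's implementation; the function name and the returned strings are made up for the example.

```python
def idle_behavior(min_load: float, cold_workers: int, inactivity_timeout: int) -> str:
    """Describe what the settings allow once the endpoint goes idle (illustrative only)."""
    if inactivity_timeout < 0:
        # A negative timeout prevents scaling to zero regardless of other settings.
        return "never scales to zero"
    if inactivity_timeout == 0:
        # Zero disables inactivity-based gating.
        return "no inactivity gating; normal autoscaling decides"
    if min_load > 0:
        # A positive min_load keeps some active capacity.
        return "active workers kept to cover min_load"
    if cold_workers > 0:
        return "zero active workers; cold workers kept"
    return "zero total workers"


print(idle_behavior(min_load=0, cold_workers=2, inactivity_timeout=300))
print(idle_behavior(min_load=0, cold_workers=0, inactivity_timeout=300))
print(idle_behavior(min_load=0, cold_workers=0, inactivity_timeout=-1))
```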

## Managing Queue Time

Use `max_queue_time` and `target_queue_time` to control how the autoscaler responds to request queuing:

* **Increase** `max_queue_time` to allow more requests to buffer on each worker before the system holds them in the global queue. This is useful for workloads with predictable, longer processing times.
* **Decrease** `target_queue_time` to trigger more aggressive scale-up when queue times rise, reducing latency at the cost of potentially higher worker counts.
* **Increase** `target_queue_time` to tolerate higher queue times before scaling up, reducing costs when some latency is acceptable.
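To build intuition for how the two parameters differ, here is a toy model: `max_queue_time` decides whether a request can be buffered on a worker or must wait in the global queue, while `target_queue_time` is the threshold that prompts scale-up. The functions, the per-worker queue estimates, and the simple threshold comparisons are all assumptions for illustration; the actual autoscaler logic is not documented here and is likely more involved.

```python
from typing import Dict, Optional


def route_request(worker_queue_times: Dict[str, float], max_queue_time: float) -> Optional[str]:
    """Pick the worker with the shortest estimated queue time.

    Returns None when every worker is at or above `max_queue_time`,
    meaning the request would be held in the global queue (toy model).
    """
    if not worker_queue_times:
        return None
    worker, wait = min(worker_queue_times.items(), key=lambda item: item[1])
    return worker if wait < max_queue_time else None


def should_scale_up(observed_queue_time: float, target_queue_time: float) -> bool:
    """Toy rule: request more workers once queue time exceeds the target."""
    return observed_queue_time > target_queue_time


queues = {"worker-a": 12.0, "worker-b": 4.5}  # hypothetical estimates, in seconds
print(route_request(queues, max_queue_time=10.0))
print(route_request(queues, max_queue_time=3.0))
print(should_scale_up(observed_queue_time=4.5, target_queue_time=2.0))
```

In this model, raising `max_queue_time` lets more requests land on workers, and lowering `target_queue_time` makes `should_scale_up` fire sooner.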
