The `vastai run benchmarks` CLI command rents one instance of each GPU class you give it, runs the template’s built-in benchmark workload, and reports performance per dollar. Benchmarks for multiple GPU classes run in parallel, and each rental tears itself down when finished.
When to Use It
- Picking which GPU class to rent for an on-demand instance, or which classes to allow in a Serverless Workergroup’s `search_params`.
- Comparing two templates (for example, vLLM vs. TGI for the same model) on the same hardware.
- Validating that a template fits and runs correctly on a GPU before committing to longer rentals or production traffic.
- Producing a perf/dollar number for capacity planning or budgeting.
Prerequisites
- A Vast.ai account with credits.
- An API key configured (see Authentication).
- The template’s hash or ID, from the Templates dashboard or `vastai search templates`.
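For a fresh environment, setup is typically a one-time key configuration followed by a template lookup; `YOUR_API_KEY` below is a placeholder:

```bash
# Store the API key so later commands authenticate automatically.
vastai set api-key YOUR_API_KEY

# Browse templates to find the hash/ID you want to benchmark.
vastai search templates
```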
Basic Usage
If you don’t pass `--gpus`, the CLI sweeps a built-in default set (RTX 5090, RTX 4090, RTX 3090, RTX A6000), which is a reasonable starting point if you’re not sure which classes to compare:
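A minimal invocation might look like the sketch below; `TEMPLATE_HASH` is a placeholder, and passing it as a positional argument is an assumption about the command syntax rather than confirmed usage.

```bash
# Sweep the default GPU set (RTX 5090/4090/3090, RTX A6000) against one
# template. Replace TEMPLATE_HASH with a hash from the Templates dashboard
# or `vastai search templates`.
vastai run benchmarks TEMPLATE_HASH
```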
To compare specific classes instead, pass `--gpus` with the GPU names you care about; each token also accepts an `Nx` prefix to request multi-GPU configurations on a per-token basis:
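For example (the exact GPU-name spelling and quoting inside `--gpus` tokens is an assumption):

```bash
# Benchmark a single RTX 4090, a 2-GPU RTX 4090 rental, and a single
# RTX A6000. The "2x" prefix sets the GPU count for that token only.
vastai run benchmarks TEMPLATE_HASH --gpus "RTX 4090" "2xRTX 4090" "RTX A6000"
```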
How It Works
For each GPU in your list, the CLI:

- Pre-flights the marketplace against the template’s `extra_filters`, skipping GPUs with no matching offers and reporting which filter excluded them.
- Creates a scratch endpoint and a one-worker Workergroup.
- Polls until the worker reaches `status=idle` with a positive `measured_perf`.
- Records the result and tears the rental down.

The rental’s actual `dph_total` is fetched at idle, so perf/dollar reflects the real run, not a marketplace estimate.
GPU count per rental comes from the `Nx` token prefix if set, otherwise `--num_gpus`, otherwise auto-sized from the template’s `gpu_total_ram` filter.
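As a sketch of that precedence (same caveats about token spelling as above):

```bash
# "2xRTX 4090" rents 2 GPUs because the Nx prefix wins; "RTX A6000" has
# no prefix, so it falls back to --num_gpus and rents 4.
vastai run benchmarks TEMPLATE_HASH --gpus "2xRTX 4090" "RTX A6000" --num_gpus 4
```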
Reading the Output
While the run is in progress, the CLI renders a live table:

| Column | Meaning |
|---|---|
| GPU | The GPU class being benchmarked. |
| Status | Current worker state: queued, provisioning, waiting_for_worker, loading, idle, done, failed, timeout, no_worker, skipped, aborted, error. |
| Endpoint | The ephemeral endpoint ID created for this run. |
| Worker | The worker (instance) ID, once one has been recruited. |
| Elapsed | Time since the worker started running. Freezes on terminal status. |
| Perf | The template’s `measured_perf` (workload-units per second; tokens/sec for typical LLMs, requests/sec when the template has no custom workload calculator). Useful for ranking GPUs on the same template, not for cross-template comparison. |
| $/hr | The rented worker’s `dph_total`, the hourly rate the contract is being billed at. |
| Perf/$/hr | Cost-efficiency score: measured performance divided by hourly price. Higher is better. |
The finished table is sorted by Perf/$/hr, so the most cost-efficient GPU for your workload is at the top. With `--raw`, the same data is emitted as JSON for scripting:
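A scripting sketch using `jq`; the output is assumed to be a JSON array, and the field names in the filter (`gpu`, `perf_per_dollar`) are guesses at the schema, so inspect the real `--raw` output before relying on them:

```bash
# Save the raw JSON, then pick the entry with the best perf per dollar.
# Field names here are assumptions about the --raw schema.
vastai run benchmarks TEMPLATE_HASH --raw > results.json
jq 'max_by(.perf_per_dollar) | {gpu, perf_per_dollar}' results.json
```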
Cost and Timeout
Each GPU rental runs for up to `--timeout` seconds (default 3600). The CLI prints what it’s about to do (the number of configurations, the GPU mix, and the per-rental timeout), then asks for confirmation. Pass `-y` to skip the prompt in scripts. Real runs almost always finish well before the timeout because the runner exits as soon as it reads a valid `measured_perf`.
You can tighten the ceiling for cheaper runs:
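For example (both flags are documented above; the 900-second ceiling is just an illustration):

```bash
# Cap each rental at 15 minutes and skip the confirmation prompt.
vastai run benchmarks TEMPLATE_HASH --timeout 900 -y
```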
Common Outcomes
| Status | What it means | What to do |
|---|---|---|
| `ok` | Benchmark completed and reported `measured_perf`. | Use the result. |
| `skipped` | No marketplace offer matched after applying the template’s filters. | The CLI prints which filter blocked the GPU. Loosen `extra_filters` on the template, or pick a different GPU. |
| `no_worker` | The autoscaler did not rent any instance within 120 seconds. | Often a scoring or template+GPU mismatch the pre-flight missed. Try a different GPU or relax filters. |
| `failed` | Workers reached a terminal state (stopped, destroying, unavail) without ever becoming idle. | Inspect worker logs in the dashboard. Common causes are model download failure or an OOM during load. |
| `timeout` | The worker was still loading or running when `--timeout` elapsed. | Increase `--timeout`, or check whether the host is unusually slow (Docker pull stalls are the typical culprit). |
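If a run ends in `failed` and you’d rather stay in the terminal than open the dashboard, a log-fetch sketch like this can help, assuming the CLI’s standard `vastai logs` command and using the ID from the Worker column of the live table:

```bash
# 1234567 is a placeholder instance ID taken from the Worker column.
vastai logs 1234567
```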
Full Flag Reference
See the `vastai run benchmarks` reference for every flag and its default.
If you’re operating a Serverless Workergroup, the autoscaler already runs this same benchmark on every worker it recruits and uses the results to drive its own GPU choices. See Automated Performance Testing for how that works.